References

Scalable agent alignment via reward modeling: a research direction

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, & Shane Legg (2018)

arXiv:1811.07871.

URL: https://arxiv.org/abs/1811.07871

Abstract. DeepMind's research-direction paper on scalable oversight. Introduces recursive reward modelling (RRM) as the central proposal: decompose the evaluation of a complex task into sub-tasks, each simple enough for a human to judge directly via pairwise comparisons, and train a reward model for each from those judgements; agents trained against the sub-task reward models then assist the human in evaluating the next, harder level, so the supervisory signal scales up through the decomposition rather than requiring direct human judgement of the full task. RRM is one of the canonical scalable-oversight schemes, alongside debate and weak-to-strong generalisation, and underlies OpenAI's "summarising books with human feedback" line of work.
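The step RRM builds on at the leaves is ordinary reward modelling from pairwise human comparisons. The sketch below (Python/NumPy; the linear reward model, simulated annotator, and all names are illustrative assumptions, not from the paper) fits a reward model to simulated pairwise preferences using the Bradley-Terry preference loss. The recursive part, where agents trained against such models assist human evaluation one level up, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 5                         # toy trajectory-feature dimension (illustrative)
w_true = rng.normal(size=DIM)   # hidden "true" reward, used only to simulate the human
w_model = np.zeros(DIM)         # parameters of the learned linear reward model

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated leaf-level comparisons: the annotator prefers whichever of two
# trajectories has the higher true reward. In RRM, these pairwise judgements
# are where human supervision enters.
A = rng.normal(size=(2000, DIM))
B = rng.normal(size=(2000, DIM))
prefer_a = (A @ w_true > B @ w_true).astype(float)

# Fit the reward model with the Bradley-Terry preference loss:
# P(a preferred to b) = sigmoid(r(a) - r(b)), with r(x) = w_model @ x.
diffs = A - B
lr = 0.5
for _ in range(500):
    p = sigmoid(diffs @ w_model)
    w_model -= lr * diffs.T @ (p - prefer_a) / len(prefer_a)

# The learned model should now rank unseen pairs the way the annotator would,
# letting it stand in for direct human judgement at the next level up.
A_test = rng.normal(size=(500, DIM))
B_test = rng.normal(size=(500, DIM))
agree = np.mean((A_test @ w_model > B_test @ w_model)
                == (A_test @ w_true > B_test @ w_true))
print(f"held-out ranking agreement: {agree:.1%}")
```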

Tags: alignment safety scalable-oversight
