References

Scalable agent alignment via reward modeling: a research direction

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, & Shane Legg (2018)

arXiv:1811.07871.

URL: https://arxiv.org/abs/1811.07871

Abstract. DeepMind's research-direction paper on scalable oversight. Introduces recursive reward modelling (RRM) as the central proposal: decompose the evaluation of a complex task into sub-tasks, each simple enough for a human to judge directly via pairwise comparisons, and train a reward model for each from those judgements; agents trained against the sub-task reward models then assist the human in evaluating the next, harder level, so the supervisory signal scales up through the decomposition rather than requiring direct human judgement of the full task. RRM is one of the canonical scalable-oversight schemes, alongside debate and weak-to-strong generalisation, and underlies OpenAI's "summarising books with human feedback" line of work.
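The step RRM builds on at the leaves is ordinary reward modelling from pairwise human comparisons. The sketch below (Python/NumPy; the linear reward model, simulated annotator, and all names are illustrative assumptions, not from the paper) fits a reward model to simulated pairwise preferences using the Bradley-Terry preference loss. The recursive part, where agents trained against such models assist human evaluation one level up, is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 5                         # toy trajectory-feature dimension (illustrative)
w_true = rng.normal(size=DIM)   # hidden "true" reward, used only to simulate the human
w_model = np.zeros(DIM)         # parameters of the learned linear reward model

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated leaf-level comparisons: the annotator prefers whichever of two
# trajectories has the higher true reward. In RRM, these pairwise judgements
# are where human supervision enters.
A = rng.normal(size=(2000, DIM))
B = rng.normal(size=(2000, DIM))
prefer_a = (A @ w_true > B @ w_true).astype(float)

# Fit the reward model with the Bradley-Terry preference loss:
# P(a preferred to b) = sigmoid(r(a) - r(b)), with r(x) = w_model @ x.
diffs = A - B
lr = 0.5
for _ in range(500):
    p = sigmoid(diffs @ w_model)
    w_model -= lr * diffs.T @ (p - prefer_a) / len(prefer_a)

# The learned model should now rank unseen pairs the way the annotator would,
# letting it stand in for direct human judgement at the next level up.
A_test = rng.normal(size=(500, DIM))
B_test = rng.normal(size=(500, DIM))
agree = np.mean((A_test @ w_model > B_test @ w_model)
                == (A_test @ w_true > B_test @ w_true))
print(f"held-out ranking agreement: {agree:.1%}")
```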

Tags: alignment safety scalable-oversight
