Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, & Shane Legg (2018). Scalable agent alignment via reward modeling: a research direction.
arXiv:1811.07871.
URL: https://arxiv.org/abs/1811.07871
Abstract. DeepMind's research-direction paper on scalable oversight. Introduces recursive reward modelling (RRM) as the central proposal: decompose a complex task into sub-tasks, each with its own learned reward model, decompose those further, and have humans supervise only at the leaves, where the comparison problem is within human reach. The supervisory signal then propagates upward via the structure of the decomposition. RRM is one of the canonical scalable-oversight schemes, alongside debate and weak-to-strong generalisation, and underlies OpenAI's "summarising books with human feedback" line of work.
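A minimal sketch of the decomposition structure described above, purely illustrative and not the paper's implementation: tasks form a tree, a stand-in reward function (fit to human comparisons in the real scheme) sits at each leaf, and internal nodes aggregate their children's reward signals upward. The `Task` class, mean aggregation, and leaf reward functions are all assumptions for the sake of the sketch.

```python
# Toy sketch of the recursive reward modelling (RRM) decomposition tree.
# Assumptions (not from the paper): the Task class, mean aggregation at
# internal nodes, and the stand-in leaf reward functions.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    name: str
    children: List["Task"] = field(default_factory=list)
    # Leaves carry a reward model trained on human comparisons;
    # here a stand-in function from outcome to scalar reward.
    leaf_reward: Optional[Callable[[float], float]] = None

    def reward(self, outcome: float) -> float:
        if not self.children:
            # At the leaves, the comparison problem is within human
            # reach, so the reward model is supervised directly.
            assert self.leaf_reward is not None
            return self.leaf_reward(outcome)
        # Internal nodes: the supervisory signal propagates upward via
        # the decomposition (mean aggregation is an arbitrary choice).
        return sum(c.reward(outcome) for c in self.children) / len(self.children)


# Hypothetical decomposition of a book-summarisation task into chapters.
summarise_book = Task("summarise-book", children=[
    Task("summarise-chapter-1", leaf_reward=lambda x: min(x, 1.0)),
    Task("summarise-chapter-2", leaf_reward=lambda x: 0.5 * x),
])
print(summarise_book.reward(1.0))  # mean of 1.0 and 0.5 -> 0.75
```

The tree depth is unbounded in principle: each leaf can itself be decomposed further, pushing the human comparisons down to wherever they remain tractable.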
Tags: alignment safety scalable-oversight
Cited in: