Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, & Paul Christiano (2021). Recursively Summarizing Books with Human Feedback.
arXiv:2109.10862.
URL: https://arxiv.org/abs/2109.10862
Abstract. OpenAI's empirical implementation of recursive reward modelling. The authors fine-tune GPT-3 to summarise novel-length books by recursively decomposing the task: summarise short chunks, then summarise the summaries, and repeat until a single book-level summary remains. Humans provide feedback only at the leaves, where the comparison problem is within human reach. Human evaluators then judge the book-level summaries against the original books, with strong agreement. The paper is the first concrete demonstration that recursive task decomposition with leaf-level human feedback can solve a task too complex for direct human supervision.
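The decomposition described above can be sketched as a short recursive procedure. This is an illustrative skeleton, not the paper's implementation: `summarize_chunk` stands in for the fine-tuned GPT-3 leaf summariser, and the fixed-width character chunking is a simplifying assumption.

```python
from typing import Callable


def summarize_recursively(
    text: str,
    summarize_chunk: Callable[[str], str],
    chunk_size: int = 2048,
) -> str:
    """Summarise arbitrarily long text by recursive task decomposition.

    Short chunks (the leaves) are summarised directly; the leaf summaries
    are concatenated and the procedure recurses on the result, until a
    single summary fits within one chunk.

    `summarize_chunk` is a hypothetical stand-in for the learned model;
    it must shrink its input for the recursion to terminate.
    """
    if len(text) <= chunk_size:
        return summarize_chunk(text)
    # Split into leaf-sized chunks and summarise each one independently.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    merged = "\n".join(summarize_chunk(c) for c in chunks)
    # Recurse: the concatenated summaries become the next level's input.
    return summarize_recursively(merged, summarize_chunk, chunk_size)
```

In the paper's setup, human comparisons are collected only for the leaf-level calls, which keeps each judgement within a length humans can actually read and compare.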
Tags: alignment rlhf scalable-oversight
Cited in: