Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, & Paul Christiano (2021). Recursively Summarizing Books with Human Feedback.
arXiv:2109.10862.
URL: https://arxiv.org/abs/2109.10862
Abstract. OpenAI's empirical implementation of recursive reward modelling. The authors fine-tune GPT-3 to summarise novel-length books by recursively decomposing the task: summarise short chunks, then summarise the summaries, and repeat until a single book-level summary remains. Humans provide feedback only at the leaves, where the comparison problem is within human reach. Human evaluators then judge the book-level summaries against the original books, with strong agreement. The paper is the first concrete demonstration that recursive task decomposition with leaf-level human feedback can solve a task too complex for direct human supervision.
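The decomposition described above can be sketched as a short recursive procedure. This is an illustrative skeleton, not the paper's implementation: `summarize_chunk` stands in for the fine-tuned GPT-3 leaf summariser, and the fixed-width character chunking is a simplifying assumption.

```python
from typing import Callable


def summarize_recursively(
    text: str,
    summarize_chunk: Callable[[str], str],
    chunk_size: int = 2048,
) -> str:
    """Summarise arbitrarily long text by recursive task decomposition.

    Short chunks (the leaves) are summarised directly; the leaf summaries
    are concatenated and the procedure recurses on the result, until a
    single summary fits within one chunk.

    `summarize_chunk` is a hypothetical stand-in for the learned model;
    it must shrink its input for the recursion to terminate.
    """
    if len(text) <= chunk_size:
        return summarize_chunk(text)
    # Split into leaf-sized chunks and summarise each one independently.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    merged = "\n".join(summarize_chunk(c) for c in chunks)
    # Recurse: the concatenated summaries become the next level's input.
    return summarize_recursively(merged, summarize_chunk, chunk_size)
```

In the paper's setup, human comparisons are collected only for the leaf-level calls, which keeps each judgement within a length humans can actually read and compare.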
Tags: alignment rlhf scalable-oversight
Cited in: