Abstract. Introduces reinforcement learning from human preferences (the technique underlying RLHF): learning a reward model from pairwise human preference judgements and then optimising a policy against it with a KL penalty toward a reference distribution.
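To make the two stages concrete, here is a minimal sketch, not any particular paper's implementation: a Bradley-Terry-style pairwise preference loss for the reward model, and a per-sample KL-penalised objective for the policy. The function names, the `beta` coefficient, and the toy tensors are all assumptions for illustration.

```python
# Minimal sketch of the two stages the abstract describes, using toy tensors
# in place of real models and data. All names here are illustrative, not from
# any specific codebase.
import torch
import torch.nn.functional as F

# --- Stage 1: fit a reward model on pairwise preferences (Bradley-Terry) ---
def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Model P(chosen > rejected) = sigmoid(r_chosen - r_rejected) and
    # maximise its log-likelihood over the labelled pairs.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# --- Stage 2: optimise the policy against the reward with a KL penalty ---
def kl_penalised_reward(reward: torch.Tensor,
                        logp_policy: torch.Tensor,
                        logp_reference: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    # Per-sample objective: r(x, y) - beta * (log pi(y|x) - log pi_ref(y|x)).
    # The second term penalises drift away from the reference distribution;
    # beta = 0.1 is an arbitrary illustrative value.
    return reward - beta * (logp_policy - logp_reference)

if __name__ == "__main__":
    torch.manual_seed(0)
    # Scalar rewards the reward model assigns to preferred / dispreferred completions.
    r_chosen, r_rejected = torch.randn(8), torch.randn(8)
    print("preference loss:", preference_loss(r_chosen, r_rejected).item())

    # Per-sequence rewards and log-probabilities under policy and reference.
    reward = torch.randn(8)
    logp_policy, logp_ref = torch.randn(8), torch.randn(8)
    print("mean KL-penalised reward:",
          kl_penalised_reward(reward, logp_policy, logp_ref).mean().item())
```

In practice the penalised reward feeds a policy-gradient method such as PPO; the sketch stops at the objective itself.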
Tags: rlhf, alignment, reinforcement-learning