References

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, & Chelsea Finn (2023)

arXiv preprint arXiv:2305.18290.

DOI: https://doi.org/10.48550/arxiv.2305.18290

Abstract. Introduces Direct Preference Optimization (DPO), which eliminates the explicit reward model and the PPO step of RLHF by optimising the policy directly on preference pairs with a simple classification-style loss. The key insight is a closed-form mapping between reward functions and optimal policies, which lets the preference likelihood be written in terms of the policy itself, so fitting a binary cross-entropy objective on preference pairs recovers the same optimum as reward modelling followed by RL. DPO is simpler to implement and more stable to train than PPO-based RLHF, while matching or exceeding its performance on tasks such as sentiment control and summarisation.
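A minimal sketch of the DPO objective in PyTorch, assuming per-sequence (summed token) log-probabilities are already computed; the function and variable names (dpo_loss, policy_chosen_logps, and so on) and the value beta = 0.1 are illustrative assumptions, not the paper's reference implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities for the
    chosen (preferred) or rejected completion under the trainable policy
    or the frozen reference model. `beta` scales the implicit KL penalty.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Classification-style objective: maximise the log-sigmoid of the
    # reward margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probabilities for a batch of two preference pairs:
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -11.2])
ref_chosen = torch.tensor([-12.9, -10.1])
ref_rejected = torch.tensor([-13.8, -11.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Note that no reward model is trained and no sampling from the policy is needed: the loss depends only on log-probabilities of fixed completions, which is what makes the method stable relative to PPO-based RLHF.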

Tags: rlhf alignment dpo
