References

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, & Chelsea Finn (2023)

arXiv preprint arXiv:2305.18290.

DOI: https://doi.org/10.48550/arxiv.2305.18290

Abstract. Introduces Direct Preference Optimization (DPO), which eliminates the explicit reward model and the PPO step of RLHF by optimising the policy directly on preference pairs with a simple classification-style loss. The key insight is a closed-form mapping between reward functions and optimal policies, which lets the preference likelihood be written in terms of the policy itself, so fitting a binary cross-entropy objective on preference pairs recovers the same optimum as reward modelling followed by RL. DPO is simpler to implement and more stable to train than PPO-based RLHF, while matching or exceeding its performance on tasks such as sentiment control and summarisation.
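A minimal sketch of the DPO objective in PyTorch, assuming per-sequence (summed token) log-probabilities are already computed; the function and variable names (dpo_loss, policy_chosen_logps, and so on) and the value beta = 0.1 are illustrative assumptions, not the paper's reference implementation:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs.

    Each argument is a tensor of summed token log-probabilities for the
    chosen (preferred) or rejected completion under the trainable policy
    or the frozen reference model. `beta` scales the implicit KL penalty.
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Classification-style objective: maximise the log-sigmoid of the
    # reward margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probabilities for a batch of two preference pairs:
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-14.1, -11.2])
ref_chosen = torch.tensor([-12.9, -10.1])
ref_rejected = torch.tensor([-13.8, -11.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

Note that no reward model is trained and no sampling from the policy is needed: the loss depends only on log-probabilities of fixed completions, which is what makes the method stable relative to PPO-based RLHF.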

Tags: rlhf alignment dpo
