Abstract. Introduces Direct Preference Optimization (DPO), which eliminates the explicit reward model and PPO step of RLHF by directly optimizing the policy on preference pairs through a classification-like loss on the log-probability ratios against a frozen reference model. DPO is simpler and more stable to train than PPO-based RLHF.
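A minimal sketch of the DPO objective for a single preference pair may make the "classification-like loss" concrete. The function below is a hypothetical illustration, not the paper's reference implementation: it takes sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model, forms the implicit reward margin, and applies a logistic (binary cross-entropy) loss. The numbers at the bottom are toy values.

```python
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair (toy scalar version).

    Inputs are sequence log-probabilities log pi(y | x). The implicit
    reward of a response is beta * (log pi_theta - log pi_ref); the loss
    is -log sigmoid(reward_chosen - reward_rejected), i.e. logistic
    regression on which response is preferred.
    """
    logits = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log sigma(logits): binary cross-entropy with target "chosen wins"
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Hypothetical log-probs: the policy slightly favours the chosen answer
# relative to the reference, so the loss is a bit below log(2) ~= 0.693.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
print(round(loss, 4))  # → 0.5981
```

In practice the log-probabilities come from summing token log-probs of each response under the two models, and the loss is averaged over a batch of preference pairs; only the policy receives gradients, while the reference model stays frozen.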