DPO variants are the family of preference-optimisation objectives that emerged in 2023–2024 to address specific failure modes of DPO (Rafailov et al., 2023). Each variant retains the core insight (the optimal RLHF policy under a Bradley-Terry preference model can be learned directly from preference pairs, without an explicit reward model) and reformulates the loss to fix a particular weakness of the original.
DPO itself is the baseline. Given preference pairs $(x, y_w, y_l)$ with $y_w$ preferred over $y_l$, it minimises
$$\mathcal{L}_\mathrm{DPO} = -\mathbb{E}\left[ \log \sigma\Big( \beta \log \frac{\pi_\theta(y_w | x)}{\pi_\mathrm{ref}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_\mathrm{ref}(y_l | x)} \Big) \right],$$
pushing the policy log-ratio for $y_w$ above that for $y_l$ by a margin scaled by $\beta$. The headline failure mode is overfitting: the log-sigmoid loss only vanishes as the margin grows without bound, so on near-deterministic preference pairs the model keeps pushing $\pi_\theta(y_w)$ ever higher relative to $\pi_\mathrm{ref}$, degrading the policy on unseen prompts.
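As a concrete reference, here is a minimal PyTorch sketch of this loss, assuming sequence-level log-probabilities (summed over response tokens) have already been computed for both the policy and the frozen reference model; the function and tensor names are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss from sequence-level log-probabilities, each of shape (batch,)."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()                             # -E[log sigma(margin)]
```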
IPO (Identity Preference Optimisation), Azar et al. (2023), replaces the log-sigmoid loss with a squared loss against a fixed target margin, removing the over-optimisation tendency:
$$\mathcal{L}_\mathrm{IPO} = \mathbb{E}\left[ \Big( \log \frac{\pi_\theta(y_w | x) \pi_\mathrm{ref}(y_l | x)}{\pi_\mathrm{ref}(y_w | x) \pi_\theta(y_l | x)} - \frac{1}{2\tau} \Big)^2 \right].$$
The squared loss is minimised at the finite target margin $\frac{1}{2\tau}$, so pairs that already satisfy the margin receive no further gradient and the log-ratio is never pushed towards infinity.
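A matching sketch of the IPO loss, continuing from the DPO example above (same sequence-level log-probability inputs, illustrative names):

```python
def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, tau=0.1):
    """IPO: squared deviation of the log-ratio margin from the target 1/(2*tau)."""
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```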
KTO (Kahneman-Tversky Optimisation), Ethayarajh et al. (2024), eliminates the requirement for paired data. Instead of $(y_w, y_l)$ pairs, KTO needs only unary desirable/undesirable labels per response, modelled with a Kahneman-Tversky prospect-theory utility. The objective rewards desirable responses and penalises undesirable ones relative to a reference, with separate gain/loss aversion coefficients $\lambda_D$ and $\lambda_U$, typically $\lambda_U > \lambda_D$ to reflect human loss aversion. KTO is much more practical at deployment scale because thumbs-up/thumbs-down feedback is far easier to collect than ranked pairs.
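The paper's full objective estimates a KL reference point from mismatched prompt/response pairs within each batch; the sketch below is a simplified KTO-style loss that treats that reference point as a precomputed scalar, with all names illustrative. The $\beta$, $\lambda_D$, and $\lambda_U$ arguments correspond to the coefficients described above:

```python
import torch

def kto_loss(policy_logps, ref_logps, is_desirable, kl_estimate,
             beta=0.1, lambda_d=1.0, lambda_u=1.33):
    """Simplified KTO-style loss over unpaired, binary-labelled responses.

    policy_logps, ref_logps : (batch,) sequence log-probs under policy / reference.
    is_desirable            : (batch,) bool, thumbs-up vs thumbs-down label.
    kl_estimate             : scalar reference point (the paper estimates it per batch;
                              treated here as a precomputed constant).
    """
    reward = policy_logps - ref_logps                              # implicit reward log pi_theta/pi_ref
    gain_value = torch.sigmoid(beta * (reward - kl_estimate))      # value of a desirable response
    loss_value = torch.sigmoid(beta * (kl_estimate - reward))      # value of an undesirable response
    per_example = torch.where(
        is_desirable,
        lambda_d * (1.0 - gain_value),   # desirable responses below the reference point are penalised
        lambda_u * (1.0 - loss_value),   # undesirable ones above it are penalised more (loss aversion)
    )
    return per_example.mean()
```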
ORPO (Odds Ratio Preference Optimisation), Hong et al. (2024), folds preference learning into the SFT stage by adding an odds-ratio penalty to the standard log-likelihood:
$$\mathcal{L}_\mathrm{ORPO} = \mathcal{L}_\mathrm{SFT}(y_w) - \lambda \cdot \log \sigma\Big( \log \frac{\mathrm{odds}_\theta(y_w | x)}{\mathrm{odds}_\theta(y_l | x)} \Big), \qquad \mathrm{odds}_\theta(y | x) = \frac{P_\theta(y | x)}{1 - P_\theta(y | x)},$$
where $P_\theta(y | x)$ is the length-normalised likelihood of the response, so minimising the loss maximises the log odds ratio of $y_w$ over $y_l$.
Crucially, ORPO has no reference model $\pi_\mathrm{ref}$: it does SFT and preference optimisation in a single stage with a single forward/backward pass, roughly halving memory because no frozen reference copy is held.
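A minimal sketch of this combined objective, assuming per-token-averaged log-probabilities for both responses and the usual SFT negative log-likelihood of $y_w$ are already available (names illustrative):

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps_avg, rejected_logps_avg, sft_nll, lam=0.1):
    """ORPO: SFT loss on y_w plus an odds-ratio penalty; no reference model."""
    # log odds(y|x) = log p - log(1 - p), with p the length-normalised likelihood,
    # computed in log-space for numerical stability
    log_odds_chosen = chosen_logps_avg - torch.log1p(-torch.exp(chosen_logps_avg))
    log_odds_rejected = rejected_logps_avg - torch.log1p(-torch.exp(rejected_logps_avg))
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (sft_nll + lam * odds_ratio_term).mean()
```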
SimPO, Meng et al. (2024), removes the reference model entirely with a simpler length-normalised contrastive objective:
$$\mathcal{L}_\mathrm{SimPO} = -\log \sigma\Big( \frac{\beta}{|y_w|} \log \pi_\theta(y_w | x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l | x) - \gamma \Big).$$
The length normalisation by $|y|$ counteracts DPO's known length bias (without it, the objective can be satisfied by exploiting response length rather than quality), and the explicit margin $\gamma$ replaces the implicit margin supplied by the reference model. SimPO often matches or beats DPO on chat benchmarks at roughly half the memory.
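A sketch of the SimPO loss under the same conventions (summed log-probabilities plus token counts; names illustrative):

```python
import torch.nn.functional as F

def simpo_loss(chosen_logps_sum, rejected_logps_sum,
               chosen_len, rejected_len, beta=2.0, gamma=1.0):
    """SimPO: length-normalised contrastive loss with an explicit margin; no reference model."""
    chosen_reward = beta * chosen_logps_sum / chosen_len        # beta-scaled average log-prob of y_w
    rejected_reward = beta * rejected_logps_sum / rejected_len  # beta-scaled average log-prob of y_l
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```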
The practical landscape circa 2025: DPO remains the most widely used; ORPO and SimPO win on memory-constrained training; KTO wins when only thumbs-up/down feedback is available; IPO is preferred when overfitting is a concern. All five are available as drop-in alternatives to PPO-based RLHF in Hugging Face's TRL library, and most production post-training stacks now run a DPO variant rather than full RLHF for the alignment phase.
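For illustration, a minimal training sketch with TRL's DPOTrainer. This assumes a recent TRL release; exact argument names (for example `processing_class` versus `tokenizer`) and defaults differ between versions, so treat it as a starting point rather than a pinned recipe. The model and dataset names are placeholders:

```python
# Minimal DPO run with TRL; loss_type="ipo" switches to the IPO squared loss.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder base model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference dataset with "prompt", "chosen", "rejected" columns (example dataset).
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

config = DPOConfig(
    output_dir="dpo-model",
    beta=0.1,             # the beta in the DPO loss above
    loss_type="sigmoid",  # "ipo" selects the IPO objective instead
)

trainer = DPOTrainer(
    model=model,                 # ref_model defaults to a frozen copy of the policy
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL releases call this `tokenizer`
)
trainer.train()
```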
Related terms: Direct Preference Optimization, RLHF, PPO, Group Relative Policy Optimization, Policy Gradient Theorem
Discussed in:
- Chapter 16: Ethics & Safety, Direct Preference Optimisation