15.6 DPO and the reward-free family
In the previous section we walked through RLHF, which is the recipe that took raw GPT-style models and turned them into the polite, helpful chat assistants people now use every day. The recipe had three ingredients: a supervised fine-tune, a separately trained reward model that scores responses, and a reinforcement-learning loop (typically PPO) that nudges the policy toward higher-reward completions while a KL penalty keeps it from drifting too far from the reference. It works, but it is fiddly. You hold four large models in GPU memory at once: the policy being trained, a frozen reference, the reward model, and a value network used to estimate advantages. Hyperparameters are sensitive, training is unstable, and a small bug in the reward model can quietly poison the whole run.
In May 2023 a paper by Rafael Rafailov and colleagues at Stanford showed something startling. The reward model and the RL loop are not actually necessary. With a few lines of algebra you can fold the reward into the policy itself and turn the whole alignment problem into a single supervised classification loss over preference pairs. They called the method Direct Preference Optimisation, or DPO. It is simpler to implement, almost as effective as PPO-based RLHF in practice, and it became the default preference-tuning method for open-weight chat models almost overnight. Zephyr, Tulu, and a large share of the community fine-tunes built on Llama-2 and Mistral bases used DPO rather than PPO.
This section explains why DPO works, what its loss function is, why it is so much easier than RLHF, what variants have appeared since (IPO, KTO, ORPO, SimPO), where it succeeds, where it falls short, and where it is used in practice. It bridges §15.5 (RLHF, the original recipe) and §15.7 (GRPO, which extends preference-style RL to reasoning models).
The DPO derivation
The cleverness of DPO is purely mathematical. Nothing about the training data changes: you still need pairs of responses with a human-labelled preference. What changes is that you skip the step of training a separate scorer.
Recall the RLHF objective. We want a policy that earns high reward but does not run off into gibberish. So we maximise expected reward minus a KL penalty against the reference:
$$\max_\pi \; \mathbb{E}_{y \sim \pi(\cdot \mid x)}[r(x, y)] - \beta \, \mathbb{D}_{\text{KL}}(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)).$$
If you treat $\pi$ as a free-form probability distribution and use a Lagrange multiplier for the constraint that probabilities sum to one, this objective has a closed-form maximum:
$$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \, \exp\!\left(\frac{1}{\beta} r(x, y)\right),$$
where $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp(r(x,y)/\beta)$ is a normalising constant. Read this carefully. The optimal aligned policy is just the reference policy re-weighted by an exponentiated reward. High-reward answers get boosted; low-reward answers get squashed; the temperature $\beta$ controls how sharply.
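To see why, it helps to fill in the step the argument glosses over: the objective can be rewritten as a single KL divergence to the reward-tilted reference (a standard manipulation, spelled out in the DPO paper's appendix):
$$\mathbb{E}_{y \sim \pi(\cdot \mid x)}[r(x, y)] - \beta \, \mathbb{D}_{\text{KL}}(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)) = -\beta \, \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)\,\exp\!\bigl(r(x, y)/\beta\bigr)}\right] = -\beta \, \mathbb{D}_{\text{KL}}\bigl(\pi(\cdot \mid x) \,\|\, \pi^*(\cdot \mid x)\bigr) + \beta \log Z(x).$$
Since $\log Z(x)$ does not depend on $\pi$, the objective is maximised exactly when the KL term is zero, which happens when $\pi = \pi^*$.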
This formula is elegant but useless on its own. To actually compute $\pi^*$ you would need to evaluate $Z(x)$, which sums over every possible token sequence, an astronomical set. RL with PPO sidesteps this by sampling.
DPO sidesteps it differently. Take the equation and rearrange it for $r$:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x).$$
In words: the reward of a response is, up to a prompt-dependent constant, the log-ratio of the aligned policy to the reference, scaled by $\beta$. The reward is hidden inside the policy. Now plug this into the Bradley-Terry model from §15.5, which says the probability that a human prefers $y_w$ over $y_l$ is the sigmoid of the reward difference:
$$\Pr(y_w \succ y_l \mid x) = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr).$$
When you substitute $r$ from the equation above, the awkward $\beta \log Z(x)$ term, which depends only on $x$, not on the response, appears with the same sign for $y_w$ and $y_l$ and cancels exactly. What is left is
$$\Pr(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right).$$
This is a likelihood we can directly optimise. Replace $\pi^*$ with our trainable policy $\pi_\theta$ and minimise the negative log-likelihood of the observed preferences. The DPO loss is
$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right].$$
Stare at this for a moment, because it is doing real work. Each preference pair contributes a single scalar. To compute it you only need the log-probabilities the two policies assign to $y_w$ and to $y_l$. There is no reward model, no roll-out, no separate value function, no sampling beyond the data you already collected. The reward has been absorbed into the implicit log-ratio $\beta \log(\pi_\theta / \pi_{\text{ref}})$.
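In code the loss is a handful of lines. Here is a minimal sketch in PyTorch, assuming you already have the summed per-response log-probabilities from the two models (the function and variable names are mine, not a reference implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is a (batch,) tensor of log pi(y | x) summed over
    the response tokens; 'chosen' is y_w and 'rejected' is y_l."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-likelihood of the preference under Bradley-Terry.
    loss = -F.logsigmoid(chosen_reward - rejected_reward)
    return loss.mean()
```

The difference between the chosen and rejected implicit rewards is the reward margin; tracking it during training is a cheap sanity check that the model is actually separating winners from losers.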
Why this is much simpler than RLHF
To appreciate the simplification, line up what each method actually has to do per training step.
RLHF with PPO keeps four things in memory: the policy $\pi_\theta$, the frozen reference $\pi_{\text{ref}}$, the reward model, and a value network used to estimate advantages (the old-policy log-probabilities needed for the PPO importance ratio are cached at generation time). Each step you sample a batch of completions from the current policy (auto-regressive generation, not cheap), score each completion with the reward model, compute advantages with GAE, evaluate the PPO clipped surrogate, add a KL penalty term, and back-propagate. Generation alone burns most of the wall-clock time. PPO's clipping range, the KL coefficient, the value-function coefficient, and the entropy bonus all need tuning. If the reward model is even slightly miscalibrated, the policy will exploit its weak spots: the famous reward-hacking failure mode.
DPO keeps two things: the policy $\pi_\theta$ and the frozen reference $\pi_{\text{ref}}$. Each step you take a preference pair from your dataset, push $y_w$ and $y_l$ through both networks to get four log-probabilities, plug them into the loss above, and back-propagate. There is no sampling. There is no reward model to train and to fight with. Training looks identical to ordinary supervised cross-entropy fine-tuning, and you can use the same Adam optimiser, the same learning-rate schedule, the same gradient-accumulation tricks you already use for SFT.
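The only plumbing that step needs is a routine that turns a forward pass into a per-sequence log-probability. A hedged sketch, assuming a Hugging Face-style causal LM whose output exposes .logits and a mask marking which positions belong to the response (the names here are illustrative):

```python
import torch

def sequence_logprob(model, input_ids, attention_mask, response_mask):
    """Sum of the log-probabilities the model assigns to the response tokens."""
    out = model(input_ids=input_ids, attention_mask=attention_mask)
    # Logits at position t predict token t+1, hence the shift by one.
    logps = torch.log_softmax(out.logits[:, :-1, :], dim=-1)
    labels = input_ids[:, 1:]
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    # Zero out prompt and padding positions so only the response counts.
    mask = response_mask[:, 1:].to(token_logps.dtype)
    return (token_logps * mask).sum(dim=-1)   # shape: (batch,)
```

Calling this four times per batch (policy and reference, on $y_w$ and on $y_l$, with the reference passes wrapped in torch.no_grad()) produces exactly the four tensors the loss above consumes.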
The practical consequences are large. GPU memory drops by roughly half. Training time drops by even more, because no auto-regressive sampling happens during the loss computation. Stability improves: DPO does not blow up the way PPO sometimes does. A small lab with a single 8-GPU node can run DPO on a 7-billion-parameter model in a few hours; matching that with PPO would be a multi-day production. Hyperparameter sensitivity is roughly that of supervised fine-tuning: you mainly need to set $\beta$ (a typical value is between 0.1 and 0.5) and a learning rate (often around $5 \times 10^{-7}$ for 7B models). These are easy to sweep over.
The cost of the simplification is conceptual: with no reward model in the loop you cannot do online RL, you cannot easily add new reward signals after the fact, and you are committed to the preference data you have. For a great many practical alignment jobs this is an entirely acceptable trade.
Variants
Within a year of the DPO paper, a small zoo of variants appeared. Each one modifies one piece of the DPO loss to fix a specific complaint.
IPO, Identity Preference Optimisation (Azar et al., 2023). The DPO loss uses a sigmoid, which saturates. When the preference data is unanimous (every annotator picks $y_w$), the implicit reward gap can grow without bound, leading to over-fitting and degenerate solutions. IPO replaces the sigmoid with a squared-error term. The loss becomes a regression of the implicit log-ratio toward a fixed target, which keeps the optimisation well-behaved on noisy or unanimous data.
KTO, Kahneman-Tversky Optimisation (Ethayarajh et al., 2024). DPO needs paired data: for every prompt you must collect both a winner and a loser. KTO drops this requirement and works with unpaired (prompt, response, label) triples, where the label is just "good" or "bad". The loss is asymmetric, modelled on prospect theory's loss-aversion curve from behavioural economics. This makes data collection much cheaper, because you can recycle existing single-rated data.
SimPO, Simple Preference Optimisation (Meng et al., 2024). Even with DPO you must keep the reference policy in memory and run two forward passes per example. SimPO removes the reference entirely and normalises by response length instead. Surprisingly, this works competitively on most benchmarks while halving memory and forward-pass cost.
ORPO, Odds-Ratio Preference Optimisation (Hong et al., 2024). Standard practice runs SFT first, then preference training. ORPO combines them. It adds an odds-ratio preference term to the ordinary SFT cross-entropy loss, so a single training stage handles both objectives. This is convenient when you have mixed data and a tight compute budget.
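To make two of these concrete, here is a hedged sketch of the IPO and SimPO objectives alongside the DPO loss above, following the published formulas with variable names of my own; KTO and ORPO are omitted because they need extra ingredients (unpaired-data handling, an SFT term). The inputs are the same summed response log-probabilities as before, len_w and len_l are response lengths in tokens, and the hyperparameter defaults are illustrative rather than recommendations.

```python
import torch.nn.functional as F

def ipo_loss(pol_w, pol_l, ref_w, ref_l, tau=0.1):
    # Squared-error regression of the log-ratio gap toward 1/(2*tau):
    # unanimous preferences can no longer drive the gap without bound.
    gap = (pol_w - ref_w) - (pol_l - ref_l)
    return ((gap - 1.0 / (2.0 * tau)) ** 2).mean()

def simpo_loss(pol_w, pol_l, len_w, len_l, beta=2.0, gamma=0.5):
    # No reference model: length-normalised log-probabilities stand in for
    # the log-ratio, and a fixed margin gamma separates winner from loser.
    margin = beta * (pol_w / len_w) - beta * (pol_l / len_l) - gamma
    return -F.logsigmoid(margin).mean()
```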
The space now resembles classification losses circa 2010: many close cousins, modest empirical differences, no clear universal winner. A reasonable starting prescription: use DPO with clean paired data; switch to ORPO when you want to combine SFT and preference training into one stage; to KTO when you can only afford pointwise labels; to SimPO when you are GPU-constrained; and to IPO when your preferences are noisy or unanimous.
When DPO works (and doesn't)
DPO is at its best when the preference data is clean, the reference policy is already well-aligned by SFT, and the desired behaviour is a relatively narrow steering of an already capable model. In that regime it routinely matches PPO-based RLHF on chat benchmarks while being a fraction of the engineering effort.
It struggles in three settings.
First, when the reward signal is genuinely noisy. A separately trained reward model can absorb labelling noise by averaging across many examples; DPO sees each pair directly and has no such smoothing mechanism. If 20% of your annotators disagree on what counts as a better answer, DPO will internalise that disagreement as conflicting gradients.
Second, when exploration matters. PPO actively samples new completions during training and so can discover phrasings the reference would not have produced. DPO is purely off-policy with respect to a fixed dataset, so it can only re-weight responses that already appear in the data. For tasks like creative reasoning chains or tool-use trajectories, this matters. GRPO (§15.7), used for the DeepSeek-R1 reasoning models, returns to on-policy sampling for exactly this reason.
Third, when the preference distribution is far from the reference. The loss assumes $\pi_\theta$ stays close to $\pi_{\text{ref}}$; the KL penalty is built into the derivation. If you push too hard with too small a $\beta$, the implicit reward inflates without any external check and you can push probability mass onto degenerate strings. In practice this shows up as outputs that earn high implicit reward but are nonsense. The fix is to keep $\beta$ on the larger end (closer to 0.5) when in doubt, or to use IPO.
DPO and PPO-style RLHF are now both routine tools, and the choice between them is a question of engineering trade-offs rather than fundamental capability.
Where DPO is used
DPO is everywhere in the open-weight ecosystem. Community chat fine-tunes built on Llama-2 and Mistral bases (Zephyr, Tulu-2, Nous-Hermes, and many others) lean heavily on DPO, and Qwen, Yi, and DeepSeek base-model alignments include DPO stages. Hugging Face's trl library ships a DPOTrainer class, downloaded millions of times, that reduces the method to a few lines of user code. For academic groups, hobbyists, and small companies who lack the GPU budget for a full PPO pipeline, DPO has democratised preference tuning.
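For a sense of scale, here is a sketch of what a trl-based run looks like. The exact keyword names shift between trl releases (older versions pass the tokenizer as tokenizer= rather than processing_class=), and the checkpoint and file paths are placeholders, so treat this as the shape of the code rather than a pinned recipe.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("my-sft-checkpoint")   # placeholder path
tokenizer = AutoTokenizer.from_pretrained("my-sft-checkpoint")

# The trainer expects "prompt", "chosen", and "rejected" columns.
pairs = load_dataset("json", data_files="preference_pairs.jsonl", split="train")

config = DPOConfig(output_dir="dpo-out", beta=0.1, learning_rate=5e-7)
trainer = DPOTrainer(model=model, args=config,
                     train_dataset=pairs, processing_class=tokenizer)
trainer.train()   # with no ref_model given, trl uses a frozen copy of the initial policy
```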
The frontier labs are quieter about exactly what they use, but published reports suggest a mixed picture. Anthropic's Claude line uses Constitutional AI followed by RLHF-style training; OpenAI's GPT-4 family used PPO-based RLHF for the headline models, and reportedly experimented with DPO for some downstream variants; Google's Gemini reports describe both. The frontier seems to use PPO-style RL when on-policy exploration helps (especially for reasoning) and DPO-style preference losses for cheaper alignment passes.
Beyond chat, DPO has been applied to image-generation models, code models, summarisation, translation post-editing, and (with appropriate redefinition of "response") to multi-step tool-use and agent trajectories. The math does not care what the responses are, only that you have preference pairs. Diffusion-DPO, for example, applies the same loss to image diffusion models by reframing the log-ratio as a log-likelihood of denoising trajectories; preference-pair data of "I prefer this image over that one" then steers the model toward more aesthetically pleasing or instruction-following outputs without any image-quality reward model. Code-DPO trains code-generation models on pairs of (correct, incorrect) completions; the loss does not care that the preference comes from a unit-test runner rather than from a human. This generality is a quiet but important consequence of the math: by hiding the reward inside the policy, DPO removes the requirement that rewards be human-shaped.
Within enterprise deployments, a common pattern is a small final DPO pass on top of a larger SFT-and-RLHF base. A team will take a model the foundation lab has already aligned with full RLHF, collect a few thousand domain-specific preference pairs (legal writing style, medical voice, internal product tone), and run a brief DPO fine-tune. This kind of last-mile customisation would have been impractical with a full PPO pipeline. With DPO it is overnight work for a single engineer.
What you should take away
DPO replaces RLHF's reward model and PPO loop with a single classification loss. The same preference pairs feed in, but the training looks like supervised fine-tuning rather than RL.
The trick is algebraic. The closed-form RLHF optimum is a re-weighting of the reference policy by the exponentiated reward, so the reward can be re-expressed as a log-ratio of policy to reference. Substituted into Bradley-Terry, the awkward partition function cancels and a simple sigmoid-of-log-ratio loss remains.
DPO is dramatically simpler in practice. Two models in memory instead of four, no roll-outs, no reward-model training, standard Adam optimisation. Training time drops by an order of magnitude on small GPUs and stability is much improved.
A family of variants now exists. IPO fixes saturation issues, KTO works with unpaired labels, SimPO drops the reference policy, ORPO folds SFT and preference into one stage. None dominates universally; pick by what your data and compute allow.
DPO is not always the right choice. When preferences are noisy, when active exploration matters, or when the desired policy is far from the reference, PPO-style RLHF or its reasoning-focused descendant GRPO (§15.7) may still pay for the extra complexity. For most alignment work on open-weight models, however, DPO is now the default.