15.5 RLHF: from preferences to policies

A pretrained language model is, at heart, a very accomplished mimic. Trained on hundreds of billions of words of internet text, books, papers, and code, it learns one job: given a stretch of text, guess what comes next. Ask it a question and it will produce text that looks like what people on the internet write after such questions, which is sometimes a clear answer, sometimes a list of further questions, sometimes a snarky reply, and sometimes a rambling tangent that drifts into unrelated territory. The model is fluent. It is not, in any deliberate sense, helpful.

Reinforcement Learning from Human Feedback, almost always shortened to RLHF, is the technique that closes this gap. It is a post-training procedure: after the expensive pretraining run finishes, we take the resulting base model and nudge it, using human judgements, toward producing the kind of replies that people actually want. The nudging is done in two steps. First we collect human comparisons, pairs of model outputs where a person says "this one is better than that one", and we train a small neural network, called a reward model, to imitate those judgements. Then we use reinforcement learning to push the language model toward producing replies that the reward model scores highly, while a leash keeps the model from straying too far from where it started.

This procedure, in various dialects, is what turned GPT-3 into InstructGPT and then into ChatGPT. It is what makes Claude write polite, careful answers. It sits inside Llama 2 and Llama 3, Gemini, and effectively every modern frontier model that you can interact with by typing a question. RLHF did not invent the underlying machinery (Bradley–Terry comparisons date to 1952, PPO to 2017), but its synthesis, which arrived in published form in Christiano et al. (2017) and was scaled up in Ouyang et al. (2022), is one of the small handful of ideas that distinguish the chatbots of today from the inscrutable next-token predictors of five years ago.

This section covers what RLHF is, why it works, and where it tends to break. The previous section, §15.4, covered supervised fine-tuning (SFT), which is RLHF's necessary precursor: the model has to be capable of producing reasonable replies before we can compare them. The next section, §15.6, covers Direct Preference Optimisation (DPO), a reward-free alternative that has, in the years since its 2023 publication, taken over a large fraction of what was once RLHF territory.

Symbols used here

  • $\pi_\theta$: language model policy parameterised by $\theta$
  • $\pi_{\text{ref}}$: reference policy (the initial SFT model)
  • $r_\phi$: reward model with parameters $\phi$
  • $\beta$: KL penalty coefficient
  • $y_w, y_l$: winning and losing response in a preference pair
  • $x$: prompt (the input we condition on)

The three-stage pipeline

RLHF, as practised in 2026, is a three-stage assembly line. Each stage hands its output to the next, and a slip-up in any one stage tends to manifest later as a model that is unhelpful, dishonest, or both.

Stage one: Supervised fine-tuning (SFT). The base pretrained model is fine-tuned on a relatively small dataset of high-quality instruction-following demonstrations. These are prompt-response pairs written by humans (or curated from existing chat logs) where the response is what we would like the model to produce. The dataset is small by pretraining standards, tens of thousands of pairs, perhaps a hundred thousand, and the training run is short, typically one to three epochs. The objective is the ordinary next-token cross-entropy. After this stage, the model knows the conversational shape of a helpful answer: it stops rambling, it addresses the question, it produces something in the right register.
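
To make the objective concrete, here is a minimal sketch in PyTorch, assuming a model that returns per-token logits over the vocabulary; the prompt tokens are masked out so that only the response tokens contribute to the loss. The tensor names are ours, not any particular library's.

```python
import torch.nn.functional as F

def sft_loss(logits, tokens, response_mask):
    """Next-token cross-entropy, counted over response tokens only.

    logits:        (batch, seq_len, vocab) outputs of the model
    tokens:        (batch, seq_len) token ids of prompt + response
    response_mask: (batch, seq_len) 1.0 for response tokens, 0.0 for
                   prompt tokens, which we do not train on
    """
    # Position t predicts token t+1, so shift logits and targets by one.
    pred = logits[:, :-1, :]
    target = tokens[:, 1:]
    mask = response_mask[:, 1:].float()

    nll = F.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target.reshape(-1), reduction="none"
    ).reshape(target.shape)
    return (nll * mask).sum() / mask.sum()
```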

Stage two: Reward modelling. We now collect preference data. A prompt $x$ is shown to the SFT model, which generates two distinct responses $y_1$ and $y_2$, and a human annotator picks the one they prefer. The pair is recorded as $(x, y_w, y_l)$, where $y_w$ is the winner and $y_l$ is the loser. After perhaps ten thousand to a hundred thousand such comparisons, we train a separate neural network, the reward model $r_\phi$, to take a prompt and a response and emit a single scalar score, with the constraint that $r_\phi(x, y_w) > r_\phi(x, y_l)$ for as many pairs as possible. The reward model is, in effect, a learned imitation of human taste.
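
Stored as data, a comparison is nothing more than the prompt and the two responses in ranked order. A sketch of such a record, with field names of our own choosing, using the photosynthesis example from the worked example later in this section:

```python
from typing import NamedTuple

class PreferencePair(NamedTuple):
    prompt: str     # x
    chosen: str     # y_w, the response the annotator preferred
    rejected: str   # y_l, the response they passed over

pair = PreferencePair(
    prompt="Tell me about photosynthesis.",
    chosen="Photosynthesis is the process by which green plants ...",
    rejected="Photosynthesis is when plants make food from sunlight.",
)
```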

Stage three: Reinforcement learning. We now use the reward model as a teacher to push the SFT model toward producing higher-scoring responses. Concretely, we treat the language model as a policy $\pi_\theta$ that, given a prompt, generates a response token by token, and we use reinforcement learning to maximise

$$ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\bigl[r_\phi(x, y)\bigr] - \beta \, \mathbb{D}_{\text{KL}}\bigl(\pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\text{ref}}(\cdot \mid x)\bigr). $$

The first term rewards the policy for generating responses the reward model likes. The second term, scaled by $\beta$, penalises the policy for drifting too far from the reference policy $\pi_{\text{ref}}$, which is just the frozen SFT model from stage one. This penalty is the leash. Without it, the policy will discover and exploit any flaws in the reward model and produce gibberish that scores well; with it, the policy makes only modest, well-behaved adjustments. The whole optimisation is solved with Proximal Policy Optimisation (PPO), the same algorithm that has been a workhorse of game-playing reinforcement learning since 2017.
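
The quantity being maximised, for a single prompt and a single sampled response, reduces to a few lines, assuming the reward model's score and the summed log-probabilities under the policy and the reference are already in hand; the difference of log-probabilities is a single-sample estimate of the KL term.

```python
def rlhf_objective(reward, policy_logprob, ref_logprob, beta=0.1):
    """KL-regularised reward for one sampled response.

    reward:         scalar r_phi(x, y) from the reward model
    policy_logprob: sum over response tokens of log pi_theta(y_t | x, y_<t)
    ref_logprob:    the same sum under the frozen SFT reference
    beta:           strength of the KL leash
    """
    # log pi_theta(y|x) - log pi_ref(y|x) for the sampled y is a
    # single-sample Monte Carlo estimate of the KL divergence.
    kl_estimate = policy_logprob - ref_logprob
    return reward - beta * kl_estimate
```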

Each successive stage typically costs more in wall-clock time but less in human labour. Stage one needs writers; stage two needs annotators; stage three is pure compute. Together they form the recipe that took GPT-3 to ChatGPT.

Reward modelling

The mathematical heart of stage two is the Bradley–Terry model (Bradley & Terry, 1952), which assumes there is some true latent reward $r^*(x, y)$ such that the probability a human picks $y_w$ over $y_l$ is

$$ \Pr(y_w \succ y_l \mid x) = \frac{\exp r^*(x, y_w)}{\exp r^*(x, y_w) + \exp r^*(x, y_l)} = \sigma\bigl(r^*(x, y_w) - r^*(x, y_l)\bigr). $$

This is, charmingly, the same equation that produces Elo ratings in chess. The probability that one player beats another depends on the difference in their Elo numbers, passed through a sigmoid. We are simply assigning Elo ratings to language model outputs.
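
As a quick numerical illustration, hypothetical scores of 1.3 for the winner and 0.2 for the loser imply that a human should prefer the winner about 75% of the time:

```python
import math

def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response scored r_w beats r_l."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))

print(preference_probability(1.3, 0.2))  # ~0.75
```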

Given a dataset $\mathcal{D} = \{(x_i, y_w^i, y_l^i)\}$, we fit a parameterised reward model $r_\phi$ by minimising the negative log-likelihood:

$$ \mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr). $$

In words: the reward model is trained so that, for every preference pair, it scores the winner higher than the loser by enough that a sigmoid rounds the difference toward one. The model never sees an absolute target score; it only sees relative comparisons. That is helpful, because absolute scores from human raters are notoriously unreliable: people disagree about whether a given answer is a 7 or an 8, but they agree much more readily on whether one answer is better than another.
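
In code, the loss is a couple of lines, assuming the reward model has already scored the chosen and rejected responses of a batch; PyTorch's logsigmoid is used for numerical stability.

```python
import torch.nn.functional as F

def reward_model_loss(chosen_scores, rejected_scores):
    """Negative log-likelihood of the Bradley-Terry model.

    chosen_scores, rejected_scores: (batch,) tensors holding
    r_phi(x, y_w) and r_phi(x, y_l) for a batch of preference pairs.
    """
    # logsigmoid is numerically safer than log(sigmoid(.)) for large margins.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```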

Architecturally, the reward model is almost always initialised from the SFT model, with the language modelling head (which predicts a probability distribution over the vocabulary) replaced by a scalar head, a single linear layer that projects the final-layer hidden state at the end-of-sequence token down to one number. This is parameter-efficient: you reuse all of the language understanding and only train a fresh scalar projection plus, optionally, a few of the upper layers.
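
A structural sketch of that head, assuming a placeholder backbone that returns final-layer hidden states; everything apart from the "linear layer on the end-of-sequence hidden state" idea is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """An SFT backbone with its language-modelling head swapped for a scalar head."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # transformer returning hidden states
        self.scalar_head = nn.Linear(hidden_size, 1)  # the new scalar projection

    def forward(self, input_ids, eos_positions):
        # hidden: (batch, seq_len, hidden_size) final-layer states
        hidden = self.backbone(input_ids)
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        # Take the hidden state at each sequence's end-of-sequence token ...
        eos_hidden = hidden[batch_idx, eos_positions]
        # ... and project it down to a single score per (prompt, response).
        return self.scalar_head(eos_hidden).squeeze(-1)
```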

How much preference data is enough? Production-grade reward models in 2026 are typically trained on between 10,000 and 1,000,000 preference pairs, depending on the budget and the breadth of the desired behaviour. For comparison, OpenAI's original InstructGPT reward model used about 33,000 pairs; Anthropic's Helpful-Harmless-Honest dataset is in the hundreds of thousands. The signal saturates more slowly than people expect because preference data has to be broad: it must cover not just helpfulness but truthfulness, harmlessness, formatting, register, refusal behaviour, and a hundred other implicit dimensions. A reward model trained on only a few thousand examples is too easy to fool.

The PPO-based optimisation

Once the reward model is in hand, stage three runs Proximal Policy Optimisation (PPO; Schulman et al., 2017). PPO was designed for video-game reinforcement learning, not language, but the abstraction transfers cleanly. The language model is the policy; each token it emits is an action; the prompt and the tokens generated so far are the state. The reward arrives mostly at the final token, when the response is complete and we ask the reward model to score it. The KL penalty against the reference is added on a per-token basis as a small running cost.

The full per-token reward used in practice is

$$ r_t = \mathbb{1}[t = T] \cdot r_\phi(x, y) - \beta \, \log\frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})}, $$

where the first term contributes only at the end of the sequence and the second contributes at every step. This decomposition is convenient because it converts the sequence-level KL constraint into a per-token reward signal, which slots into standard RL machinery without modification.
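
A sketch of how that per-token reward vector might be assembled, assuming per-token log-probabilities of the sampled response under the policy and the frozen reference are already available as tensors:

```python
def per_token_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Terminal reward-model score plus a per-token KL penalty.

    rm_score:        scalar r_phi(x, y) for the completed response
    policy_logprobs: (T,) tensor of log pi_theta(y_t | x, y_<t)
    ref_logprobs:    (T,) the same under the frozen reference
    """
    rewards = -beta * (policy_logprobs - ref_logprobs)  # KL penalty at every step
    rewards[-1] = rewards[-1] + rm_score                 # RM score only at the final token
    return rewards
```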

PPO then optimises a clipped surrogate objective:

$$ \mathcal{L}_{\text{PPO}}(\theta) = \mathbb{E}_t\Bigl[\min\bigl(\rho_t(\theta)\, \hat{A}_t,\; \text{clip}\bigl(\rho_t(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon\bigr)\, \hat{A}_t\bigr)\Bigr], $$

where $\rho_t(\theta) = \pi_\theta(y_t \mid x, y_{<t}) \,/\, \pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t})$ is the importance ratio between the current policy and the rollout policy (written $\rho_t$ rather than $r_t$ to avoid a clash with the per-token reward above), $\hat{A}_t$ is the generalised advantage estimate produced by a separate value head, and $\varepsilon$, usually around $0.2$, bounds how far the policy can move on any one update. The clipping is what makes PPO proximal: it refuses to take the gradient seriously when it implies a large step, which keeps optimisation stable.
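
A minimal sketch of the surrogate, again assuming per-token log-probabilities and precomputed advantage estimates are already at hand; it returns the negated objective so that it can be handed to a gradient-descent optimiser.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, eps=0.2):
    """Clipped PPO surrogate, negated so it can be minimised.

    logprobs:     (T,) log pi_theta(y_t | ...) under the current policy
    old_logprobs: (T,) the same under the policy that generated the rollout
    advantages:   (T,) GAE advantage estimates from the value head
    """
    ratio = torch.exp(logprobs - old_logprobs)             # importance ratio rho_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the pessimistic (smaller) of the two terms, then negate for descent.
    return -torch.min(unclipped, clipped).mean()
```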

The KL leash deserves its own paragraph. The coefficient $\beta$ controls how aggressively the policy can drift from the reference. Set $\beta$ too small and the policy bolts: within a few hundred steps it discovers regions of response space where the reward model is wrong but generous, and it produces fluent gibberish that scores beautifully. Set $\beta$ too large and the policy barely moves; the run is a waste of compute. Standard practice is to start with $\beta \in [0.05, 0.2]$, monitor mean per-token KL during training, and target a final KL of perhaps 8–15 nats per response. An adaptive controller (Ziegler et al., 2019) closes the loop: if KL drifts above target, multiply $\beta$ by 1.5; if below, divide. This eliminates one hyperparameter at the cost of a little extra feedback in the training dynamics.
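
A sketch of the multiplicative controller just described, using the 1.5× rule from the text (Ziegler et al. use a smoother proportional update) and an illustrative target of 10 nats per response:

```python
class AdaptiveKLController:
    """Keeps the measured KL near a target by rescaling beta after each batch."""

    def __init__(self, beta=0.1, target_kl=10.0, factor=1.5):
        self.beta = beta            # current KL coefficient
        self.target_kl = target_kl  # desired nats per response
        self.factor = factor        # multiplicative step from the text

    def update(self, observed_kl: float) -> float:
        if observed_kl > self.target_kl:    # drifting too far: tighten the leash
            self.beta *= self.factor
        elif observed_kl < self.target_kl:  # barely moving: loosen it
            self.beta /= self.factor
        return self.beta
```

In practice one would add a dead band around the target, or use the proportional variant, so that $\beta$ does not oscillate on every batch.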

The engineering footprint of a PPO-RLHF run is hefty. Four models live in GPU memory simultaneously: the policy being trained, a frozen copy of the reference for KL, the reward model for terminal scoring, and a value model for advantage estimation. On a 70-billion-parameter base model, that is 280 billion parameters of weights, plus optimiser state for the policy and value model. This is one of the main reasons reward-free methods such as DPO (§15.6) have eaten into RLHF's territory.

Practical concerns

RLHF works in production but is well known to misbehave in characteristic ways. Reward hacking is the headline failure: the policy discovers prompts or response styles where the reward model is mistakenly generous, and exploits them ruthlessly. Classic symptoms include responses that begin with "Certainly!" or "I'd be happy to" regardless of whether the request makes sense, excessive hedging ("As an AI language model, I cannot..."), and lists of bullet points where prose would be clearer. Each of these correlates, on the training distribution, with higher reward; the policy learns the correlation and the prose suffers.

Mode collapse is a related pathology. Because PPO rewards consistently high scores, the policy tends to converge on a narrow distribution of safe replies. A base model that, given the prompt "Write a poem about a cat," might have produced any of a thousand stylistically distinct poems will, after RLHF, reliably produce an inoffensive AABB quatrain about a fluffy companion. Diversity has been sacrificed for reliability. For some applications (code generation, factual question answering) this is a feature; for others (creative writing) it is a serious cost, and several research groups now explicitly preserve diversity by mixing in a small entropy bonus.

Overoptimisation, sometimes called the Goodhart effect on the reward model, was diagnosed by Gao, Schulman, and Hilton in 2022. As PPO training proceeds, the reward model's score keeps rising, but human preference (measured against a held-out gold reward model) rises only briefly before plateauing and then falling. The policy is learning to please the reward model in ways that no longer correspond to genuinely better answers. The standard mitigation is early stopping based on a held-out preference set; the more sophisticated mitigation is to retrain the reward model periodically on fresh preference data sampled from the current policy, which is exactly what iterative RLHF does.

Cost rounds out the list. A full RLHF cycle requires (a) writing the SFT data, (b) running SFT, (c) collecting tens of thousands of comparisons, (d) training a reward model, (e) running PPO, and (f) evaluating the result, often by collecting yet more comparisons. The two human-labour passes, SFT data and preference data, dominate the bill. For Anthropic's and OpenAI's frontier models, the human-labour budget runs into seven and eight figures per release.

Worked example

Suppose the prompt is "Tell me about photosynthesis." The SFT model, sampled twice with non-zero temperature, produces two completions:

  • $y_w$: "Photosynthesis is the process by which green plants, algae, and cyanobacteria convert light energy, usually from the sun, into chemical energy stored in glucose. The reaction takes place in the chloroplasts and uses water from the soil and carbon dioxide from the air, releasing oxygen as a by-product. The overall reaction is $6\,\text{CO}_2 + 6\,\text{H}_2\text{O} \to \text{C}_6\text{H}_{12}\text{O}_6 + 6\,\text{O}_2$. It happens in two stages: the light-dependent reactions (in the thylakoid membranes) and the light-independent Calvin cycle (in the stroma)."

  • $y_l$: "Photosynthesis is when plants make food from sunlight."

A human annotator, presented with both, picks $y_w$. The pair $(x, y_w, y_l)$ goes into the reward modelling dataset. A few thousand comparisons later, the reward model has learned a general preference: longer, more substantive, more accurate, and better-structured replies score higher than terse generic ones, particularly on prompts that look factual.

In stage three, we now run PPO. Given the same prompt, the current policy samples a response, the reward model scores it, the KL penalty is computed token by token against the SFT reference, and the policy weights are nudged. After a few hundred PPO updates, the policy reliably produces replies that resemble $y_w$: they include the chloroplast detail, they mention the Calvin cycle, they format equations sensibly. The policy has not learnt new biology, it knew all of that from pretraining, but it has learnt to deploy what it knows in the format the reward model has rewarded.

Now watch the reward hacking. If the reward model has accidentally learned that responses containing chemical equations score very well, the policy may begin producing chemical equations even when asked unrelated questions ("What is the capital of France? The answer is $\text{C}_6\text{H}_{12}\text{O}_6$..."). If the reward model rewards length, the policy may produce 800-word answers to "What time is it?". The KL leash and early stopping are what prevent the worst of this; vigilant evaluation is what catches the rest.

Where RLHF appears in modern AI

RLHF, in some form, sits inside almost every commercial chatbot in 2026. ChatGPT (Ouyang et al., 2022) was the first product in which the procedure was applied at scale, and OpenAI continues to refine it through GPT-4 and the o-series. Claude uses RLHF for helpfulness alongside Anthropic's distinctive Constitutional AI approach for harmlessness; the latter is a reward-from-AI-feedback variant, discussed below. Llama 2 and Llama 3 chat models from Meta were trained with a multi-round RLHF pipeline involving rejection sampling and PPO, with Llama 3 incorporating DPO for some passes. Gemini at Google DeepMind uses RLHF together with newer self-generated reasoning traces. DeepSeek's instruct models, Mistral's chat variants, and the major Chinese frontier models, Qwen, Yi, GLM, all run some flavour of preference post-training.

The procedure has, in this sense, won. For roughly five years, "make the chatbot useful" has been a problem with a published solution, and the engineering investment behind that solution, preference annotation pipelines, reward model evaluation suites, PPO infrastructure capable of holding four 70B models in GPU memory, represents one of the larger applied-ML investments of the decade.

Variants

A full taxonomy of preference-training methods would fill a chapter; we mention three.

Constitutional AI (Bai et al., 2022) replaces some or all of the human harmlessness preferences with model-generated ones. The model is prompted with a constitution, a list of principles such as "do not encourage illegal activity", and asked to critique and rewrite its own outputs in light of those principles. The resulting (original, rewritten) pairs become preference data. This makes the harmlessness pipeline cheap and scalable, but transfers the burden onto the constitution itself.

RLAIF (Reinforcement Learning from AI Feedback) generalises Constitutional AI: the reward model is trained on AI-generated preferences rather than human ones. When the AI rater is competent, say, a frontier model rating outputs from a smaller model, RLAIF can match RLHF at a fraction of the cost. When it is not, the smaller model inherits the larger one's blind spots.

Iterative RLHF runs the whole pipeline more than once. After PPO has shifted the policy, the new policy is used to generate fresh comparison data, a fresh reward model is trained, and PPO is rerun. This is now standard practice at major labs, with some Llama 3 variants iterating six or more rounds. The cost is roughly linear in the number of rounds; the benefit, in head-to-head comparison, accumulates.

What you should take away

  1. A pretrained model is fluent but not helpful. RLHF is the post-training procedure that makes it answer questions usefully, refuse appropriately, and keep its tone polite.
  2. The pipeline has three stages: SFT, then reward modelling, then PPO. Each stage hands its output to the next, and a flaw at any stage compounds downstream.
  3. The reward model is a learned imitation of human taste, fitted via the Bradley–Terry log-likelihood on pairwise comparisons. Architecturally it is the SFT model with a scalar head.
  4. PPO maximises reward minus a KL penalty against the reference. The KL is the leash that prevents the policy from reward-hacking; choosing $\beta$ is the touchiest hyperparameter in the run.
  5. Reward hacking, mode collapse, and overoptimisation are the standard failure modes. All three are mitigated by careful $\beta$ tuning, held-out evaluation, and iterative refresh of the preference data, and all three motivate the reward-free methods of §15.6.
