Glossary

RLHF


Reinforcement Learning from Human Feedback (RLHF) is among the most influential techniques for aligning large language models with human values. A language model trained only on next-token prediction is not helpful, harmless, or honest by default: it is a statistical model of text, capable of generating continuations in any style, including toxic or misleading ones. RLHF fine-tunes the model to produce responses that humans actually prefer.

The pipeline has three stages. First, a pre-trained model that has already undergone supervised fine-tuning (SFT) generates pairs of responses to the same prompt, and human annotators rank them according to criteria like helpfulness, accuracy, and safety. Second, a reward model, typically a transformer similar in size to the language model, is trained on the preference data to predict human preference scores. Third, the language model is optimised to maximise the reward model's score using Proximal Policy Optimisation (PPO), with a KL divergence penalty that keeps it close to the SFT model's distribution and so limits reward hacking.

RLHF was the key ingredient in making systems like ChatGPT and Claude usable by the general public; the difference between a base model and an aligned model is dramatic. Direct Preference Optimisation (DPO) offers a simpler alternative that eliminates the explicit reward model and the instabilities of RL, training directly on preference data with a classification-like loss. Constitutional AI (CAI) reduces reliance on human labelling by having the model critique and revise its own outputs according to explicit principles. Despite its successes, RLHF faces challenges: reward hacking, specification gaming, the difficulty of representing diverse values, and the open problem of scalable oversight.

Mathematics

The reward model is trained from pairwise human preferences. Given prompts $x$ and pairs of responses $(y_w, y_l)$ where $y_w$ is preferred over $y_l$, the model $r_\phi(x, y)$ minimises the Bradley-Terry loss:

$$\mathcal{L}_R(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[\log \sigma\bigl(r_\phi(x, y_w) - r_\phi(x, y_l)\bigr)\right]$$

which is the negative log-likelihood of the observed preferences under the assumption $P(y_w \succ y_l | x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$.
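As a rough illustration of the loss above, the following minimal sketch evaluates it on a handful of scalar reward scores. It assumes the reward model has already produced a score for each response; the function names and the example scores are hypothetical, and a real reward model would compute this over batches of tensors.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def bradley_terry_loss(reward_pairs):
    """Mean negative log-likelihood over (r_chosen, r_rejected) score pairs."""
    losses = [-log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs]
    return sum(losses) / len(losses)

# Hypothetical reward-model scores for three preference pairs.
pairs = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
loss = bradley_terry_loss(pairs)
print(f"Bradley-Terry loss: {loss:.4f}")
```

A pair with equal scores contributes $\log 2$ to the loss, and the contribution shrinks towards zero as the preferred response's score pulls ahead of the rejected one.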

The policy $\pi_\theta(y | x)$ is then trained by reinforcement learning to maximise reward, regularised by a KL penalty against a reference policy $\pi_{\mathrm{ref}}$ (typically the SFT model):

$$\max_\theta \; \mathbb{E}_{x \sim \mathcal{D}}\!\left[\mathbb{E}_{y \sim \pi_\theta(\cdot | x)}\!\left[r_\phi(x, y)\right] - \beta \, D_{\mathrm{KL}}\!\bigl(\pi_\theta(\cdot | x) \,\|\, \pi_{\mathrm{ref}}(\cdot | x)\bigr)\right]$$

The KL regularisation prevents the policy drifting too far from the reference, mitigating reward hacking and preserving the language-modelling capability that the SFT model achieved.
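The trade-off can be made concrete with a toy sketch, assuming a single prompt with only three candidate responses so that expectations and the KL term can be computed exactly. All probabilities, rewards, and function names below are hypothetical. The sketch compares three policies: the reference itself, a near-greedy policy that chases the highest reward, and the exponentially tilted policy $\pi(y) \propto \pi_{\mathrm{ref}}(y)\exp(r(y)/\beta)$, which is the standard closed-form maximiser of this kind of objective.

```python
import math

def kl_divergence(p, q):
    """Exact KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def regularised_objective(policy, reference, rewards, beta):
    """Expected reward under `policy` minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Toy setting: one prompt, three candidate responses (values are made up).
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 3.0]
beta = 1.0

# Near-greedy policy: almost all mass on the highest-reward response.
greedy = [0.001, 0.001, 0.998]

# Exponentially tilted policy: reference reweighted by exp(reward / beta).
weights = [q * math.exp(r / beta) for q, r in zip(reference, rewards)]
optimal = [w / sum(weights) for w in weights]

j_ref = regularised_objective(reference, reference, rewards, beta)
j_greedy = regularised_objective(greedy, reference, rewards, beta)
j_opt = regularised_objective(optimal, reference, rewards, beta)
print(f"reference: {j_ref:.3f}  greedy: {j_greedy:.3f}  tilted: {j_opt:.3f}")
```

In this toy setting the tilted policy scores highest: the near-greedy policy earns more raw reward but pays a larger KL penalty for abandoning the reference distribution.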

PPO is the standard optimiser. It maximises a clipped surrogate objective that prevents overly large policy changes in a single update step:

$$\mathcal{L}_{\mathrm{PPO}}(\theta) = \mathbb{E}_t\!\left[\min\!\bigl(\rho_t(\theta) \hat A_t, \,\mathrm{clip}(\rho_t(\theta), 1 - \varepsilon, 1 + \varepsilon) \hat A_t\bigr)\right]$$

where $\rho_t(\theta) = \pi_\theta(y_t | x, y_{\lt t}) / \pi_{\theta_{\mathrm{old}}}(y_t | x, y_{\lt t})$ is the importance-sampling ratio for token $y_t$ given the prompt $x$ and the previously generated tokens $y_{\lt t}$, and $\hat A_t$ is an advantage estimate computed from the reward model and (typically) a learned value baseline.
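The clipping behaviour is easiest to see on individual tokens. The following is a minimal sketch of the surrogate, assuming per-token log-probabilities under the new and old policies and precomputed advantage estimates are already available; the function name and the default $\varepsilon = 0.2$ are illustrative choices rather than anything fixed by the method.

```python
import math

def ppo_clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean clipped surrogate over tokens (a quantity to be maximised)."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Importance ratio, computed from log-probabilities for stability.
        ratio = math.exp(lp_new - lp_old)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)

# A token whose probability rose by 50% under the new policy.
logp_old = [0.0]
logp_new = [math.log(1.5)]

gain_if_good = ppo_clipped_surrogate(logp_new, logp_old, [1.0])   # capped at 1 + eps
gain_if_bad = ppo_clipped_surrogate(logp_new, logp_old, [-1.0])   # not capped
print(gain_if_good, gain_if_bad)
```

With a positive advantage the objective stops rewarding the policy once the ratio exceeds $1 + \varepsilon$, so there is no incentive to push further in one step. With a negative advantage the unclipped term is the smaller one, so the full penalty for raising the probability of a bad token is kept.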

DPO (Direct Preference Optimisation; Rafailov et al., 2023) bypasses the explicit reward model and the RL loop entirely, recasting the optimum of the KL-regularised RLHF objective as a classification loss directly on preferences:

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w | x)}{\pi_{\mathrm{ref}}(y_w | x)} - \beta \log \frac{\pi_\theta(y_l | x)}{\pi_{\mathrm{ref}}(y_l | x)}\right)\right]$$
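A brief sketch of where this loss comes from, following the standard argument and assuming the Bradley-Terry preference model used for the reward model above. The symbols $\pi^*$ (the optimal policy) and $Z(x)$ (a normalising constant) are introduced here only for the derivation. For a fixed reward $r$, the KL-regularised objective is maximised by

$$\pi^*(y | x) = \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y | x) \exp\!\left(\tfrac{1}{\beta} r(x, y)\right), \qquad Z(x) = \sum_y \pi_{\mathrm{ref}}(y | x) \exp\!\left(\tfrac{1}{\beta} r(x, y)\right)$$

Solving for the reward gives

$$r(x, y) = \beta \log \frac{\pi^*(y | x)}{\pi_{\mathrm{ref}}(y | x)} + \beta \log Z(x)$$

Substituting this into the Bradley-Terry model, the $\beta \log Z(x)$ terms cancel because both responses share the same prompt:

$$P(y_w \succ y_l | x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w | x)}{\pi_{\mathrm{ref}}(y_w | x)} - \beta \log \frac{\pi^*(y_l | x)}{\pi_{\mathrm{ref}}(y_l | x)}\right)$$

Replacing $\pi^*$ with the trainable $\pi_\theta$ and taking the negative log-likelihood over the preference data yields $\mathcal{L}_{\mathrm{DPO}}$.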

Because the loss needs only log-probabilities from the policy and the frozen reference model, standard supervised-learning machinery suffices, and DPO often matches PPO-RLHF performance with substantially less engineering complexity.
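To show how little machinery the DPO loss needs, here is a minimal sketch that evaluates it from sequence-level log-probabilities. It assumes those log-probabilities (summed over the tokens of each response) have already been computed for both the policy and the reference model; the function names, tuple layout, example numbers, and the value $\beta = 0.1$ are all hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(batch, beta=0.1):
    """Mean DPO loss.

    Each item in `batch` is a tuple of sequence log-probabilities:
    (policy_chosen, policy_rejected, ref_chosen, ref_rejected).
    """
    losses = []
    for pol_w, pol_l, ref_w, ref_l in batch:
        # Implicit reward margin: beta times the difference of log-ratios.
        margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)

# Policy identical to the reference: the margin is zero for every pair.
at_init = dpo_loss([(-12.0, -15.0, -12.0, -15.0)])

# Policy has shifted probability towards the chosen response.
after_shift = dpo_loss([(-10.0, -18.0, -12.0, -15.0)])
print(at_init, after_shift)
```

At initialisation, when the policy equals the reference, every pair contributes $\log 2$. The loss falls as the policy raises the chosen response's log-ratio relative to the rejected one, which is the classification-like behaviour described above.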

Related terms: AI Alignment, Large Language Model
