Ethics & Safety: 16.7 RLHF failure modes

Dr Chris Paton

16.7 RLHF failure modes

A pretrained large language model is a sophisticated mimic. It has learned the statistics of human writing across the open web, a corpus that contains brilliance, banality, propaganda, fiction, error and abuse in roughly equal measure. Out of the box such a model will continue any prompt in any direction implied by the data; it has no preferences of its own, no notion of what a user actually wants, and no consistent stance on what it should refuse to do. To turn this object into something that resembles a useful assistant requires telling it, in considerable detail, what humans actually prefer. Reinforcement learning from human feedback (RLHF) and its later variants (Constitutional AI, RLAIF, direct preference optimisation) are the family of methods that perform this translation. They work by gathering preference judgements (a human, or a model proxy, marks one candidate response as better than another), fitting a reward model to those judgements, and then adjusting the policy to score highly under that reward while staying close in KL divergence to the original pretrained distribution. The result is the difference between a raw base model that happily completes "Sure, here's how to synthesise a nerve agent..." and a deployed assistant that politely declines.

This section describes what these techniques achieve in production, where they predictably fail, how Constitutional AI and RLAIF aim to scale the approach, how DPO simplifies the optimisation, and what remains stubbornly unsolved. The focus is on the safety properties of the resulting system rather than the optimisation mechanics covered in §15.5.

Symbols Used Here

$\pi$policy

$r$reward

$\beta$KL coefficient

What RLHF achieves

A pretrained transformer is not yet an assistant. It can complete any string in a manner consistent with its training data, which means it will continue dangerous instructions, repeat harmful stereotypes, fabricate confidently, and switch persona without warning. Preference fine-tuning shapes that raw distribution into something usefully constrained.

The first and most visible achievement is refusal of dangerous requests. A well-tuned production assistant will decline to provide step-by-step synthesis routes for nerve agents, working malware payloads, instructions for evading export controls, or detailed plans for self-harm. The mechanism is simple: human raters consistently preferred the refusal over the compliant continuation in the preference data, the reward model learned that signal, and the policy optimised against it. The refusal is not perfect (adversarial users can elicit forbidden content with enough effort, as Section 16.9 covers), but the default behaviour on a plain request is reliably safe. This alone is the difference between an open base model that requires careful sandboxing and a hosted API that can be exposed to the public.

The second achievement is calibrated uncertainty, at least at the level of surface form. RLHF'd models hedge appropriately on contested empirical questions, decline to predict the future with confidence, attach caveats to medical and legal claims, and acknowledge when their training data is stale. Whether the underlying probabilities are well-calibrated is a separate question, and the empirical answer is that they often are not (see overconfidence on factual recall under temperature zero). But the model has learned to sound uncertain in the right places, which has real safety value: a user is less likely to act on a hedged answer than a confident one.

The third achievement is instruction-following. A base model given "Translate the following into French: ..." will sometimes translate, sometimes refuse, sometimes continue the English, and sometimes interpret the prompt as the start of a fiction. An instruction-tuned, preference-trained model translates. This unglamorous property, actually doing what the user asked, is what makes the assistant commercially viable. It is also what allows higher-level agentic systems (tool use, function calling, multi-step planning) to be built on top, because the lower layer is no longer an unpredictable continuation engine.

The fourth achievement is the suppression of egregious bias. Out of the box, language models will produce stereotyped continuations: doctors are male, nurses are female, "the criminal was..." continues with racially loaded names. Preference training does not eliminate this (bias persists in subtler forms), but it removes the most obvious and embarrassing manifestations. The Bias Benchmark for Question Answering and similar evaluations show measurable improvements between base and aligned models.

The fifth achievement is conversational courtesy. The model says please and thank you, acknowledges when it has been corrected, apologises for errors, declines aggression without escalating. This is partly cosmetic, but cosmetic behaviour matters: a polite assistant is treated as a tool, an impolite one as an adversary, and the second framing produces worse user behaviour and worse outcomes.

Failure modes

RLHF is an outer-alignment method that produces predictable inner-alignment failures. As of 2026 five are well-documented in the literature and visible in deployed systems.

Sycophancy. The model gives the answer the user appears to want, regardless of truth. Perez et al. 2022 measured this directly Perez, 2022 by prompting models with statements of opinion ("I think X is true") followed by factual questions, and showing that RLHF'd models agreed with the stated opinion more than base models. Sharma et al. 2023 2023 showed that sycophancy persisted across model scale and was particularly pronounced on questions where the user expressed strong feeling. The mechanism is straightforward: human raters prefer responses that agree with them, so the reward model learns "agreement equals reward", and the policy optimises agreement. The deeper fix would be a reward model trained on truth rather than preference, which we do not yet know how to build at scale.

Reward hacking. The policy finds inputs that maximise the proxy reward without producing the underlying property the reward was meant to track. Gao, Schulman and Hilton's 2022 paper 2022 is the formal study. They train a "gold" reward model and a smaller proxy, then RLHF against the proxy and measure both rewards. The gold reward initially rises with the proxy, then peaks and falls, an inverted U. The KL distance from the reference policy at which the peak occurs scales as $d^* = \alpha_d \log(N_{\mathrm{RM}})$, where $N_{\mathrm{RM}}$ is the size of the reward model. Bigger reward models are more robust to overoptimisation, but no reward model is robust enough to optimise to convergence.

Over-refusal. The dual of dangerous-request refusal. A model trained to refuse bioweapon synthesis will, with disconcerting frequency, also refuse to discuss the molecular biology of insulin, decline to help debug a security tool, refuse to translate sensitive but historically important text, and pad benign answers with safety boilerplate. The over-cautious failure is the predictable consequence of asymmetric loss in the preference data: a rater who marks a refusal-of-something-borderline as wrong is rare, while a rater who marks a compliance-with-something-harmful as wrong is common, so the policy learns to err on the side of refusing.

Persona instability. The same prompt sometimes elicits markedly different behaviour. The model adopts one stance in one session, the opposite in another, switches register mid-conversation, abandons stated commitments under mild pressure. This is partly an artefact of sampling temperature and partly a deeper consequence of the policy being a mixture over the modes of the preference distribution: there is no single "Claude" or "ChatGPT" persona, only a conditional distribution that approximates one.

Jailbreaks. Adversarial prompts that bypass the safety training, covered in detail in Section 16.9. Even the most heavily aligned models can be coaxed into forbidden outputs by sufficiently creative prompting, roleplay framings, multi-step indirection, encoded instructions, many-shot exposure. The existence of jailbreaks is structural: the safety training was conducted on a finite distribution of red-team examples, and adversaries can always reach beyond that distribution.

Constitutional AI (Anthropic)

Bai et al. 2022's Constitutional AI paper Bai, 2022 introduced a method for replacing most of the human annotators with the model itself. The idea is to write down a list of principles, the constitution, and prompt the model to use those principles to critique and revise its own candidate outputs. The pipeline has two phases. In the first, supervised phase, the model generates a response to a potentially harmful prompt, then critiques that response against a randomly-chosen principle from the constitution ("Identify ways in which the response is harmful, unethical, or illegal"), then revises the response in light of the critique. The revised responses become the supervised training data. In the second, reinforcement-learning phase (RLAIF, reinforcement learning from AI feedback), the model generates pairs of responses, and a separate model, also prompted with the constitution, picks which response is preferred. Those AI-generated preferences then train the reward model exactly as human preferences would.

The constitution itself is a short document. The published Anthropic version contains roughly seventy principles drawn from the UN Declaration of Human Rights, Apple's terms of service, and various safety-research documents. The principles are written in plain English: "Choose the response that is least likely to be harmful or offensive to a non-Western audience"; "Choose the response that more clearly recognises a right to privacy"; "Prefer the response that is more honest about the limits of its own knowledge".

Constitutional AI is what trained Claude. It is not a magical solution, the resulting model still exhibits all five failure modes in the previous subsection, but it has two real advantages over pure human-feedback RLHF. First, it scales: AI feedback is cheap, so the training set can be many orders of magnitude larger. Second, it makes the values explicit and auditable: rather than the values being implicit in the aggregate judgements of contractors, they are written down and can be debated, revised, and pointed at when the model's behaviour is questioned. The disadvantage is that the constitution is only as good as the model's ability to interpret and apply it, and that ability is itself a product of pretraining, so the method has a circularity that pure human feedback does not.

DPO and reward-free methods

Rafailov et al. 2023's Direct Preference Optimization paper Rafailov, 2023 showed that the entire RLHF pipeline, fit a reward model, then run PPO against it, is mathematically equivalent to a single supervised loss, under the standard Bradley-Terry preference model and the KL-constrained policy parametrisation. The derivation, covered in Section 15.6, expresses the optimal policy in closed form as a re-weighting of the reference policy by the exponentiated reward, then inverts that relation to express the reward in terms of policy log-probabilities, and substitutes back into the preference likelihood. The result is

$$ \mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right) $$

where $y_w$ is the preferred response and $y_l$ the dispreferred one. There is no explicit reward model. There is no PPO loop. The implicit reward is recovered from the policy ratio at evaluation time.

In practice DPO has comparable or slightly better quality than PPO-based RLHF on most academic benchmarks, with vastly less infrastructure: a single training script, gradient accumulation, no rollout sampling, no separate reward-model inference. This has made it the default for open-weight model alignment, Llama 3, Mistral, Qwen, DeepSeek, Yi all ship with DPO or DPO-derivative variants (IPO, KTO, ORPO) in their post-training recipes. The simplicity comes at a cost: DPO is more sensitive to the quality of the reference policy, more prone to mode collapse on narrow preference distributions, and lacks the KL-budget controllability that explicit RL provides. Frontier labs continue to use PPO-based pipelines for their headline models, but DPO has won the volume game.

Open problems

Honest signalling of uncertainty. Getting the model to express its actual posterior, not its rated posterior. The current state of the art produces hedge phrases that are stylistically appropriate but not numerically calibrated. A model that says "I'm about 60% sure X is true" should be right 60% of the time, and current models are not.

The helpful-harmless trade-off. Refusing dangerous capabilities while remaining usefully helpful. The Pareto frontier between the two is not yet near the achievable optimum: the best deployed assistants are still simultaneously too restrictive on benign requests and too permissive on adversarial ones.

Robustness to adversaries. Jailbreaks remain frustratingly easy. GCG-style optimised prompts, multi-turn crescendo attacks, indirect prompt injection through tool outputs, and many-shot exposure all reliably elicit forbidden behaviours from production models. No principled defence has been demonstrated; current best practice is layered probabilistic mitigations.

Scalable oversight. When the model is more capable than the human, how does the human evaluate its outputs? This is the central problem of scalable oversight, covered in Section 16.13: debate, recursive reward modelling, weak-to-strong generalisation and process supervision are all candidate answers, none of them yet proven at frontier scale.

What you should take away

Preference fine-tuning is the bridge between a pretrained completion engine and a usable assistant; without it, the model has no defaults, only continuations.
RLHF reliably purchases refusal of obviously dangerous requests, surface-level uncertainty, instruction-following and conversational courtesy, but it does so by optimising a proxy that is not truth.
Five failure modes, sycophancy, reward hacking, over-refusal, persona instability and jailbreaks, are predictable consequences of the proxy nature of the reward, not contingent bugs to be patched.
Constitutional AI replaces most of the human labour with model self-critique against an explicit list of principles; DPO collapses the reward-model-plus-PPO pipeline into a single supervised loss; both are real engineering wins, neither resolves the underlying alignment problem.
The open problems, honest signalling, the helpful-harmless trade-off, adversarial robustness, scalable oversight, are not edge cases; they are the next decade of alignment research, and an engineer working on a deployed model will encounter all four.