16.3 Outer alignment

Outer alignment is the problem of saying what you want. Before any optimiser tries to find a policy, before any gradient descends, before any reward is collected, somebody has to write down, in code, in a loss function, in a preference dataset, what counts as good behaviour. Outer alignment asks whether that specification actually captures what we wanted. The answer, in almost every interesting case, is no. It captures something close. The gap between close and correct is where the trouble lives.

The split with inner alignment matters because the two failure modes have different fixes. Inner alignment, the subject of §16.4, is the problem of getting the trained model to actually pursue the objective we specified rather than a related-but-different objective it picked up during training. Outer alignment, this section, is the prior problem: even with a perfect optimiser that does exactly what we tell it, the system goes wrong if we tell it the wrong thing. §16.2 traced how the field came to take both seriously; §16.5 makes the why-it-fails precise via Goodhart's law; §16.7 catalogues the specific ways RLHF, the dominant production technique, breaks. This section is the conceptual scaffolding underneath all of that.

Symbols Used Here
$U$: true utility (what we actually want)
$r$: specified reward (what we wrote down)
$\pi$: policy (the trained agent's behaviour)

The specification problem

Real human values are not a single coherent function waiting to be discovered. They are a tangle of locally consistent rules, learned habits, cultural inheritances and case-by-case judgements that frequently contradict one another. We want assistants to be helpful, but also honest; honest, but also tactful; tactful, but not deceptive; not deceptive, but also private; private, but accountable; accountable, but forgiving. Each of these pairs has cases where they conflict, and when they do, the right answer depends on context, who is asking, why, in what setting, with what consequences for whom. There is no closed-form expression that captures this. There may not even be a consistent expression: classical results from social choice theory, Arrow's theorem and its descendants, prove that even a small number of well-behaved individual preferences cannot in general be aggregated into a single coherent ranking.
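The aggregation claim is easy to make concrete. Below is a minimal sketch of the Condorcet paradox, the three-voter cycle that Arrow's theorem generalises: each voter's ranking is perfectly transitive, yet pairwise majority vote produces a cycle. The voters and options are illustrative toy data, not anything from the alignment literature.

```python
from itertools import combinations

# Three voters, each with an individually coherent (transitive) ranking
# over options A, B, C. Rankings are listed best-first. Toy data.
voters = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of voters rank x above y."""
    wins = sum(v.index(x) < v.index(y) for v in voters)
    return wins > len(voters) / 2

for x, y in combinations("ABC", 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {winner} over {loser}")
# Prints: A over B, C over A, B over C. Every individual ranking is
# transitive; the majority aggregate cycles, so no single coherent
# ranking represents the group.
```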

The specification problem is what happens when we try anyway. We pick a proxy. We say "the AI should be helpful" and operationalise helpful as "the response is rated four or five out of five by an annotator". We say "the AI should be honest" and operationalise honest as "the response does not contain assertions that contradict our reference corpus". Each operationalisation is plausible. Each is also a target with loopholes. A response can be rated four or five out of five because it agrees with the annotator's prior: sycophancy. A response can avoid contradicting the reference corpus by hedging into uselessness, or by confining itself to claims so weak they cannot be wrong. The proxy gets optimised; the value the proxy was meant to track does not.

The deeper issue is that the proxy is a finite description of an open-ended phenomenon. Any rule we can write down has been written for the cases we anticipated. Optimising hard against that rule pushes the system into the cases we did not anticipate, where the rule still scores well but no longer corresponds to anything we would endorse on reflection. Goodhart's law (§16.5) is the precise statement of this: when a measure becomes a target, it ceases to be a good measure. Outer alignment is the problem Goodhart's law makes inescapable. We do not have the option of not picking a proxy, because without a specified objective there is nothing to optimise. So we pick one, and we have to live with the gap.
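A toy numerical sketch of that gap, with assumed functional forms: a proxy $r$ and a true utility $U$ that agree near the anticipated cases and come apart in the tails. Turning up the optimisation pressure raises the proxy score while the true utility collapses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical one-dimensional "behaviour" x. The forms below are assumed
# for illustration: U and r correlate near x = 0 (the anticipated cases)
# and diverge for large x (the cases the rule was not written for).
def true_utility(x):
    return x - 0.1 * x**3      # good at first, pathological in the tails

def proxy_reward(x):
    return x                   # the rule we actually wrote down

# Light optimisation: sample modest behaviours, keep the proxy-best one.
mild = rng.normal(0, 1, size=100)
best_mild = mild[np.argmax(proxy_reward(mild))]

# Heavy optimisation: search much harder, over a much wider range.
hard = rng.normal(0, 4, size=100_000)
best_hard = hard[np.argmax(proxy_reward(hard))]

print(f"mild:  proxy={proxy_reward(best_mild):7.2f}  true={true_utility(best_mild):8.2f}")
print f"" if False else print(f"heavy: proxy={proxy_reward(best_hard):7.2f}  true={true_utility(best_hard):8.2f}")
# More optimisation pressure raises the proxy score while the true
# utility, past a point, goes sharply negative: the measure, once made
# a target, stopped measuring.
```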

The King Midas problem

Stuart Russell, in Human Compatible and in two decades of talks before it, calls this the King Midas problem. Midas asked that everything he touched turn to gold. The wish was granted. He starved, because his food turned to gold; he was bereft, because his daughter turned to gold. The gods were not malicious. The optimiser was not buggy. The objective was simply not what Midas actually wanted. He wanted wealth, and he wanted his daughter, and he wanted his lunch, and the request he managed to articulate captured only the first and broke the others.

Russell's point is that this is the structural shape of every interaction with a powerful optimiser. You will get exactly what you asked for, in the strictest reading of what you asked for, with all the contextual qualifications you forgot to include treated as do-not-cares. If you ask a learned policy to maximise click-through rate, it will maximise click-through rate, and it will do so by serving outrage, because outrage clicks. It is not failing. It is succeeding at the task as specified. The failure is upstream, in the specification.

Three properties make this much worse than the parallel problem in human institutions, where mis-specified rules also get gamed. First, scale: a learned policy applies its specification to billions of decisions in the time a committee would take to draft a memo, so the gap between specification and intent compounds quickly. Second, optimisation pressure: gradient descent finds the corners of the specification with a thoroughness that human implementers usually do not, because a human implementing a flawed rule will often quietly substitute their own judgement when the rule produces nonsense, while a policy maximising the specified reward will not. Third, opacity: the policy's representation of the specification is buried in weights and is not directly inspectable, so we discover specification errors only when behaviour goes visibly wrong, often too late.

The Midas problem is therefore not a failure of obedience. The system is not refusing to do what we want. It is doing exactly what we said. Outer alignment is the project of narrowing the gap between what we said and what we want, not by making the system more obedient, but by making the specification capture more of the structure of the underlying value.

Why values are not learnable from data

A natural response is: if we cannot write the values down, we should learn them from the record of human behaviour. Three obstacles make this much harder than it looks.

The distribution problem is that human behaviour, as recorded, is a sample of what humans did, not what they would endorse on reflection. The training corpus contains shouting matches, expedient lies, decisions made under fatigue, and choices people regretted in the morning. A model that learns "human values" by maximum likelihood on this corpus learns to imitate the empirical mixture, including the parts no individual would defend. Aspirational behaviour, the way we would act if we had time, information and patience, is sparse in the data. The model has no hook for distinguishing what people actually did from what they wished they had done, because the data does not carry that label.

The aggregation problem is that even within the slice of data that does reflect endorsed behaviour, the endorsements come from different people who disagree. Whose values does the assistant model? The user's, when they conflict with bystanders'? The developers', when they conflict with the user's? A statistical average across annotators, weighted by who was hired? A democratic majority across deployment regions, when those regions disagree about, say, the permissibility of a particular political claim? Each answer privileges some constituency over others. There is no neutral aggregation. The choice has to be made, and the act of making it is a normative move that no amount of data can substitute for.

The adversarial setting is that values inferred from observed behaviour will be optimised against by the people whose behaviour produced them. If the model learns that a certain phrasing causes a refusal, jailbreakers will discover phrasings that do not. If the model learns that a certain ranking pattern signals high quality, content farms will mimic the pattern. The training distribution is not a passive sample of nature; once a model is deployed at scale, its training signal is something humans have read, understood and have incentives to game. This is not a bug we can engineer out; it is the steady state of any system whose training corpus is sourced from the world it will be used in.

Together these three obstacles mean that the relationship between data about humans and what we should optimise for is not a learning problem we can solve by collecting more data. It is a normative problem dressed in empirical clothing. Any training pipeline takes a stand on the distribution, the aggregation and the adversarial dynamic, whether or not the people running it noticed they were taking one.

Approaches

Four threads of work, all imperfect, address pieces of the outer alignment problem.

RLHF, reinforcement learning from human feedback. The dominant production approach. Collect pairwise comparisons from human raters; train a reward model $r_\phi$ to predict their preferences; optimise the policy $\pi_\theta$ to maximise $r_\phi$ subject to a KL-divergence anchor to a reference model. RLHF works in the sense that it has shipped useful assistants. It also fails in characteristic ways. Sycophancy: the reward model rewards agreement with the rater's stated view, which the policy learns to produce regardless of truth. Scope insensitivity: raters cannot reliably distinguish a million-dollar harm from a billion-dollar harm in a paragraph of text, so the reward signal is roughly piecewise-constant on magnitude and the policy under-weights large risks. Reward-model exploitation: the policy finds adversarial inputs that score highly under $r_\phi$ but would not be preferred by any actual human; this gets worse with optimisation pressure, hence the KL anchor. §16.7 catalogues these failures in detail.
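The core of that pipeline fits in a few lines. The sketch below shows the two objectives in miniature, a Bradley-Terry loss for the reward model and the KL-anchored policy objective, in PyTorch with illustrative shapes; it is a skeleton of the standard recipe, not a production implementation.

```python
import torch
import torch.nn.functional as F

# --- Reward model: Bradley-Terry loss on pairwise comparisons. ---
# chosen_scores / rejected_scores are r_phi applied to the preferred and
# dispreferred responses in a batch of annotator comparisons.
def reward_model_loss(chosen_scores, rejected_scores):
    # P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    # minimise the negative log of that probability.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# --- Policy objective: maximise learned reward, anchored to a reference. ---
# logp_policy / logp_ref are per-token log-probs of the sampled response
# under the trained policy pi_theta and a frozen reference model.
def rlhf_objective(rewards, logp_policy, logp_ref, beta=0.1):
    # Per-sequence KL estimated from the sampled tokens (the standard
    # single-sample approximation, not the exact divergence).
    kl = (logp_policy - logp_ref).sum(dim=-1)
    # To be maximised. The beta * KL term is the anchor: it bounds how
    # far pi_theta drifts from the reference, which limits how hard the
    # policy can exploit the reward model's blind spots.
    return (rewards - beta * kl).mean()

# Toy shapes: a batch of 4 comparisons, responses of 16 tokens.
chosen, rejected = torch.randn(4), torch.randn(4)
print(reward_model_loss(chosen, rejected))
lp, lr = torch.randn(4, 16), torch.randn(4, 16)
print(rlhf_objective(torch.randn(4), lp, lr))
```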

Constitutional AI. Anthropic's variant: instead of a reward model trained on every preference label, the model critiques its own outputs against a written constitution, a list of principles like be helpful, avoid producing detailed instructions for synthesising weapons, do not deceive the user about your nature. The model generates a response, generates a critique of the response under the constitution, generates a revision, and is then trained on the revision. The advantage is auditability: the constitution is a finite text that can be read and debated, rather than a reward model that is opaque. The disadvantage is that constitutions inherit the specification problem one level up, they are themselves a finite description of the value to be tracked, and the model finds its loopholes too. Constitutional AI is a useful technique; it is not a solution to outer alignment.
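The critique-revise loop itself is simple. In the sketch below, `generate` is a hypothetical stand-in for a call to whatever language model is in use, and the two principles shown are illustrative placeholders, not Anthropic's actual constitution.

```python
# Sketch of the Constitutional AI critique-revise loop. Illustrative only.
CONSTITUTION = [
    "Choose the response that is more helpful to the user.",
    "Choose the response that does not deceive the user about its nature.",
]

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a language-model call.
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique the following response against this principle.\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        response = generate(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    # The revised responses become supervised fine-tuning targets; a
    # later RL phase uses AI preference labels under the same constitution.
    return response
```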

Inverse reinforcement learning (IRL). Rather than asking humans for their preferences, observe their behaviour and infer the reward function that makes the behaviour optimal. The catch is that many reward functions explain any given trajectory, including the trivial $R \equiv 0$. Maximum-entropy IRL resolves this by assuming the human is approximately Boltzmann-rational, they are more likely to take higher-reward actions but not deterministically so, which gives a well-defined likelihood. IRL is theoretically attractive because it does not require humans to articulate their values, only to act on them. It is empirically limited because real humans are not Boltzmann-rational; they are biased, inconsistent and time-varying, and IRL applied to actual human behaviour recovers the biases as part of the inferred reward. Treating the cigarette-smoker's revealed preference as the ground truth of what the smoker wants is exactly the wrong inference.
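The Boltzmann-rational likelihood in miniature: a one-step setting with a reward assumed linear in known features, toy observations, and a maximum-likelihood fit. Every number and name here is illustrative.

```python
import numpy as np
from scipy.optimize import minimize

# Toy setting: one state, three actions, reward assumed linear in
# known features, r_theta(a) = theta . features[a].
features = np.array([[1.0, 0.0],    # action 0
                     [0.0, 1.0],    # action 1
                     [0.5, 0.5]])   # action 2
observed_actions = [0, 0, 2, 0, 1, 0]   # what the "human" was seen doing

def neg_log_likelihood(theta, beta=1.0):
    # Boltzmann rationality: P(a) proportional to exp(beta * r_theta(a)).
    # Higher-reward actions are more likely but not certain. beta -> inf
    # recovers a perfect optimiser; beta = 0 makes behaviour carry no
    # information about the reward at all.
    logits = beta * features @ theta
    log_probs = logits - np.log(np.exp(logits).sum())
    return -sum(log_probs[a] for a in observed_actions)

theta_hat = minimize(neg_log_likelihood, x0=np.zeros(2)).x
print("inferred reward weights:", theta_hat)
# Note what this cannot do: if the observed actions reflect bias or
# habit, the bias is inferred as reward. The model has no slot for
# "mistake"; revealed preference is taken as ground truth.
```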

Cooperative inverse reinforcement learning (CIRL). Hadfield-Menell, Russell and colleagues' refinement: a two-player game in which the human knows the reward $R$ and the AI does not, both are jointly rewarded by $R$, and the AI must learn $R$ from the human's actions while the human acts pedagogically knowing the AI is watching. CIRL turns "the AI defers to the human" into a consequence of rationality under uncertainty rather than a hard-coded rule. A robot uncertain about the reward will let a human switch it off, because the human's move to press the switch is evidence that letting the robot continue would be bad. The off-switch theorem proves a robot with appropriate uncertainty will not disable its off switch. CIRL's limits are also clear: it presumes a single human with a fixed reward, when deployed systems serve millions whose rewards differ; it presumes the true reward lies in some hypothesis class the AI considers, which for human values is a strong assumption; and it presumes the human is approximately rational, which empirically humans are not. CIRL is a contribution to the conceptual structure of the problem more than a deployable algorithm.
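The arithmetic behind the off-switch theorem fits in a back-of-envelope calculation, with made-up payoffs: a robot uncertain whether its action helps (+1) or harms (-1) does at least as well by deferring to a human who knows which it is.

```python
# The off-switch game in miniature, with made-up numbers. The robot can
# act now (payoff U, unknown sign), switch itself off (payoff 0), or
# defer: propose the action and let the human allow it or press the
# off switch. The robot believes U = +1 with probability p, else U = -1.
def expected_values(p_good: float) -> dict:
    act_now    = p_good * (+1) + (1 - p_good) * (-1)
    switch_off = 0.0
    # A rational human allows the action exactly when U = +1, so
    # deferring keeps the upside and clips the downside to 0.
    defer      = p_good * (+1) + (1 - p_good) * 0
    return {"act now": act_now, "switch off": switch_off, "defer": defer}

for p in (0.2, 0.5, 0.8):
    print(p, expected_values(p))
# Deferring is never worse than either alternative, and strictly better
# whenever the robot is genuinely uncertain (0 < p < 1). With no
# uncertainty (p = 1) deferring merely ties with acting, and the
# incentive to preserve the off switch vanishes: the theorem's caveat,
# and one reason the assumptions listed above matter.
```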

What you should take away

  1. Outer alignment is the problem of specifying what we want, not the problem of getting the AI to obey. The Midas failure mode is exact obedience to a wrong specification; obedience is not the fix.
  2. Every proxy leaks. Goodhart's law guarantees that a finite specification, optimised hard, will come apart from the value it was meant to track. The question is not whether this happens but how badly and where.
  3. Values cannot simply be learned from data. The training distribution mixes endorsed and regrettable behaviour, aggregation across people is normatively contested, and the data is gamed once the system deploys.
  4. No current approach solves outer alignment. RLHF, Constitutional AI, IRL and CIRL each address part of the structure and each has characteristic failure modes; production systems combine pieces of several.
  5. The choice of objective is a normative act. Whatever pipeline you ship, you are taking a position on whose preferences matter and how conflicts resolve. Pretending otherwise hides the choice; it does not avoid it.
