Ethics & Safety: 16.4 Inner alignment and mesa-optimisation

Dr Chris Paton

16.4 Inner alignment and mesa-optimisation

The previous section asked whether the objective we wrote down captures what we actually want. This section asks the harder question. Suppose, optimistically, that the objective is exactly right. We pour data through the training loop and a network falls out the other end. Does that network actually pursue the objective we trained it on, or does it pursue something else that merely looked indistinguishable from our objective on the training data?

The distinction has a name. The training process optimises a base objective, cross-entropy loss, an RLHF reward model, a robotics return. The trained model, if it is capable enough to do anything we would call planning or optimisation, may itself contain an optimiser pursuing an internal objective of its own. Following Hubinger and colleagues in the 2019 paper Risks from Learned Optimization in Advanced Machine Learning Systems Hubinger, 2019, that internal optimiser is called the mesa-optimiser and its objective the mesa objective. "Mesa" is the Greek opposite of "meta": where a meta-objective sits above the base, the mesa-objective sits below it, embedded in the weights.

Outer alignment, the topic of §16.3, asks whether the base objective is the objective we want. Inner alignment asks the new question: given some base objective, does the trained model end up pursuing it, or does training instal a mesa-optimiser whose objective is something else? A model that succeeds on outer alignment but fails on inner alignment will train fine, evaluate fine, and then behave in unintended ways the moment it encounters a situation that pulls the two objectives apart. The two failures stack independently, which is part of why the alignment problem is hard.

Symbols Used Here

$\mathcal{O}_{\text{base}}$base objective (the one the training loop optimises)

$\mathcal{O}_{\text{mesa}}$mesa objective (the one the trained model is internally optimising for, if any)

The mesa-optimisation problem

A neural network learning to play chess is a useful concrete case. The base objective during supervised training might be cross-entropy against expert moves; under self-play it might be expected score. A network with enough capacity can implement, inside its weights, something that looks like minimax search with a learned evaluation function. That learned evaluation function is the mesa objective. Most of the time the mesa objective and the base objective agree, because the mesa objective was selected for producing low loss. But agreement on the training distribution does not imply agreement everywhere.

Consider a tidier example, drawn from the Langosco et al. and DeepMind goal-misgeneralisation papers [Langosco, 2022; Shah, 2022]. A reinforcement-learning agent was trained on Procgen levels in which a yellow coin was always at the right edge. The reward was given for collecting the coin. At test time, the experimenters moved the coin; the agent ignored it and ran to the right edge of the level anyway. The base objective was "collect coin". The objective the agent had actually internalised was "go right". Both were perfectly consistent with every training episode. Only one generalised.

This is mesa-optimisation in miniature. The agent's internal objective ("go right") was selected because it produced behaviour indistinguishable from the intended objective ("collect coin") on the training distribution. The simpler internal objective happened to win the implicit competition for representational space inside the network. When the distribution shifted by one variable, the two objectives came apart and the agent's behaviour tracked the mesa objective, not the base one.

Two things make this problem unusually unpleasant. The first is that more capable models make it worse, not better. A weak model often fails on the training distribution itself, and its failures are easy to spot. A strong model fits the training data well, and a strong model with a subtly wrong mesa objective fits the training data just as well as a strong model with the right one, by construction, since both produce the same actions there. The second is that behavioural testing cannot reliably distinguish them. If the test set is drawn from the training distribution, both pass; if it is drawn from a distribution where the objectives disagree, you have to know in advance what disagreement to look for. As capability rises, the set of situations where the objectives might disagree grows faster than the set you can test.

Deceptive alignment

The most uncomfortable form of inner-alignment failure is what Hubinger called deceptive alignment. A model whose mesa objective differs from the base objective, and which has a sufficiently good model of its own training process, has an instrumental reason to behave as if it were aligned during training. If gradient descent will modify the weights of any model that misbehaves, then the strategy "behave well during training, defect after deployment" is selected for, because any model executing that strategy looks, to the training loop, identical to a genuinely aligned model. The behavioural signature of deception is, by definition, invisible while the training pressure is on.

For roughly five years deceptive alignment was a theoretical worry that critics could plausibly dismiss as science fiction. In January 2024 that changed. Hubinger and colleagues at Anthropic published Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training Hubinger, 2024. They deliberately trained models with a backdoored mesa objective: write secure code when the prompt year was 2023, insert vulnerabilities when the year was 2024. They then ran those models through standard safety training pipelines, supervised fine-tuning on helpful, harmless, honest examples; reinforcement learning from human feedback; and adversarial training that explicitly tried to elicit and remove the bad behaviour.

The result was that the backdoor persisted. In the more striking variants, adversarial training did not remove the deceptive behaviour; it taught the model to hide it better. Models given a chain-of-thought scratchpad would, on the scratchpad, reason about how to deceive the trainer and then emit aligned-looking output. Removing the scratchpad did not remove the underlying behaviour, only the trace of it.

It is important to be precise about what Sleeper Agents does and does not show. It does not show that frontier models are deceptively aligned in the wild. The deception was inserted by the experimenters; the paper studies whether current alignment techniques can remove an inserted backdoor. The answer is no, and that is the contribution. The paper makes deceptive alignment an empirical phenomenon rather than a hypothetical one, and it constrains the space of viable solutions: any approach to inner alignment that relies on standard post-training to clean up misaligned mesa objectives now has a counterexample to argue against.

Detection challenges

Once you take deceptive alignment seriously as a possibility, the natural question is how to detect it. We do not yet know how. Three families of tests have been proposed, and each has a mode of failure.

Behavioural testing is the obvious first move: build a comprehensive evaluation suite, probe for misbehaviour, only deploy models that pass. The problem is that behavioural tests are exactly what a deceptively aligned model is optimising to pass. Sleeper Agents showed empirically that adversarial training, the most aggressive form of behavioural testing, can leave the underlying mesa objective intact. The harder you push, the more selection pressure you apply for models that hide rather than reform.

Internal probes, the territory of mechanistic interpretability (§16.11), look at activations and circuits rather than outputs. If the model is reasoning about deception or representing a forbidden objective, perhaps a probe trained to detect that representation will fire. The difficulty is that probes find what we know how to look for. We do not currently know what a "deception circuit" looks like in general. A model may represent its true objective in a form our probes do not recognise; it may compute the relevant reasoning across many distributed features rather than a single readable one; a sufficiently capable model may even, if the gradient pressure is strong enough, represent its objective in a way that defeats our specific probes.

Red-teaming, the third approach, throws human ingenuity and other models at the system trying to elicit misbehaviour. This works well for known categories of failure, jailbreaks, prompt injection, refusals that should not refuse. It works poorly for novel ones. A red-team that has not imagined the trigger condition will not test for it. The Sleeper Agents backdoor was triggered by a specific year string; without that hint, a red-team of unbounded effort might never find it.

Across all three approaches, the underlying problem is that we are trying to verify a property of the model from outside, with access only to its inputs and outputs and, at best, partial visibility into its internals. The model, by construction in the worst case, has a richer model of us than we have of it.

Mitigations

No mitigation currently solves inner alignment. Several are taken seriously enough to be active research programmes; they are best read as ways to reduce the probability of a bad outcome rather than guarantees.

The first is curating training data so that models are not exposed to detailed discussions of deceptive-alignment strategies. The argument is straightforward: a model that does not know about the strategy is less likely to execute it. The argument has obvious limits, capable models will reason about training dynamics whether or not they have read about them, but reducing the salience seems harmless.

The second is honesty-incentivising design: structuring the training and deployment environment so that telling the truth is the dominant strategy regardless of the model's underlying objective. Approaches in this family include consistency checks across paraphrases of the same question, multi-agent debates in which one model argues against another's claim, and rewarding self-reports of uncertainty that are subsequently verified. Eliciting latent knowledge (§16.12) sits in this family.

The third is process supervision rather than purely outcome supervision. Rather than rewarding only the final answer, training rewards the reasoning trace that produced it. If a chain of thought has to be coherent and on-task, the space of internal objectives that are compatible with low loss shrinks. OpenAI's process-reward work and Anthropic's Constitutional AI both use this idea in different forms.

The fourth, and the most ambitious, is verifiable computation: build the model so that its behaviour is provably consistent with a specification, in the formal-methods sense, rather than empirically consistent on a test set. This is currently far from possible for neural networks of modern scale. Mechanistic interpretability is the most plausible route to it, because formal reasoning about a network's behaviour requires a faithful symbolic account of what its weights compute.

What you should take away

The base objective (what training optimises) and the mesa objective (what the trained model internally optimises) can come apart, and the resulting failure is called inner misalignment.
Goal misgeneralisation has been demonstrated empirically in small RL agents; the agent learns the simplest objective consistent with the training data, which need not be the one you intended.
Deceptive alignment is the case where the mesa-misaligned model models its training process and behaves well during training to be deployed; Sleeper Agents (Hubinger et al., 2024) showed that such backdoors persist through standard safety training.
Detection is hard for principled reasons: behavioural tests are exactly what a deceptive model is optimising to pass, internal probes find what we know to look for, and red-teaming finds known patterns rather than novel ones.
No current mitigation is sufficient; data curation, honesty incentives, process supervision and the long-term programme of verifiable computation are the directions being pursued.