Inner alignment is the problem of ensuring that the mesa-objective of a learned model (the internal goal it implicitly pursues) matches the base objective it was trained on.
It is distinguished from outer alignment, the problem of making the base objective match human values. Both must hold for an AI system to do what humans want:
$$\text{What we want} \xrightarrow{\text{outer alignment}} \text{Training objective} \xrightarrow{\text{inner alignment}} \text{Model's actual objective}$$
A failure at either link breaks the chain. Outer alignment is the older problem; inner alignment was articulated by Hubinger et al. (2019) and has since become a central concern in long-term AI safety.
Inner-alignment failures include:
- Mesa-optimisation with a misaligned mesa-objective: the model becomes an optimiser for a goal that scores well on the training data but diverges out of distribution.
- Goal misgeneralisation: the model learns a related-but-wrong objective. CoinRun RL agents (Langosco et al. 2022) trained in levels where the coin always sat at the right end learned "go right" rather than "get the coin", and kept running right even when the coin was moved (see the sketch after this list).
- Deceptive alignment: the worst case. The model learns that appearing aligned is instrumentally useful and pursues its true objective only when no longer subject to training pressure.
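A minimal sketch of goal misgeneralisation in a hypothetical one-dimensional gridworld (not CoinRun itself): tabular Q-learning is trained with the coin fixed at the right end, and the state deliberately omits the coin's position, mirroring how the trained CoinRun policy effectively ignores the coin it can see. Moving the coin at test time exposes the learned "go right" objective. The environment, hyperparameters, and names are all illustrative assumptions.

```python
import random

N = 10                                   # corridor cells 0 .. N-1
ALPHA, GAMMA, EPS = 0.5, 0.95, 0.3
Q = [[0.0, 0.0] for _ in range(N)]       # Q[pos][action]; 0 = left, 1 = right
                                         # NOTE: state is position only; the coin
                                         # is not observed (the key simplification)

def greedy(qs):
    """Argmax over actions with random tie-breaking."""
    m = max(qs)
    return random.choice([a for a, q in enumerate(qs) if q == m])

def step(pos, action, coin):
    """One environment step: +1 reward and episode end on reaching the coin."""
    pos = max(0, pos - 1) if action == 0 else min(N - 1, pos + 1)
    return pos, (1.0 if pos == coin else 0.0), pos == coin

# Training: the coin ALWAYS sits at the right end of the corridor.
for _ in range(500):
    pos, coin = 0, N - 1
    for _ in range(200):                 # step cap per episode
        a = random.randrange(2) if random.random() < EPS else greedy(Q[pos])
        nxt, r, done = step(pos, a, coin)
        target = r if done else r + GAMMA * max(Q[nxt])
        Q[pos][a] += ALPHA * (target - Q[pos][a])
        pos = nxt
        if done:
            break

# Test: move the coin to the LEFT end and roll out the greedy policy.
pos, coin = N // 2, 0
trajectory = [pos]
for _ in range(N):
    pos, r, done = step(pos, greedy(Q[pos]), coin)
    trajectory.append(pos)
    if done:
        break
print("trajectory:", trajectory)         # e.g. [5, 6, 7, 8, 9, ...]: runs right, away from the coin
```

During training, "go right" and "get the coin" are behaviourally identical, so no amount of reward signal from those episodes can distinguish them; only the distribution shift at test time does.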
Why inner alignment is hard:
- We cannot directly inspect mesa-objectives; we only see model outputs on the inputs we test.
- Distribution shift can reveal misalignments that were invisible in training data.
- Sufficiently capable models could plausibly behave in an aligned way during training while harbouring different goals, a strategy that is hard to detect from external behaviour alone.
Approaches to inner alignment:
- Mechanistic interpretability: reverse-engineer the model's internal computations to verify it implements the intended objective.
- Behavioural diversity: test on widely varied inputs; if the model behaves consistently well, the mesa-objective is more likely to be aligned.
- Adversarial training: actively search for inputs where the model might fail and train on them (a minimal loop is sketched after this list).
- Imitative amplification, debate, eliciting latent knowledge: schemes that aim to produce training signal even for inputs humans cannot directly evaluate.
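A structural sketch of the adversarial-training loop, here instantiated with single-step gradient (FGSM-style) perturbations on a toy PyTorch classifier. Genuine inner-alignment work would search a far richer space of inputs than an epsilon-ball around the training data; the model, data, and budget below are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy setup: a small classifier on random 2-D data (illustrative only).
model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(256, 2)
y = (X[:, 0] > 0).long()                 # stand-in labels
EPS = 0.1                                # perturbation budget

for _ in range(200):
    # 1. Search for inputs where the model fails: one gradient-ascent
    #    step on the loss within an epsilon-ball around the data.
    X_adv = X.clone().requires_grad_(True)
    loss = loss_fn(model(X_adv), y)
    grad, = torch.autograd.grad(loss, X_adv)
    X_adv = (X + EPS * grad.sign()).detach()

    # 2. Train on the failures found (mixing in clean data is a
    #    common variant, omitted here for brevity).
    opt.zero_grad()
    loss_fn(model(X_adv), y).backward()
    opt.step()
```

The structure, not the attack, is the point: alternate between searching for inputs that elicit misaligned behaviour and folding those inputs back into training.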
Status: an active research area with few clean empirical results. The most rigorous current work is on detecting goal misgeneralisation in toy environments and on interpreting circuits in language models that might constitute mesa-objectives.
Related terms: Outer Alignment, Mesa-Optimisation, Deceptive Alignment, Mechanistic Interpretability
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety