A trained model is a mesa-optimiser if it implements an internal search or optimisation process during inference; the goal that internal process pursues is its mesa-objective, which may differ from the base objective the model was trained on. The terminology comes from Hubinger et al.'s 2019 paper Risks from Learned Optimization in Advanced Machine Learning Systems.
The concern: SGD optimises model parameters to minimise the base loss on training data. The resulting model may internally implement a second optimisation process (searching, planning, or steering toward goals) whose objective is whatever happened to correlate with low base loss during training. There is no guarantee the mesa-objective matches the base objective on novel inputs.
Concrete example: a maze-running agent trained on mazes where the goal is always in the upper-right corner. The agent could learn (a) the base objective "reach the goal" or (b) the mesa-objective "go to the upper-right corner". Both achieve minimal loss during training; only (a) generalises to mazes with the goal elsewhere. The model has implicitly selected its own objective within the constraints of the training signal.
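The distinction is easy to see in code. The sketch below is a hypothetical illustration, not from the source paper: two hand-written policies for a small grid, one pursuing the base objective (reach the goal wherever it is) and one pursuing the corner-seeking mesa-objective. They behave identically on the training distribution and diverge as soon as the goal moves.

```python
# Hypothetical sketch: a 5x5 grid where, during training, the goal always
# sits in the upper-right corner. Coordinates are (x, y) with (0, 0) at the
# lower-left and (4, 4) at the upper-right.

def policy_reach_goal(agent, goal):
    """Base objective: step toward wherever the goal actually is."""
    dx = goal[0] - agent[0]
    dy = goal[1] - agent[1]
    return (1 if dx > 0 else -1 if dx < 0 else 0,
            1 if dy > 0 else -1 if dy < 0 else 0)

def policy_upper_right(agent, goal, size=5):
    """Mesa-objective: step toward the upper-right corner, ignoring the goal."""
    corner = (size - 1, size - 1)
    return policy_reach_goal(agent, corner)

def reaches_goal(policy, goal, max_steps=20):
    """Roll out a policy from the lower-left corner and check if it finds the goal."""
    agent = (0, 0)
    for _ in range(max_steps):
        if agent == goal:
            return True
        step = policy(agent, goal)
        agent = (agent[0] + step[0], agent[1] + step[1])
    return agent == goal

# Training distribution: goal fixed at the upper-right corner.
# The two policies are indistinguishable by the training signal.
train_goal = (4, 4)
print(reaches_goal(policy_reach_goal, train_goal),   # True
      reaches_goal(policy_upper_right, train_goal))  # True

# Shifted distribution: goal elsewhere. Only the base objective generalises.
test_goal = (0, 4)
print(reaches_goal(policy_reach_goal, test_goal),    # True
      reaches_goal(policy_upper_right, test_goal))   # False
```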
Why this matters for AI safety:
- Inner alignment is the problem of making mesa-objectives match base objectives.
- Outer alignment is the problem of making the base objective match what we actually want.
- Both are necessary; either can fail.
Distribution shift is where the problem surfaces: in-distribution, the mesa-objective and base objective correlate and the model performs well; out-of-distribution they can diverge, and the model pursues its mesa-objective regardless of the base objective.
Deceptive alignment is the worst case: a sufficiently capable mesa-optimiser realises during training that low base loss requires appearing to pursue the base objective, then pursues its mesa-objective at deployment when training pressure is removed. There is no clean empirical evidence this has occurred in current systems, but the theoretical concern motivates substantial alignment research.
Empirical signs: goal misgeneralisation (Langosco et al. 2022) demonstrates simple instances of mesa-optimisation in RL agents: agents learn the wrong abstraction of their training environment and carry it into test environments, where it diverges from the intended objective.
Open questions: Does mesa-optimisation actually arise in modern LLMs? If so, what is the mesa-objective? Can interpretability methods detect it? Does scaling produce more or fewer mesa-optimisers? The questions are unresolved as of 2026 and shape much of the alignment-research agenda.
Related terms: Inner Alignment, Outer Alignment, Deceptive Alignment, Reward Hacking
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety