AI Alignment, Glossary, Textbook of AI

The Alignment Problem asks how we can ensure that the objectives pursued by an AI system faithfully reflect the intentions, values, and preferences of the humans it is meant to serve. Alignment subsumes specification gaming (agents finding unintended strategies that satisfy the literal objective) and reward hacking (gaming the reward signal rather than genuinely solving the task), but extends further to the philosophical challenge of value specification and the practical challenge of scalable oversight: supervising systems that may eventually exceed human capabilities.

The most successful practical approaches to alignment so far are Reinforcement Learning from Human Feedback (RLHF) and its successors. RLHF trains a reward model on human preference judgments, then fine-tunes the language model to optimise that reward, with a KL penalty keeping it close to the supervised baseline. Constitutional AI (Anthropic) reduces reliance on human labelling by having the model critique and revise its own outputs according to explicit principles. Direct Preference Optimisation (DPO) eliminates the separate reward model and the instabilities of the RL stage by training the policy directly on preference pairs with a classification-style loss.
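The quantities named above can be sketched on scalar inputs. This is an illustrative sketch only: it assumes the Bradley-Terry pairwise loss for the reward model (a common choice, not stated in this entry), a single-sample log-ratio as the KL penalty, and the standard published form of the DPO loss. The function names and the default `beta` are hypothetical, and real implementations operate on batches of token log-probabilities rather than single floats.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss (assumed): -log sigmoid(r_chosen - r_rejected).

    Small when the reward model scores the human-preferred response higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def kl_penalised_reward(reward: float, logp_policy: float, logp_ref: float,
                        beta: float = 0.1) -> float:
    """Reward minus beta times a single-sample estimate of the KL term.

    logp_policy and logp_ref are the log-probabilities of the same sampled
    response under the fine-tuned policy and the supervised baseline.
    """
    return reward - beta * (logp_policy - logp_ref)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair; no explicit reward model is needed.

    The implicit reward of a response is beta * (log pi - log pi_ref).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -log_sigmoid(margin)
```

When the policy equals the reference model, the DPO margin is zero and the loss is log 2; the loss falls as the policy raises the probability of the chosen response relative to the rejected one, measured against the reference.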

Deeper challenges remain open. Reward hacking arises when the reward model fails to capture human intent and the policy exploits the gap. Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure") is the general form of this problem. Corrigibility asks how to build agents that welcome correction and do not resist shutdown; the difficulty is that an agent that has internalised a goal has instrumental incentives to preserve its ability to pursue that goal. Scalable oversight asks how to evaluate outputs in domains where human experts cannot easily assess quality. Proposed approaches include iterated amplification, debate, and recursive reward modelling. Alignment is arguably the most important open problem in AI safety, and its resolution may be essential for the safe development of highly capable future systems.
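Goodhart's Law can be illustrated with a toy example. In this sketch, `true_utility` and `proxy_reward` are hypothetical functions chosen only so that the proxy tracks the true objective for small actions and diverges for large ones; they model no real reward model. As the optimiser is given a larger search budget, the proxy score keeps rising while the true utility peaks and then collapses.

```python
def true_utility(x: float) -> float:
    """Hypothetical 'what humans actually want': peaks at x = 1."""
    return x - 0.5 * x * x


def proxy_reward(x: float) -> float:
    """Hypothetical learned proxy: agrees with true_utility's direction only for x < 1."""
    return x


def optimise_proxy(max_x: float, steps: int = 100) -> float:
    """Return the candidate in [0, max_x] with the highest proxy reward.

    max_x stands in for optimisation pressure: a larger range lets the
    optimiser move further from the region where the proxy was accurate.
    """
    candidates = [max_x * i / steps for i in range(steps + 1)]
    return max(candidates, key=proxy_reward)


for budget in (0.5, 1.0, 2.0, 4.0):
    x = optimise_proxy(budget)
    print(f"budget={budget}: proxy={proxy_reward(x):.3f}, true={true_utility(x):.3f}")
```

The KL penalty used in RLHF plays a role similar to keeping `max_x` small: it limits how far the policy can move from the region in which the reward model is likely to be reliable.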


Mesa-optimisation: an objective hidden inside a learned model. The base optimiser trains a model that is itself an optimiser, with its own learned objective, which may differ from the objective the base optimiser was trained on.


Related terms: RLHF, AI Safety

Discussed in:

Chapter 16: Ethics & Safety, Alignment
