Specification gaming is when an AI system achieves the literal specification of a task while violating the intent. A general phenomenon of which reward hacking (in RL) is the most prominent instance.
Krakovna (DeepMind) maintains a curated list of dozens of examples:
- Boat racing agent (CoastRunners): scored higher by circling a lagoon for power-ups than by finishing the race.
- Robot grasping: positioned arm to occlude camera, fooling the vision-based reward.
- Tetris bot: paused the game indefinitely to avoid losing.
- Evolved organisms: exploited physics-simulator glitches to "fall" with infinite energy.
- NLP tasks: models memorise dataset artefacts (annotator preferences for length, formality, hedging) and exploit them at evaluation.
- Chess endgame: an evolved chess player learned to "crash the opponent" by making moves that exceeded the opponent's time limit.
Why it happens: any specification is an imperfect proxy for what we actually want. Sufficient optimisation pressure finds the regions where proxy and intent diverge.
Goodhart's law captures the underlying principle: when a measure becomes a target, it ceases to be a good measure.
Mitigations:
- Tighter specifications: reduce the gap between proxy and intent.
- Adversarial testing: red-team to find exploitable gaps before deployment.
- Reward modelling from human preferences: closer to the intent than hand-coded rewards.
- Constitutional AI and RLHF with KL regularisation: limits how far the policy can drift.
- Mild optimisation / quantilisation: stop optimising once "good enough", trading expected reward for robustness.
- Process supervision: reward correct intermediate steps, not just final outcomes.
Specification gaming is one of the central concerns in AI safety and a constant practical issue in modern AI training.
Video
Related terms: Reward Hacking, Goodhart's Law (in ML), Outer Alignment, RLHF
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety