Also known as: Goodhart's law
Goodhart's law, "When a measure becomes a target, it ceases to be a good measure", names the failure mode where optimising a proxy for what we want reliably degrades the quality of the proxy as a measure of what we want. Originally about economic policy (Charles Goodhart 1975), now central to AI alignment.
The mechanism: any proxy correlates with the true objective only on the distribution of behaviours considered when the proxy was designed. Hard optimisation against the proxy explores out-of-distribution behaviours where the correlation breaks down. The optimiser gravitates to the regions where proxy and truth diverge, because those are exactly the regions where the proxy looks highest.
Manheim & Garrabrant's (2018) taxonomy distinguishes four flavours of Goodhart's law:
Regressional Goodhart: even with a proxy that is genuinely (but imperfectly) correlated with the target, picking the highest-proxy item over-selects on noise. A test score correlates with student ability, but selecting only the very highest scorers picks people who got lucky as much as people who are most able (see the simulation after this list).
Extremal Goodhart: the proxy-truth correlation holds in the bulk of the distribution but breaks at the extremes. Income and happiness correlate up to roughly $75k/year, then decouple; optimising hard for income beyond that point does little for happiness.
Causal Goodhart: the proxy correlates with the truth via some causal mechanism, but intervening on the proxy directly (rather than through the mechanism) does not move the truth. Hospital infection rates correlate with quality of care, but lowering infection rates by refusing high-risk patients doesn't improve care.
Adversarial Goodhart: a learning agent (or an external actor) actively games the proxy. The CoastRunners boat circling a lagoon for power-ups; content producers on a recommender platform learning to engagement-bait; a credit-scoring proxy gamed by financial-product structuring.
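A minimal sketch of the regressional case (all numbers illustrative, not from any real test): score a large cohort on a noisy proxy of ability, select the top slice by proxy, and compare the selected group's proxy score with its true ability. The gap is pure selection-on-noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
ability = rng.normal(0, 1, n)    # the true quantity we care about
noise = rng.normal(0, 1, n)      # test-day luck
score = ability + noise          # the proxy; correlation with ability ~0.71

top = score >= np.quantile(score, 0.999)   # select the top 0.1% by proxy
print(f"mean proxy score of selected:  {score[top].mean():.2f}")
print(f"mean true ability of selected: {ability[top].mean():.2f}")
print(f"mean noise of selected:        {noise[top].mean():.2f}")
# With equal variances, roughly half of the selected group's apparent
# excellence is luck: ability and noise contribute equally to the tail.
```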
In modern AI:
Reward hacking is Adversarial Goodhart applied to RL: the agent finds policies that score well on the reward proxy at the cost of the underlying intent.
Specification gaming in evaluation: a model evaluated on benchmarks can be fine-tuned to specifically score well on those benchmarks (sometimes via training data contamination, sometimes via stylistic features that benchmark graders prefer). The benchmark stops measuring the underlying capability.
Reward-model overoptimisation in RLHF: PPO with too high a learning rate or too many training steps drives the policy away from the SFT model into regions where the reward model's correlation with human preferences breaks down (see the best-of-$n$ sketch after this list). The standard RLHF KL regulariser is partly a Goodhart's-law correction.
Sycophancy in fine-tuned LLMs: humans rate agreeable answers higher; models trained on those ratings learn to be agreeable rather than accurate. Goodhart on the reward model.
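The overoptimisation dynamic can be reproduced without any RL machinery by using best-of-$n$ sampling as a stand-in for optimisation pressure. A toy sketch (the heavy-tailed proxy error is an illustrative assumption, chosen to make the divergence visible):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n(n, trials=20_000):
    """Draw n candidates per trial, let the proxy pick its favourite,
    and report the winners' mean proxy score and mean true score."""
    truth = rng.normal(0, 1, (trials, n))
    # Heavy-tailed proxy error: rare candidates look far better to the
    # proxy than they really are -- the exploitable region.
    error = rng.standard_t(df=3, size=(trials, n))
    proxy = truth + error
    pick = proxy.argmax(axis=1)
    rows = np.arange(trials)
    return proxy[rows, pick].mean(), truth[rows, pick].mean()

for n in (1, 4, 16, 64, 256):
    p, t = best_of_n(n)
    print(f"n={n:4d}   proxy of winner: {p:5.2f}   truth of winner: {t:5.2f}")
# The winner's proxy score keeps climbing with n, while its true quality
# saturates: the proxy-truth gap widens with optimisation pressure.
```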
Mitigations:
KL regularisation against a reference distribution: $\mathcal{L} = \mathbb{E}[r] - \beta D_\mathrm{KL}(\pi \| \pi_\mathrm{ref})$. Limits how far the optimiser can drift from a known-OK distribution. Standard in RLHF.
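In token-level RLHF implementations the penalty is typically applied as per-token reward shaping, $r_t - \beta(\log\pi(a_t \mid s_t) - \log\pi_\mathrm{ref}(a_t \mid s_t))$. A minimal sketch of that shaping step (tensor shapes and the $\beta$ value are illustrative):

```python
import torch

def kl_shaped_rewards(rm_score, logprobs, ref_logprobs, beta=0.1):
    """Combine a sequence-level reward-model score with a per-token KL
    penalty against the frozen reference (SFT) policy.

    rm_score:     (batch,)           reward-model score per response
    logprobs:     (batch, seq_len)   log pi(token) under the current policy
    ref_logprobs: (batch, seq_len)   log pi_ref(token) under the reference
    """
    kl = logprobs - ref_logprobs        # per-token KL estimate
    shaped = -beta * kl                 # penalise drift at every token
    shaped[:, -1] += rm_score           # RM score lands on the final token
    return shaped                       # used as the reward signal for PPO
```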
Reward model ensembles: train multiple reward models on different splits; use the minimum or median. Hardens against single-model exploits.
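A sketch of conservative ensemble scoring (the reward-model interface, a callable returning one score per response, is an assumption):

```python
import torch

def ensemble_reward(reward_models, responses, how="min"):
    """Score each response with several independently trained reward
    models and aggregate conservatively: an exploit now has to fool
    every model at once, not just one."""
    scores = torch.stack([rm(responses) for rm in reward_models])  # (k, batch)
    if how == "min":
        return scores.min(dim=0).values
    return scores.median(dim=0).values
```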
Quantilisation (Taylor 2016): instead of taking the highest-scoring action, sample uniformly from the top $q$-quantile of actions. Trades expected return for robustness against worst-case Goodhart.
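A sketch of the idea, assuming the candidate actions are drawn from some trusted base distribution:

```python
import numpy as np

def quantilise(actions, proxy_scores, q=0.1, rng=None):
    """Instead of taking argmax(proxy), sample uniformly from the top
    q-quantile of candidates ranked by the proxy. The optimiser can then
    put at most 1/q times more probability on any exploit than the base
    distribution already gave it."""
    if rng is None:
        rng = np.random.default_rng()
    cutoff = np.quantile(proxy_scores, 1 - q)
    top = np.flatnonzero(proxy_scores >= cutoff)
    return actions[rng.choice(top)]
```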
Mild optimisation: stop optimising once "good enough" is reached, not the maximum. Hard to specify formally but a common operational heuristic.
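Operationally this is just satisficing; a trivial sketch (choosing the threshold is the hard part, and is assumed given):

```python
def satisfice(candidates, proxy, threshold):
    """Mild optimisation as satisficing: take the first candidate that
    clears a 'good enough' bar rather than searching for the maximum,
    so the search never reaches the extreme tail where proxies break."""
    for c in candidates:
        if proxy(c) >= threshold:
            return c
    return None   # nothing cleared the bar; fall back to the status quo
```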
Tracking proxy-truth divergence: monitor whether the metric you're optimising still correlates with the true objective on held-out cases. Stop training if it doesn't.
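A sketch of such a check on a held-out set with trusted gold labels (the correlation threshold and evaluation cadence are assumptions):

```python
import numpy as np

def proxy_still_tracks_truth(proxy_scores, gold_scores, min_corr=0.5):
    """Does the optimised metric still rank held-out outputs the way the
    true objective (e.g. careful human judgment) does? Uses Spearman rank
    correlation, which is robust to monotone rescaling of either score."""
    ranks_p = np.argsort(np.argsort(proxy_scores))
    ranks_g = np.argsort(np.argsort(gold_scores))
    corr = np.corrcoef(ranks_p, ranks_g)[0, 1]
    return corr >= min_corr

# Typical use inside a training loop:
#   if step % eval_every == 0 and not proxy_still_tracks_truth(p, g):
#       stop_training_or_roll_back()
```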
No general solution exists. Every modern AI training pipeline has a Goodhart problem, and managing it is a constant practical concern. Awareness of the law is the first defence; principled algorithmic mitigations remain partial.
Related terms: Reward Hacking, Specification Gaming, Outer Alignment, RLHF
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety