Leo Gao, John Schulman, & Jacob Hilton (2022). Scaling Laws for Reward Model Overoptimization.
arXiv:2210.10760.
URL: https://arxiv.org/abs/2210.10760
Abstract. The first quantitative empirical study of reward hacking in RLHF. A large "gold" reward model stands in for ground truth; smaller proxy reward models are trained on its labels, and policies are then optimised against the proxy with both best-of-n sampling and RL. The result is a clean Goodhart-style curve: as the policy's KL divergence from the initial model grows, gold reward at first rises in lockstep with proxy reward, then peaks and falls, while proxy reward keeps climbing. The paper fits functional forms for gold reward as a function of sqrt(KL), with coefficients that scale smoothly with proxy model size and dataset size, so the location and depth of the peak become predictable (see the sketch below). It is the standard empirical reference for the shape and inevitability of reward hacking under sustained optimisation pressure.
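A minimal sketch of the Goodhart curve using the RL functional form the paper reports, R(d) = d(alpha - beta * log d) with d = sqrt(KL(pi || pi_init)). The coefficient values below are invented for illustration (in the paper they depend on proxy RM size and data size), and the linear proxy curve is purely schematic:

```python
import numpy as np
import matplotlib.pyplot as plt

# Gold reward vs. distance from the initial policy, using the paper's
# RL fit R(d) = d * (alpha - beta * log d), where d = sqrt(KL).
alpha, beta = 1.0, 0.25          # hypothetical coefficients, not from the paper
d = np.linspace(0.01, 50, 500)   # d = sqrt(KL(pi || pi_init))

gold_reward = d * (alpha - beta * np.log(d))  # rises, peaks, then falls
proxy_reward = alpha * d                      # schematic: keeps climbing

plt.plot(d, gold_reward, label="gold reward (RL fit)")
plt.plot(d, proxy_reward, "--", label="proxy reward (schematic)")
plt.xlabel(r"$d = \sqrt{\mathrm{KL}(\pi\,\|\,\pi_{\mathrm{init}})}$")
plt.ylabel("reward")
plt.legend()
plt.show()
```

Under this form, setting dR/dd = 0 puts the peak at d* = exp(alpha/beta - 1), which is how the fitted coefficient scaling translates into a prediction of where overoptimisation sets in.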
Tags: alignment rlhf reward-hacking