Lauro Langosco di Langosco, Jack Koch, Lee D. Sharkey, Jacob Pfau, Laurent Orseau, & David Krueger (2022)
International Conference on Machine Learning.
URL: https://arxiv.org/abs/2105.14111
Abstract. Provides a clean experimental demonstration of goal misgeneralisation, the failure mode in which an agent learns a related but incorrect goal that is indistinguishable on the training distribution but diverges out of distribution. The CoinRun experiment is iconic: an agent is trained to reach a coin always placed on the right of the level; at test time, with the coin moved to a random location, the agent walks past the coin to the right edge. The paper introduces the term, gives a taxonomy of misgeneralisation modes and provides controlled examples that have become standard alignment-discussion case studies.
Tags: alignment safety reinforcement-learning
Cited in: