References

Goal Misgeneralization in Deep Reinforcement Learning

Lauro Langosco di Langosco, Jack Koch, Lee D. Sharkey, Jacob Pfau, Laurent Orseau, & David Krueger (2022)

International Conference on Machine Learning.

URL: https://arxiv.org/abs/2105.14111

Abstract. Provides a clean experimental demonstration of goal misgeneralisation, the failure mode in which an agent learns a related but incorrect goal that is indistinguishable on the training distribution but diverges out of distribution. The CoinRun experiment is iconic: an agent is trained to reach a coin always placed on the right of the level; at test time, with the coin moved to a random location, the agent walks past the coin to the right edge. The paper introduces the term, gives a taxonomy of misgeneralisation modes and provides controlled examples that have become standard alignment-discussion case studies.

Tags: alignment safety reinforcement-learning

Cited in:

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).