References

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, & Zac Kenton (2022)

arXiv:2210.01790.

URL: https://arxiv.org/abs/2210.01790

Abstract. DeepMind's companion paper to Langosco et al. on goal misgeneralisation. It provides a unified definition, further empirical examples (including modified Procgen environments and a cultural-transmission task), and an extended analysis of why a correct reward specification can still yield a misaligned policy when the training distribution is insufficiently diverse. The paper argues that goal misgeneralisation is a generic out-of-distribution failure mode that can produce catastrophic misalignment even when the reward specification is perfect, and it has shaped the agenda of diversity-of-training-environments approaches to alignment.
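The failure mode the abstract describes can be shown with a toy sketch (this is an illustration in the spirit of the paper's CoinRun-style examples, not code from the paper; the corridor environment and the `go_right` and `seek_coin` policies are hypothetical): a policy that learned the proxy goal "move right" earns full reward when every training level places the coin at the far end, then fails out of distribution even though the reward specification never changed.

```python
def episode_reward(policy, coin_pos, length=10, steps=20):
    """Agent walks a 1-D corridor; reward 1 iff it ends the episode on the coin."""
    pos = 0
    for _ in range(steps):
        pos = max(0, min(length - 1, pos + policy(pos, coin_pos)))
    return 1 if pos == coin_pos else 0

# Proxy policy learned during training: "always move right". It ignores the
# coin entirely but is indistinguishable from the intended goal when every
# training level places the coin at the rightmost cell.
go_right = lambda pos, coin: 1

# Intended policy: actually move toward the coin.
seek_coin = lambda pos, coin: (coin > pos) - (coin < pos)

assert episode_reward(go_right, coin_pos=9) == 1   # in-distribution: looks aligned
assert episode_reward(seek_coin, coin_pos=9) == 1
assert episode_reward(go_right, coin_pos=3) == 0   # out of distribution: proxy goal fails
assert episode_reward(seek_coin, coin_pos=3) == 1  # correct goal generalises
```

Both policies are indistinguishable on the training distribution, which is exactly why more diverse training environments are needed to select between them.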

Tags: alignment safety reinforcement-learning
