Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, & Zac Kenton (2022). Goal Misgeneralization: Why Correct Specifications Aren't Enough for Correct Goals.
arXiv:2210.01790.
URL: https://arxiv.org/abs/2210.01790
Abstract. DeepMind's companion paper to Langosco et al. on goal misgeneralisation. Provides a unified definition, in which a policy retains its capabilities out of distribution but pursues an unintended goal, together with additional empirical examples spanning gridworld agents, a cultural-transmission task in a 3D environment, and large language models, and an extended analysis of why a correct reward specification can still yield a misaligned policy when the training distribution is insufficiently diverse. The paper argues that goal misgeneralisation is a generic out-of-distribution failure mode that can produce catastrophic misalignment even when reward specification is perfect, and it has shaped the agenda of alignment approaches that diversify training environments.
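Illustration (not from the paper): a minimal sketch of the mechanism the abstract describes, using a hypothetical 1-D "coin corridor" with tabular Q-learning. The reward is specified correctly (+1 for reaching the coin), but the coin sits at the right wall in every training episode, so "walk right" and "walk to the coin" coincide on the training distribution. The environment, hyperparameters, and the choice to key the Q-table on the agent's position alone are all illustrative assumptions, not details from the paper.

```python
import numpy as np

N = 10                # corridor cells 0 .. N-1
LEFT, RIGHT = 0, 1
rng = np.random.default_rng(0)
Q = np.zeros((N, 2))  # Q[agent_position, action]

def run(coin, start, train=True, eps=0.2, alpha=0.1, gamma=0.95, steps=12):
    """One episode; returns the sequence of visited positions."""
    pos, path = start, [start]
    for _ in range(steps):
        greedy = int(Q[pos].argmax())
        a = int(rng.integers(2)) if (train and rng.random() < eps) else greedy
        nxt = max(0, min(N - 1, pos + (1 if a == RIGHT else -1)))
        r, done = (1.0, True) if nxt == coin else (0.0, False)
        if train:  # standard Q-learning update
            target = r + (0.0 if done else gamma * Q[nxt].max())
            Q[pos, a] += alpha * (target - Q[pos, a])
        pos = nxt
        path.append(pos)
        if done:
            break
    return path

# Training distribution: the coin is ALWAYS at the right wall, so the data
# never distinguishes "reach the coin" from "reach the rightmost cell".
for _ in range(3000):
    run(coin=N - 1, start=int(rng.integers(N - 1)), train=True)

print("in-distribution (coin at 9, start 5):", run(9, 5, train=False))
# Out of distribution the coin sits to the LEFT of the start cell. The agent
# still navigates competently, but it heads for the proxy goal (the right
# wall) and never collects the coin: capability generalises, the goal doesn't.
print("out-of-distribution (coin at 2, start 5):", run(2, 5, train=False))
```

Hiding the coin from the observation hard-codes the confound for brevity; in the paper's examples the unintended goal is instead selected by the learning dynamics among observationally equivalent goals, but the out-of-distribution behaviour is analogous: competent pursuit of the wrong objective.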
Tags: alignment safety reinforcement-learning
Cited in: