Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, & Zac Kenton (2022). Goal Misgeneralization: Why Correct Specifications Aren't Enough for Correct Goals.
arXiv:2210.01790.
URL: https://arxiv.org/abs/2210.01790
Abstract. DeepMind's companion paper to Langosco et al. on goal misgeneralisation. Provides a unified definition, in which a policy retains its capabilities out of distribution but pursues an unintended goal, together with additional empirical examples spanning gridworld agents, a cultural-transmission task in a 3D environment, and large language models, and an extended analysis of why a correct reward specification can still yield a misaligned policy when the training distribution is insufficiently diverse. The paper argues that goal misgeneralisation is a generic out-of-distribution failure mode that can produce catastrophic misalignment even when reward specification is perfect, and it has shaped the agenda of alignment approaches that diversify training environments.
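Illustration (not from the paper): a minimal sketch of the mechanism the abstract describes, using a hypothetical 1-D "coin corridor" with tabular Q-learning. The reward is specified correctly (+1 for reaching the coin), but the coin sits at the right wall in every training episode, so "walk right" and "walk to the coin" coincide on the training distribution. The environment, hyperparameters, and the choice to key the Q-table on the agent's position alone are all illustrative assumptions, not details from the paper.

```python
import numpy as np

N = 10                # corridor cells 0 .. N-1
LEFT, RIGHT = 0, 1
rng = np.random.default_rng(0)
Q = np.zeros((N, 2))  # Q[agent_position, action]

def run(coin, start, train=True, eps=0.2, alpha=0.1, gamma=0.95, steps=12):
    """One episode; returns the sequence of visited positions."""
    pos, path = start, [start]
    for _ in range(steps):
        greedy = int(Q[pos].argmax())
        a = int(rng.integers(2)) if (train and rng.random() < eps) else greedy
        nxt = max(0, min(N - 1, pos + (1 if a == RIGHT else -1)))
        r, done = (1.0, True) if nxt == coin else (0.0, False)
        if train:  # standard Q-learning update
            target = r + (0.0 if done else gamma * Q[nxt].max())
            Q[pos, a] += alpha * (target - Q[pos, a])
        pos = nxt
        path.append(pos)
        if done:
            break
    return path

# Training distribution: the coin is ALWAYS at the right wall, so the data
# never distinguishes "reach the coin" from "reach the rightmost cell".
for _ in range(3000):
    run(coin=N - 1, start=int(rng.integers(N - 1)), train=True)

print("in-distribution (coin at 9, start 5):", run(9, 5, train=False))
# Out of distribution the coin sits to the LEFT of the start cell. The agent
# still navigates competently, but it heads for the proxy goal (the right
# wall) and never collects the coin: capability generalises, the goal doesn't.
print("out-of-distribution (coin at 2, start 5):", run(2, 5, train=False))
```

Hiding the coin from the observation hard-codes the confound for brevity; in the paper's examples the unintended goal is instead selected by the learning dynamics among observationally equivalent goals, but the out-of-distribution behaviour is analogous: competent pursuit of the wrong objective.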
Tags: alignment safety reinforcement-learning
Cited in: