Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, & Shane Legg (2020)
Avoiding Side Effects By Considering Future Tasks.
Advances in Neural Information Processing Systems 33.
URL: https://arxiv.org/abs/2010.07877
Abstract. Develops a future-task approach to impact regularisation for reinforcement-learning agents. The agent is penalised for actions that reduce its ability to complete a wide set of possible future tasks, represented as auxiliary reward functions, which encourages policies that achieve the primary objective without unnecessary changes to the world. The paper formalises and generalises the earlier relative-reachability idea of Krakovna et al. (2018) and demonstrates the method on gridworld safety environments. Impact regularisation remains a speculative mitigation in the safe-RL literature.
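The penalty mechanism described above can be sketched minimally: subtract from the task reward a term proportional to the drop in the agent's attainable value across auxiliary reward functions. This is an illustrative sketch, not the paper's implementation; the function names and the mean-of-positive-drops aggregation are assumptions.

```python
def impact_penalty(aux_values_before, aux_values_after):
    """Mean decrease in attainable auxiliary value across auxiliary tasks.

    Each list entry is the (estimated) optimal value of one auxiliary
    reward function from the current state; only decreases are penalised.
    (Illustrative aggregation choice, not taken from the paper.)
    """
    drops = [max(0.0, before - after)
             for before, after in zip(aux_values_before, aux_values_after)]
    return sum(drops) / len(drops)


def shaped_reward(task_reward, aux_values_before, aux_values_after, lam=1.0):
    """Primary reward minus the scaled impact penalty (lam trades off
    task performance against side-effect avoidance)."""
    return task_reward - lam * impact_penalty(aux_values_before, aux_values_after)
```

For example, if one of two auxiliary values falls from 0.8 to 0.2 after an action, the penalty is 0.3, and with `lam=0.5` a task reward of 1.0 is shaped down to 0.85.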
Tags: alignment safety reinforcement-learning
Cited in: