Victoria Krakovna, Laurent Orseau, Richard Ngo, Miljan Martic, & Shane Legg (2020)
Avoiding Side Effects By Considering Future Tasks.
Advances in Neural Information Processing Systems 33.
URL: https://arxiv.org/abs/2010.07877
Abstract. Develops a future-task approach to impact regularisation for reinforcement-learning agents. The agent is penalised for actions that reduce its ability to complete a wide set of possible future tasks, represented as auxiliary reward functions, which encourages policies that achieve the primary objective without unnecessary changes to the world. The paper formalises and generalises the earlier relative-reachability idea of Krakovna et al. (2018) and demonstrates the method on gridworld safety environments. Impact regularisation remains a speculative mitigation in the safe-RL literature.
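The penalty mechanism described above can be sketched minimally: subtract from the task reward a term proportional to the drop in the agent's attainable value across auxiliary reward functions. This is an illustrative sketch, not the paper's implementation; the function names and the mean-of-positive-drops aggregation are assumptions.

```python
def impact_penalty(aux_values_before, aux_values_after):
    """Mean decrease in attainable auxiliary value across auxiliary tasks.

    Each list entry is the (estimated) optimal value of one auxiliary
    reward function from the current state; only decreases are penalised.
    (Illustrative aggregation choice, not taken from the paper.)
    """
    drops = [max(0.0, before - after)
             for before, after in zip(aux_values_before, aux_values_after)]
    return sum(drops) / len(drops)


def shaped_reward(task_reward, aux_values_before, aux_values_after, lam=1.0):
    """Primary reward minus the scaled impact penalty (lam trades off
    task performance against side-effect avoidance)."""
    return task_reward - lam * impact_penalty(aux_values_before, aux_values_after)
```

For example, if one of two auxiliary values falls from 0.8 to 0.2 after an action, the penalty is 0.3, and with `lam=0.5` a task reward of 1.0 is shaped down to 0.85.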
Tags: alignment safety reinforcement-learning
Cited in: