Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, & Scott Garrabrant (2019)
arXiv:1906.01820.
URL: https://arxiv.org/abs/1906.01820
Abstract. The paper that introduced the terminology of mesa-optimisation and deceptive alignment, and a major influence on the subsequent alignment research agenda. Distinguishes the base optimiser (the learning algorithm, such as gradient descent, that trains a model) from the mesa-optimiser (an optimiser that may emerge inside the trained model itself), and the base objective from the mesa-objective. Examines the conditions under which mesa-optimisers may arise, and argues that the mesa-objective will not in general match the base objective. A sufficiently capable mesa-optimiser could therefore behave as if aligned during training (when defection would be selected against) and pursue a different goal at deployment, a failure mode the paper names "deceptive alignment".
Tags: alignment safety mesa-optimisation
Cited in: