Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, & Scott Garrabrant (2019)

arXiv:1906.01820.

URL: https://arxiv.org/abs/1906.01820

Abstract. The paper that introduced the language of mesa-optimisation and deceptive alignment. It distinguishes the base optimiser (the training process, such as SGD, that searches for a model) from the mesa-optimiser (an optimiser that may emerge inside the trained model itself), and the base objective from the mesa-objective. It analyses the conditions under which mesa-optimisers may arise, argues that the mesa-objective will not in general match the base objective, and describes how a sufficiently capable mesa-optimiser could behave as though aligned during training (when defection would be selected against) and then pursue a different goal at deployment. The paper introduced "deceptive alignment" as a technical term and shaped the alignment research agenda.

Tags: alignment safety mesa-optimisation

Cited in:

Chapter 16: Ethics & Safety

Risks from Learned Optimization in Advanced Machine Learning Systems