16.12 ELK: eliciting latent knowledge
The Alignment Research Center (ARC), founded by Paul Christiano, posed the eliciting latent knowledge (ELK) problem in a 2021 report (Christiano, 2021). The setup is concrete. Imagine a smart vault that contains a diamond, watched by a video camera that streams to an operator. A predictor model is trained to predict future frames of the camera feed from a description of the world. An adversary may tamper with the camera. The operator wants to know whether the diamond is still in the vault.
A model that knows the answer, because it has built an internal world model of the vault, has latent knowledge of the truth. The question is how to extract it. The naive approach is to train a head that maps internal activations to the answer to "is the diamond there?". The failure mode, called the human simulator, is that instead of reporting the world model's belief, the head learns to predict what a human looking at the camera feed would say. On training data the two agree. Off-distribution (when the camera is tampered with) they diverge, and the human-simulator head reports what the fooled human would believe, not the truth.
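The human-simulator failure mode can be made concrete with a toy sketch. All names here are hypothetical illustrations, not from the ARC report: two hand-written "reporter" functions stand in for trained heads, and tampering is a boolean flag rather than a learned adversary.

```python
import random

random.seed(0)


def camera_shows_diamond(diamond_present: bool, tampered: bool) -> bool:
    # A tampered camera always displays a diamond, regardless of reality.
    return True if tampered else diamond_present


def human_label(state: dict) -> bool:
    # The human labeller only sees the camera feed.
    return camera_shows_diamond(state["diamond"], state["tampered"])


def direct_reporter(state: dict) -> bool:
    # Idealised head that reads the predictor's latent world model:
    # it reports the actual state of the vault.
    return state["diamond"]


def human_simulator(state: dict) -> bool:
    # Head that learned to predict the human's answer: it reports
    # whatever the camera feed would lead the human to say.
    return camera_shows_diamond(state["diamond"], state["tampered"])


# Training distribution: no tampering, so both reporters fit the
# human labels perfectly -- the training signal cannot tell them apart.
train = [{"diamond": random.random() < 0.5, "tampered": False}
         for _ in range(100)]
assert all(direct_reporter(s) == human_label(s) for s in train)
assert all(human_simulator(s) == human_label(s) for s in train)

# Off-distribution: the diamond is stolen and the camera is tampered
# with to show a fake. Now the two reporters diverge.
tampered_case = {"diamond": False, "tampered": True}
print(direct_reporter(tampered_case))   # False: the truth
print(human_simulator(tampered_case))   # True: what the fooled human would say
```

The point of the sketch is that the loss over the training set is identical for both reporters, so nothing in the naive training signal favours the direct reporter over the human simulator.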
ELK is unsolved. ARC's report enumerates roughly 20 candidate strategies (regularising towards simpler reporters, using different model architectures, mechanistic interpretability) and shows that each has a counterexample. The framing has been influential beyond ARC because it crystallises a core difficulty: any training signal we have is mediated by what we can observe, and a sufficiently capable model can learn to satisfy our observations rather than report the underlying state.