Paul Christiano, Ajeya Cotra, & Mark Xu (2021)
Alignment Research Center.
URL: https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/
Abstract. The Alignment Research Center's foundational ELK (Eliciting Latent Knowledge) report. Frames the problem of eliciting knowledge from a learned predictor that "knows more" than its training labels can directly express. Uses the worked example of the SmartVault, a vault containing a diamond observed by a video camera: a reporter trained on operator approvals will answer "the diamond is safe" both when the diamond is genuinely safe and when an attacker has spoofed the camera. The predictor distinguishes the two cases internally, but the training signal cannot pull that knowledge out. ELK is the canonical statement of one of alignment's hardest problems and has shaped the agendas of multiple safety teams.
Tags: alignment safety interpretability
Cited in: