Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, & Chris Olah (2022). "In-context Learning and Induction Heads".
Anthropic.
URL: https://arxiv.org/abs/2209.11895
Abstract. Anthropic's mechanistic-interpretability paper on the emergence of in-context learning. Identifies induction heads: attention heads which, composing with an earlier "previous-token" head, implement the operation "find the previous occurrence of the current token and copy the token that came after it". Shows that the abrupt phase change in in-context-learning ability during training coincides with the formation of induction heads, and argues on circuit-level evidence (causal in small attention-only models, correlational in larger ones) that in-context learning is a discrete capability acquired through a specific circuit, not a smoothly emerging byproduct of scale. Foundational to the mechanistic-interpretability literature.
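The match-and-copy operation is simple enough to state directly. As a minimal sketch (not code from the paper; the function name and list-of-strings token representation are illustrative), the rule an induction head approximates, in Python:

    def induction_predict(tokens):
        """Induction-head rule: find the most recent earlier occurrence of the
        current (final) token and predict the token that followed it."""
        current = tokens[-1]
        # Scan backwards over earlier positions for a previous occurrence.
        for i in range(len(tokens) - 2, -1, -1):
            if tokens[i] == current:
                return tokens[i + 1]
        return None  # no earlier occurrence: the rule makes no prediction

    # A repeated pattern "A B C ... A B" is completed with "C":
    print(induction_predict(["A", "B", "C", "D", "A", "B"]))  # -> "C"

Because this rule works for arbitrary tokens, including ones never paired during training, its formation plausibly accounts for the generic sequence-completion ability measured as in-context learning.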
Tags: interpretability language-models in-context-learning