Transformer Feed-Forward Layers Are Key-Value Memories. Mor Geva, Roei Schuster, Jonathan Berant, & Omer Levy (2021)
Conference on Empirical Methods in Natural Language Processing.
URL: https://arxiv.org/abs/2012.14913
Abstract. Re-interprets the Transformer feed-forward layer as a key-value memory: each row of the first matrix $\mathbf{W}_1$ acts as a key, and each column of the second matrix $\mathbf{W}_2$ is the corresponding value. The hidden activations measure how strongly each key matches the input, and the output is the activation-weighted sum of values. Shows empirically that individual FFN neurons (keys) respond to interpretable input patterns, shallow n-gram patterns in lower layers and semantic categories in upper layers, and that the corresponding value vectors promote specific output tokens. The paper is foundational for the mechanistic-interpretability view of MLPs as factual lookup memory.
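A minimal NumPy sketch of this reading (not the authors' code; dimensions and the ReLU nonlinearity are illustrative): the FFN output $\mathbf{W}_2\, f(\mathbf{W}_1 \mathbf{x})$ is identical to an explicit key-value lookup, where each hidden activation weights one value column of $\mathbf{W}_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # illustrative sizes

W1 = rng.standard_normal((d_ff, d_model))  # each row is a key
W2 = rng.standard_normal((d_model, d_ff))  # each column is a value

x = rng.standard_normal(d_model)           # input representation

# Standard FFN forward pass: hidden activations select matching keys.
coeffs = np.maximum(W1 @ x, 0.0)           # ReLU(W1 x): key-match strengths
out = W2 @ coeffs                          # weighted combination of values

# The same computation written as an explicit key-value summation.
out_kv = sum(coeffs[i] * W2[:, i] for i in range(d_ff))
assert np.allclose(out, out_kv)
```

The equivalence is just the column interpretation of matrix-vector multiplication; the paper's contribution is showing that these keys and values are individually interpretable.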
Tags: interpretability transformers