A sparse autoencoder (SAE), in the mechanistic-interpretability sense, is a wide, overcomplete autoencoder trained on a layer's activations with a sparsity penalty, producing a dictionary of largely monosemantic feature directions. Cunningham et al. (2023) and Bricken et al. (Anthropic, 2023) introduced the technique; subsequent work (Templeton et al. 2024, Scaling Monosemanticity) extended it to production-scale LLMs.
Architecture: for hidden activations $h \in \mathbb{R}^d$ from some layer of a Transformer, train
$$\text{encoder: } z = \mathrm{ReLU}(W_e h + b_e), \quad z \in \mathbb{R}^D$$
$$\text{decoder: } \hat h = W_d z + b_d$$
with $D \gg d$ (typically $D = 8d$ to $64d$). The loss combines reconstruction error with a sparsity penalty:
$$\mathcal{L}_\mathrm{SAE} = \|h - \hat h\|^2 + \lambda \|z\|_1$$
The L1 penalty drives most components of $z$ to zero, giving a sparse code. After training, each non-zero $z_i$ ideally corresponds to a monosemantic feature that activates on a specific concept.
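A minimal PyTorch sketch of this setup; the expansion factor, $\lambda$, and variable names are illustrative choices, not prescribed by the papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d: int, expansion: int = 8, l1_coeff: float = 1e-3):
        super().__init__()
        D = expansion * d                       # overcomplete dictionary width, D >> d
        self.W_e = nn.Linear(d, D)              # encoder: z = ReLU(W_e h + b_e)
        self.W_d = nn.Linear(D, d)              # decoder: h_hat = W_d z + b_d
        self.l1_coeff = l1_coeff

    def forward(self, h: torch.Tensor):
        z = F.relu(self.W_e(h))                 # sparse code z in R^D
        h_hat = self.W_d(z)                     # reconstruction of the activation
        recon = (h - h_hat).pow(2).sum(-1).mean()
        sparsity = z.abs().sum(-1).mean()       # L1 penalty drives most of z to zero
        loss = recon + self.l1_coeff * sparsity
        return h_hat, z, loss
```

The decoder columns (here `self.W_d.weight[:, i]`) are the learned feature directions; many implementations additionally constrain them to unit norm so the L1 penalty on $z$ is well calibrated.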
Why this works (theoretically): superposition theory (Elhage et al. 2022) holds that neural networks represent more features than they have neurons by overlapping them in shared low-dimensional directions. SAEs project the superposed features into a higher-dimensional sparse code where each feature gets its own dimension; the reconstruction term keeps the code faithful to the original activations, though in practice some information is lost to reconstruction error.
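A toy illustration of the superposition picture (the dimensions, feature count, and random directions below are assumptions for the demo, not taken from the papers):

```python
import torch
import torch.nn.functional as F

d, n = 64, 512                                  # residual dims vs. number of "true" features
feature_dirs = F.normalize(torch.randn(n, d), dim=-1)   # nearly-orthogonal unit directions
coeffs = torch.zeros(n)
active = torch.randint(0, n, (4,))              # only a few features active on any one input
coeffs[active] = torch.rand(4) + 0.5
h = coeffs @ feature_dirs                       # superposed activation in R^d
# An SAE trained on many such h aims to recover (up to permutation and scale)
# the rows of feature_dirs as decoder columns, and coeffs as the sparse code z.
```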
What features look like in practice:
- Concrete entities: "the Golden Gate Bridge", "California", "DNA", "code", "Python imports".
- Abstract concepts: "betrayal", "uncertainty", "questions about geography", "religious texts".
- Linguistic features: "syntactic comma context", "starts a list", "mid-sentence pause".
- Behaviour features: "sycophantic praise", "refusal", "hedging".
- Programming-language features: "Python while-loop variable", "JSON value position", "JavaScript function arrow".
Templeton et al. 2024 (Anthropic, Scaling Monosemanticity): trained SAEs of up to 34M features on Claude 3 Sonnet's middle-layer activations and identified human-interpretable features at scale, including:
- Safety-relevant: deception, betrayal of trust, secrecy.
- Identity: features that activate when the model is described as a person, an AI, an assistant, etc.
- Steering experiments: artificially activating a "Golden Gate Bridge" feature caused Claude to identify itself as the bridge across many contexts, which is direct empirical evidence that SAE features are causally relevant (a rough steering sketch follows this list).
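A rough illustration of that kind of intervention, reusing the SparseAutoencoder sketch above; `feature_idx` and `alpha` are placeholders, and the actual experiments clamp the feature to a multiple of its observed maximum activation rather than adding a fixed offset:

```python
import torch

def steer(h: torch.Tensor, sae: "SparseAutoencoder", feature_idx: int, alpha: float) -> torch.Tensor:
    """Push an activation along one SAE feature's decoder direction."""
    direction = sae.W_d.weight[:, feature_idx]   # the feature's learned direction in R^d
    return h + alpha * direction                 # crudely "clamps" the feature on
```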
Variants and refinements:
- TopK SAE (Makhzani & Frey 2014, Gao et al. 2024): replace the L1 penalty with a hard top-$k$ activation, keeping only the $k$ largest entries of $z$. Avoids the "feature shrinkage" bias of L1 and tends to give cleaner features (see the sketch after this list).
- JumpReLU SAE (Rajamanoharan et al. 2024 DeepMind): a learned-threshold ReLU, trading off feature density and reconstruction.
- Gated SAE (Rajamanoharan et al. 2024): separates feature detection from feature magnitude, addressing L1's tendency to attenuate active features.
- Crosscoders (Lindsey et al. 2024): SAEs spanning multiple layers, capturing features that are stable across the residual stream.
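A minimal sketch of the top-$k$ activation that replaces the ReLU-plus-L1 encoder above; the exact placement of biases, normalisation, and auxiliary losses in Gao et al. 2024 differs and is not reproduced here:

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest entries of each code vector and zero the rest.

    Sparsity is enforced exactly, so no L1 term is needed and the surviving
    activations are not shrunk toward zero the way an L1 penalty shrinks them.
    """
    values, indices = torch.topk(pre_acts, k, dim=-1)
    z = torch.zeros_like(pre_acts)
    z.scatter_(-1, indices, values)
    return z
```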
Open problems:
- Feature splitting/merging: at different SAE widths, "the same" feature may split into multiple specific features or merge with related ones. Choice of width affects what features look like.
- Coverage: how many features actually exist in a given model, and how do we know we've found them all?
- Circuit-level interpretability: SAEs decompose representations but not the algorithms that compute on them. Connecting features into circuits is still labour-intensive.
- Computational cost: training SAEs on production-scale models requires substantial compute; deploying them for real-time monitoring is expensive.
SAEs are currently the most promising scalable interpretability technique. Anthropic, DeepMind, OpenAI and several smaller organisations (Goodfire, Apollo Research) maintain active SAE programmes.
Related terms: Mechanistic Interpretability, christopher-olah, Residual Stream, Autoencoder
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety