A sparse autoencoder (SAE), in the mechanistic-interpretability sense, is a wide, overcomplete autoencoder trained on a layer's activations with a sparsity penalty, producing a dictionary of largely monosemantic feature directions. Cunningham et al. (2023) and Bricken et al. (Anthropic, 2023) introduced the technique; subsequent work (Templeton et al. 2024, Scaling Monosemanticity) extended it to production-scale LLMs.
Architecture: for hidden activations $h \in \mathbb{R}^d$ from some layer of a Transformer, train
$$\text{encoder: } z = \mathrm{ReLU}(W_e h + b_e), \quad z \in \mathbb{R}^D$$
$$\text{decoder: } \hat h = W_d z + b_d$$
with $D \gg d$ (typically $D = 8d$ to $64d$). The loss combines reconstruction error with a sparsity penalty:
$$\mathcal{L}_\mathrm{SAE} = \|h - \hat h\|^2 + \lambda \|z\|_1$$
The L1 penalty drives most components of $z$ to zero, giving a sparse code. After training, each non-zero $z_i$ ideally corresponds to a monosemantic feature that activates on a specific concept.
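A minimal PyTorch sketch of this setup; the expansion factor, $\lambda$, and variable names are illustrative choices, not prescribed by the papers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d: int, expansion: int = 8, l1_coeff: float = 1e-3):
        super().__init__()
        D = expansion * d                       # overcomplete dictionary width, D >> d
        self.W_e = nn.Linear(d, D)              # encoder: z = ReLU(W_e h + b_e)
        self.W_d = nn.Linear(D, d)              # decoder: h_hat = W_d z + b_d
        self.l1_coeff = l1_coeff

    def forward(self, h: torch.Tensor):
        z = F.relu(self.W_e(h))                 # sparse code z in R^D
        h_hat = self.W_d(z)                     # reconstruction of the activation
        recon = (h - h_hat).pow(2).sum(-1).mean()
        sparsity = z.abs().sum(-1).mean()       # L1 penalty drives most of z to zero
        loss = recon + self.l1_coeff * sparsity
        return h_hat, z, loss
```

The decoder columns (here `self.W_d.weight[:, i]`) are the learned feature directions; many implementations additionally constrain them to unit norm so the L1 penalty on $z$ is well calibrated.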
Why this works (theoretically): superposition theory (Elhage et al. 2022) holds that neural networks represent more features than they have neurons by overlapping them in shared low-dimensional directions. SAEs project the superposed features into a higher-dimensional sparse code where each feature gets its own dimension; the reconstruction term keeps the code faithful to the original activations, though in practice some information is lost to reconstruction error.
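A toy illustration of the superposition picture (the dimensions, feature count, and random directions below are assumptions for the demo, not taken from the papers):

```python
import torch
import torch.nn.functional as F

d, n = 64, 512                                  # residual dims vs. number of "true" features
feature_dirs = F.normalize(torch.randn(n, d), dim=-1)   # nearly-orthogonal unit directions
coeffs = torch.zeros(n)
active = torch.randint(0, n, (4,))              # only a few features active on any one input
coeffs[active] = torch.rand(4) + 0.5
h = coeffs @ feature_dirs                       # superposed activation in R^d
# An SAE trained on many such h aims to recover (up to permutation and scale)
# the rows of feature_dirs as decoder columns, and coeffs as the sparse code z.
```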
What features look like in practice:
- Concrete entities: "the Golden Gate Bridge", "California", "DNA", "code", "Python imports".
- Abstract concepts: "betrayal", "uncertainty", "questions about geography", "religious texts".
- Linguistic features: "syntactic comma context", "starts a list", "mid-sentence pause".
- Behaviour features: "sycophantic praise", "refusal", "hedging".
- Programming-language features: "Python while-loop variable", "JSON value position", "JavaScript function arrow".
Templeton et al. 2024 (Anthropic, Scaling Monosemanticity): trained SAEs of up to 34M features on Claude 3 Sonnet's middle-layer activations and identified human-interpretable features at scale, including:
- Safety-relevant: deception, betrayal of trust, secrecy.
- Identity: features that activate when the model is described as a person, an AI, an assistant, etc.
- Steering experiments: artificially activating a "Golden Gate Bridge" feature caused Claude to identify itself as the bridge across many contexts, which is direct empirical evidence that SAE features are causally relevant (a rough steering sketch follows this list).
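A rough illustration of that kind of intervention, reusing the SparseAutoencoder sketch above; `feature_idx` and `alpha` are placeholders, and the actual experiments clamp the feature to a multiple of its observed maximum activation rather than adding a fixed offset:

```python
import torch

def steer(h: torch.Tensor, sae: "SparseAutoencoder", feature_idx: int, alpha: float) -> torch.Tensor:
    """Push an activation along one SAE feature's decoder direction."""
    direction = sae.W_d.weight[:, feature_idx]   # the feature's learned direction in R^d
    return h + alpha * direction                 # crudely "clamps" the feature on
```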
Variants and refinements:
- TopK SAE (Makhzani & Frey 2014, Gao et al. 2024): replace the L1 penalty with a hard top-$k$ activation, keeping only the $k$ largest entries of $z$. Avoids the "feature shrinkage" bias of L1 and tends to give cleaner features (see the sketch after this list).
- JumpReLU SAE (Rajamanoharan et al. 2024 DeepMind): a learned-threshold ReLU, trading off feature density and reconstruction.
- Gated SAE (Rajamanoharan et al. 2024): separates feature detection from feature magnitude, addressing L1's tendency to attenuate active features.
- Crosscoders (Lindsey et al. 2024): SAEs spanning multiple layers, capturing features that are stable across the residual stream.
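A minimal sketch of the top-$k$ activation that replaces the ReLU-plus-L1 encoder above; the exact placement of biases, normalisation, and auxiliary losses in Gao et al. 2024 differs and is not reproduced here:

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep only the k largest entries of each code vector and zero the rest.

    Sparsity is enforced exactly, so no L1 term is needed and the surviving
    activations are not shrunk toward zero the way an L1 penalty shrinks them.
    """
    values, indices = torch.topk(pre_acts, k, dim=-1)
    z = torch.zeros_like(pre_acts)
    z.scatter_(-1, indices, values)
    return z
```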
Open problems:
- Feature splitting/merging: at different SAE widths, "the same" feature may split into multiple specific features or merge with related ones. Choice of width affects what features look like.
- Coverage: how many features actually exist in a given model, and how do we know we've found them all?
- Circuit-level interpretability: SAEs decompose representations but not the algorithms that compute on them. Connecting features into circuits is still labour-intensive.
- Computational cost: training SAEs on production-scale models requires substantial compute; deploying them for real-time monitoring is expensive.
SAEs are currently the most promising scalable interpretability technique. Anthropic, DeepMind, OpenAI and several smaller organisations (Goodfire, Apollo Research) maintain active SAE programmes.
Related terms: Mechanistic Interpretability, christopher-olah, Residual Stream, Autoencoder
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety