Glossary

Mechanistic Interpretability

Mechanistic interpretability is the project of reverse-engineering trained neural networks into human-understandable algorithms: identifying the specific circuits, features and algorithmic steps that the network implements during inference. It is distinguished from purely behavioural interpretability ("the network sees this image and labels it as a cat"), which describes what the network does but not how it does it.

The motivating bet (Olah et al., Anthropic 2021–): the algorithms that trained neural networks implement may be much simpler than the networks themselves, comprising a smaller number of human-comprehensible building blocks. If we can identify these building blocks and their composition, we can audit, modify and trust the network.

Key constructs:

Features: human-meaningful concepts represented by specific activation patterns (often directions) in the network. Early examples in vision CNNs: curve detectors, dog-snout detectors, car-wheel detectors (Olah's Distill circuits papers, 2020).

Circuits: patterns of weights that compose features into more complex computations. A canonical example from the Distill circuits papers is the car detector assembled from wheel-, window- and car-body detector features connected by specific excitatory and inhibitory weights.

Residual stream: in Transformers, the running sum of layer outputs that each layer reads from and writes to. Anthropic's Transformer Circuits programme treats the residual stream as the "communication channel" between attention heads and MLPs.
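
A minimal sketch of this view, with placeholder linear layers standing in for real attention and MLP sub-layers (module and variable names are illustrative, not from any library):

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Sketch of one Transformer block in the residual-stream view.
    `attn` and `mlp` are stand-ins (plain linear maps), not real sub-layers."""
    def __init__(self, d_model: int):
        super().__init__()
        self.attn = nn.Linear(d_model, d_model)  # placeholder for an attention layer
        self.mlp = nn.Linear(d_model, d_model)   # placeholder for an MLP layer

    def forward(self, resid: torch.Tensor) -> torch.Tensor:
        # Each sub-layer reads the current residual stream and writes its
        # output back by addition -- the stream is a running sum.
        resid = resid + self.attn(resid)
        resid = resid + self.mlp(resid)
        return resid

x = torch.randn(1, 8, 64)    # (batch, sequence, d_model)
stream = ToyBlock(64)(x)     # the updated residual stream
```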

Induction heads (Olsson et al. 2022): a two-layer circuit of attention heads that copies repeated subsequences: having seen ... [A][B] ... earlier in the context, when [A] recurs the circuit attends back to the token after the previous occurrence and promotes [B] as the next prediction. Implements a basic form of in-context learning, and is one of the cleanest circuits identified in real Transformers.
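
The algorithm an induction head implements can be written down directly; a plain-Python sketch (the function name and example tokens are made up for illustration):

```python
def induction_prediction(tokens: list[str]) -> str | None:
    """Sketch of the algorithm an induction head implements:
    if the current token appeared earlier as ... [A][B] ... [A],
    predict [B] (copy the token that followed the last match)."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards over the context
        if tokens[i] == current:
            return tokens[i + 1]               # copy the continuation
    return None

print(induction_prediction(["The", "cat", "sat", ".", "The"]))  # -> "cat"
```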

Superposition (Elhage et al. 2022): networks may represent more features than they have neurons by encoding features as overlapping linear directions. Polysemanticity, a single neuron firing for multiple unrelated concepts, is the symptom.
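
A toy numpy illustration of the idea, with arbitrary dimensions: random directions in a low-dimensional space are nearly but not exactly orthogonal, so many features can share few neurons at the cost of some interference:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 100, 20          # more features than dimensions

# Each feature gets a random unit direction in the d_model-dimensional space.
directions = rng.standard_normal((n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Random directions in moderately high dimensions are *almost* orthogonal,
# so sparse combinations of features coexist with limited interference.
overlaps = directions @ directions.T
off_diag = overlaps[~np.eye(n_features, dtype=bool)]
print(f"mean |overlap| between distinct features: {np.abs(off_diag).mean():.3f}")

# A single "neuron" (basis dimension) carries weight from many feature
# directions -- the polysemanticity symptom.
print(f"features with |weight| > 0.1 on neuron 0: {(np.abs(directions[:, 0]) > 0.1).sum()}")
```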

Sparse autoencoders (SAEs): a recent (2023–2024) technique that addresses superposition. Train a wide, overcomplete autoencoder on a layer's activations with a sparsity penalty, producing large numbers of substantially monosemantic feature directions. Cunningham et al. (2023) demonstrated interpretable SAE features in language models, and Templeton et al. (2024) scaled the method to millions of features in a production LLM (Claude 3 Sonnet). This is currently the most promising scalable interpretability approach.
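
A minimal sketch of one SAE training step, assuming toy dimensions and an L1 sparsity penalty (architecture and training details vary across papers; everything here is illustrative):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sketch of an overcomplete sparse autoencoder on a layer's activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # d_hidden >> d_model (overcomplete)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))     # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_model=512, d_hidden=8192)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                        # sparsity penalty weight (illustrative)

acts = torch.randn(64, 512)                            # stand-in for a layer's activations
recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()
loss.backward()
optimizer.step()
```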

Activation and attribution patching: causal-intervention methods. To check whether component $X$ matters for behaviour $Y$, replace $X$'s activation with the activation from another input (the "patch") and observe whether $Y$ changes. Attribution patching approximates this intervention with gradients so that every component can be scored in a few passes, giving a scalable stand-in for full causal-tracing experiments (see also automated circuit discovery, Conmy et al. 2023).
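
A minimal sketch of the underlying activation-patching intervention on a toy PyTorch model, using forward hooks (the model and choice of component are placeholders; attribution patching would replace the intervention with a gradient-based estimate of its effect):

```python
import torch
import torch.nn as nn

# Toy model standing in for "the network"; names are illustrative.
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
component = model[0]                      # the component X being tested

clean_input = torch.randn(1, 16)
corrupt_input = torch.randn(1, 16)

# 1. Cache X's activation on the corrupt (counterfactual) input.
cached = {}
h = component.register_forward_hook(lambda m, i, o: cached.update(act=o.detach()))
model(corrupt_input)
h.remove()

# 2. Re-run on the clean input, but overwrite X's output with the cached
#    activation (the "patch") and see how much the behaviour Y (here, the
#    final output) changes.
h = component.register_forward_hook(lambda m, i, o: cached["act"])
patched_out = model(clean_input)
h.remove()

clean_out = model(clean_input)
print("effect of patching X:", (patched_out - clean_out).item())
```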

Activation steering (Turner et al. 2023): adding a vector to the residual stream during the forward pass biases the model's behaviour. It works because features are roughly linear directions; offsetting along a feature direction nudges that feature's activation. Used both for interpretability research and for behaviour modification.
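
A minimal sketch using a PyTorch forward hook on a toy model; the steering vector and strength are placeholders for, e.g., a difference of mean activations between two contrasting sets of prompts:

```python
import torch
import torch.nn as nn

# Toy model; in practice the hook would sit on a residual-stream point of a real LLM.
model = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32))
layer = model[0]

# Hypothetical steering vector; here it is just random for illustration.
steering_vector = torch.randn(32)
alpha = 4.0                                     # steering strength (illustrative)

def steer(module, inputs, output):
    # Offset the activation along the feature direction during the forward pass.
    return output + alpha * steering_vector

handle = layer.register_forward_hook(steer)
steered = model(torch.randn(1, 32))             # output biased along the feature direction
handle.remove()
```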

Practical applications:

  • Safety auditing: detect whether the model internally represents goals or knowledge incompatible with deployment requirements.
  • Debugging: find why the model behaves badly on certain inputs.
  • Behaviour modification: steer the model away from unwanted behaviours via activation interventions.
  • Compliance: provide auditable explanations for high-stakes decisions.

Limitations:

  • Does not yet scale: full interpretation of frontier-scale models remains far off; SAEs cover features but not circuits at scale.
  • Subjective: identifying a "feature" requires human judgment.
  • Polysemanticity is a serious obstacle that SAEs partly solve but not fully.
  • Causal vs correlational: many interpretability techniques find correlations between activations and behaviour without proving causation.

Mechanistic interpretability is the central technical research programme of Anthropic's safety team and a major thread at OpenAI, Google DeepMind, METR, Apollo Research, and Goodfire. The field has grown from a niche concern in 2020 to a multi-hundred-researcher subfield by 2026.

Related terms: Sparse Autoencoder (interpretability), Induction Head, Residual Stream, Christopher Olah, AI Safety
