16.11 Mechanistic interpretability

A trained neural network is a vast pile of weights. After pre-training a frontier model on trillions of tokens, what we possess is several hundred gigabytes of floating-point numbers arranged in matrices, plus a forward-pass recipe. The model behaves intelligently when we run that recipe; whether the recipe contains anything we would recognise as understanding, planning, deception or knowledge of its own situation is a question we cannot answer by looking at the weights with our eyes. Mechanistic interpretability is the research programme that tries to answer it anyway, by reverse-engineering learned circuits at the level of individual neurons, attention heads and feature directions in activation space. The work is concentrated at Anthropic's interpretability team (Chris Olah, Catherine Olsson, Adam Jermyn, Trenton Bricken, Adly Templeton and colleagues), at Google DeepMind (Neel Nanda's mechanistic interpretability team), and at the safety-oriented research labs grouped around Redwood Research and Apollo Research.

Section 16.10 examined data poisoning and backdoors, where an adversary inserts hidden behaviours that are invisible from the input-output level. The natural inverse is to ask whether we can read the behaviour off the weights themselves, without trusting any black-box test. Mechanistic interpretability is that reverse-engineering programme.

What "mechanistic" interpretability means

The first generation of interpretability work, mostly from 2014 to 2019, was correlational. Saliency maps highlighted which input pixels a classifier "looked at"; feature visualisation produced trippy images that maximally activated a chosen unit; LIME and SHAP fitted local linear approximations around individual predictions. These methods are useful but they are observational. They tell you what the model attends to without telling you how the attention is computed or why a particular weight matrix produces a particular feature. A saliency map for a sandwich classifier will dutifully highlight the sandwich, but it cannot tell you whether the network is detecting bread, edges, lighting or the photographer's bias toward kitchen counters.

Mechanistic interpretability sets a stricter bar. It seeks causal understanding: which weights compute what, in what circuit, with what computational role. The aspiration, as articulated in Olah's Zoom In essay (2020), is to do for neural networks what early-twentieth-century cell biology did for the cell: identify the components, name them, draw the wiring diagram, and verify the diagram by intervention. A mechanistic claim has the form "this set of weights, in this layer, implements this algorithm on these features", and it is verified by ablating, patching or replacing the weights and watching the predicted consequences play out.

The methodological move that makes this tractable is the idea of a circuit: a small subgraph of the model, a handful of neurons or attention heads, plus the weights that connect them, that collectively implements an identifiable computation. If we can decompose a model into circuits the way a logic engineer decomposes a chip into adders and multiplexers, we have a chance of understanding the whole. The wager is that circuits exist, that they are reasonably modular, and that the same circuits recur across models trained on similar data. Olah calls this last claim universality, and it is the empirical bet that justifies the entire research programme: if every model were idiosyncratic, no general lessons would ever accumulate.

The cell-biology analogy is worth pressing a little further. A nineteenth-century anatomist staring down a microscope did not begin with a theory of the cell; she began with a stain, a lens, and the discipline of looking at the same tissue many times until structures resolved. Mechanistic interpretability is closer to that stage of inquiry than to mature physics. Its primary instruments (activation patching, gradient-based attribution, sparse dictionaries, attention-pattern visualisation) are the stains and lenses, and the field's accumulated catalogue of features and circuits is the early atlas. Importantly, the practitioner cannot prove in advance that the catalogue is complete; she can only check that the entries she has identified behave as predicted under intervention, and that the residual unexplained behaviour shrinks as the catalogue grows.

Induction heads

The first widely accepted mechanistic story for a transformer-scale phenomenon was induction heads, identified by Olsson, Elhage, Nanda and colleagues at Anthropic in 2022. The setting is in-context learning, the curious capacity of large language models to pick up a pattern from a few examples in the prompt and continue it. Where does that capacity live in the weights?

An induction head is a two-layer attention-head circuit. Suppose the prompt contains the sequence "... A B ... A". The first head, sitting in an earlier transformer layer, attends from each token to the token immediately before it; this builds, in each token's residual stream, a "previous-token" memory. The second head, in a later layer, performs the actual induction. From the current "A" position, it queries for positions whose previous-token memory matches "A". That match occurs at the position of "B", the token immediately after the earlier "A", so the attention lands on "B" and the head copies it forward into the prediction. The circuit thereby implements the rule "if you saw A B once and you now see A again, predict B".
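
The behavioural signature of an induction head is easy to measure: feed the model a block of random tokens repeated twice and check how much each head attends, from positions in the second copy, to the token immediately after the earlier occurrence. The sketch below assumes you have already extracted a single head's attention pattern as a [seq, seq] tensor (for instance via a forward hook on the attention weights); the function name and usage are illustrative, not taken from Olsson et al.

```python
import torch

def induction_score(pattern: torch.Tensor, period: int) -> float:
    """Mean attention from each query position in the second copy of a
    repeated random sequence to the token *after* the previous occurrence
    of the current token, which is the position an induction head targets.

    pattern: [seq, seq] attention weights for one head (rows are queries).
    period:  length of the repeated block; the input is assumed to be a
             random token sequence concatenated with itself (seq = 2 * period).
    """
    seq = pattern.shape[-1]
    scores = []
    for q in range(period, seq):      # query positions in the second copy
        k = q - period + 1            # the token after the earlier match
        scores.append(pattern[q, k].item())
    return sum(scores) / len(scores)

# On repeated random tokens, induction heads score close to 1.0;
# heads without this behaviour score near 1 / seq.
```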

What makes the discovery convincing is not just the existence of the circuit but its developmental signature. During transformer pre-training, the loss curve has a distinctive bend, a sudden, sharp drop, typically early in training. Olsson et al. show that the bend coincides almost exactly with the formation of induction heads, that ablating those heads abolishes the in-context-learning improvement responsible for the drop, and that the same circuit appears in every transformer they examined, from a two-layer toy model to a multi-billion-parameter production system. This is mechanistic interpretability operating at full strength: a named, reproducible, causally verified circuit, with universality across scales, accounting for an emergent capability.

Subsequent work has extended the picture. Some induction heads are fuzzy: they match on semantic similarity rather than token identity, supporting the kind of analogical completion that makes few-shot prompting work for paraphrases and translations. Others compose into longer chains, performing multi-step pattern lookups. The circuit is no longer the surprising finding; it is the type specimen for what a mechanistic explanation of a transformer behaviour can look like.

Sparse autoencoders

A central obstacle to interpreting neural networks is superposition. Models with $d$ neurons routinely represent many more than $d$ distinguishable features, by assigning each feature its own direction in activation space, relying on features being only sparsely active, and accepting some interference between them. The consequence is that individual neurons are polysemantic: a single neuron lights up for "the colour blue", "academic citations" and "the number seven", and no clean human-readable concept corresponds to it. Polysemantic neurons are the rule, not the exception, and they are why feature visualisation produces such surreal images.
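
The geometry behind superposition can be illustrated in a few lines: random directions in a $d$-dimensional space are nearly orthogonal, so far more than $d$ features can share the space with tolerable interference, provided only a few are active at once. The numbers below (64 dimensions, 512 features, 4 active) are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
d, n_features, k_active = 64, 512, 4           # 512 features stored in 64 dimensions

# Give each feature a random unit direction in activation space.
directions = torch.nn.functional.normalize(torch.randn(n_features, d), dim=1)

# A sparse "true" feature vector: only k_active features are switched on.
true = torch.zeros(n_features)
true[torch.randperm(n_features)[:k_active]] = 1.0

# The superposed activation is the sum of the active features' directions.
activation = true @ directions                 # shape [d]

# Naive linear read-out: project the activation onto every feature direction.
readout = directions @ activation              # shape [n_features]

print(readout[true.bool()].mean())             # close to 1.0 for the active features
print(readout[~true.bool()].abs().mean())      # small interference on the other 508
```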

The breakthrough of 2023–2024 was to address superposition with sparse autoencoders (SAEs), independently demonstrated by Bricken, Templeton and colleagues at Anthropic and by Cunningham, Ewart and colleagues in academic work. The idea is simple. Take the model's intermediate activations. Train a wide, shallow autoencoder on top of them, with hidden dimension several times larger than the activation dimension and an L1 penalty on the hidden activations to enforce sparsity. The encoder tries to express each activation as a sparse combination of dictionary directions; the decoder reconstructs from those directions; the L1 penalty forces the dictionary to consist of features that activate rarely. Empirically, the dictionary the autoencoder learns is dramatically more monosemantic than the original neurons: each feature corresponds to a single, nameable concept. One direction lights up for "DNA sequences", another for "code with security vulnerabilities", another for "the Golden Gate Bridge".
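
The recipe translates into a very small model. The sketch below is a minimal version, with an arbitrary expansion factor and L1 coefficient chosen for illustration; production SAEs add refinements described in the papers, such as normalising decoder columns and resampling dead features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: an over-complete dictionary over model activations,
    a ReLU encoder, and a linear decoder."""

    def __init__(self, d_model: int, expansion: int = 8):
        super().__init__()
        d_dict = d_model * expansion            # dictionary several times wider
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model, bias=False)

    def forward(self, x):
        f = torch.relu(self.encoder(x))         # sparse feature activations
        x_hat = self.decoder(f)                 # reconstruction of the activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # reconstruction error
    sparsity = f.abs().sum(dim=-1).mean()           # L1 penalty enforcing sparsity
    return recon + l1_coeff * sparsity
```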

Anthropic's Scaling Monosemanticity paper (Templeton, Conerly et al., May 2024) applied the technique to Claude 3 Sonnet and recovered tens of millions of features. The headline demonstration was Golden Gate Claude: clamping a single bridge-related feature to a high value made the model insist, in every reply, that it was the Golden Gate Bridge. The point of the demonstration was not the joke but the causal lever, a single direction in activation space, identified entirely by an unsupervised dictionary-learning procedure, with a clean and predictable behavioural effect when intervened upon. Subsequent work has refined the SAE recipe (top-K SAEs, gated SAEs, transcoders) and applied it to feature graphs in multi-step reasoning, but the core finding stands: superposition is real, dictionary learning unpacks it, and the dictionary entries are interpretable.
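
The clamping intervention is equally compact. A sketch, reusing the SparseAutoencoder from the previous block and treating the feature index and clamp value as placeholders: encode the activation, force one feature to a chosen value, decode, and keep the part of the activation the SAE could not explain.

```python
import torch

def clamp_feature(activation, sae, feature_idx: int, value: float):
    """Steer the residual stream by clamping one SAE feature.
    `sae` is assumed to be the SparseAutoencoder sketched above."""
    x_hat, f = sae(activation)
    error = activation - x_hat            # keep the part the SAE fails to explain
    f = f.clone()
    f[..., feature_idx] = value           # clamp the chosen feature
    return sae.decoder(f) + error         # steered activation, same shape as input

# Hypothetical usage: register a forward hook at the chosen layer and replace
# that layer's output with clamp_feature(output, sae, bridge_feature_idx, 10.0).
```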

Circuit-level analysis

With features in hand, the next step is to reverse-engineer the paths by which features in earlier layers cause features in later layers, and through them the model's outputs. The canonical circuit-level result outside Anthropic is Wang, Variengien, Conmy, Shlegeris and Steinhardt's 2022 analysis of indirect-object identification in GPT-2-small. Given the prompt "When Mary and John went to the shops, John gave a drink to", a competent model predicts "Mary". Wang et al. identified the precise circuit responsible: name-mover heads that attend to the indirect-object name, S-inhibition heads that suppress the subject, duplicate-token heads that flag repeated names, and previous-token and induction heads in supporting roles. The circuit was verified by activation patching, by ablation, and by checking that the same heads in the same roles handled minor variants of the task.
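
The verification loop can be sketched in a few lines. The snippet below uses the open-source TransformerLens library (HookedTransformer and its hook API); the layer and head indices are placeholders for illustration, not a claim about which heads Wang et al. identified. It zero-ablates one head and measures how far the logit difference between the correct and incorrect name falls.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")   # GPT-2 small
prompt = "When Mary and John went to the shops, John gave a drink to"
tokens = model.to_tokens(prompt)
mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")

def logit_diff(logits):
    # Correct completion (" Mary") minus distractor (" John") at the last position.
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

baseline = logit_diff(model(tokens))

LAYER, HEAD = 9, 6            # placeholder indices, for illustration only

def zero_head(z, hook):
    z[:, :, HEAD, :] = 0.0    # z has shape [batch, pos, head, d_head]
    return z

ablated = logit_diff(model.run_with_hooks(
    tokens, fwd_hooks=[(utils.get_act_name("z", LAYER), zero_head)]))

print(baseline, ablated)      # a large drop implicates this head in the circuit
```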

Anthropic's Tracing the Thoughts of a Large Language Model (2025) carried the methodology to Claude 3.5 Haiku, producing human-readable feature graphs for two-hop reasoning ("the capital of the state containing Dallas" → "Texas" → "Austin") and for arithmetic, planning and refusal behaviours. The graphs are illustrative rather than complete; the team's stated framing is that they account for a fraction of the model's computation on each task. Even partial circuit graphs, however, are revealing: they show, for example, that Claude routes refusal decisions through identifiable harm-classification features, and that those features can be intervened on directly.

A key methodological tool here is causal scrubbing, developed at Redwood Research in 2022. Given a candidate hypothesis of the form "circuit C implements function F", the scrubbing procedure replaces C's intermediate activations with samples drawn from a distribution under which F is invariant; if the model's output distribution is unchanged, the hypothesis survives. Causal scrubbing turns interpretability from descriptive storytelling into something resembling falsifiable science. Activation patching is the complementary local technique: replace activations at a chosen layer with those from a different input, observe how the prediction changes, and triangulate the computation responsible. Together, the two techniques give a workable empirical pipeline: sparse autoencoders to find the right features, patching and scrubbing to verify their causal roles, and feature graphs to assemble the verified pieces into circuit-level claims.
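
Activation patching itself needs only a clean prompt, a corrupted prompt and one hook. A sketch, again written against the TransformerLens hook API, with the layer and token position chosen purely for illustration: run the corrupted prompt, patch in the clean run's residual-stream activation at one site, and see how much of the clean behaviour returns.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("When Mary and John went to the shops, John gave a drink to")
corrupt = model.to_tokens("When Anna and John went to the shops, John gave a drink to")

# Cache every activation from the clean run.
_, clean_cache = model.run_with_cache(clean)

LAYER, POS = 8, -1            # illustrative choices of layer and token position

def patch_resid(resid, hook):
    # resid: [batch, pos, d_model]; overwrite one position with the clean value.
    resid[:, POS, :] = clean_cache[hook.name][:, POS, :]
    return resid

patched_logits = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(utils.get_act_name("resid_pre", LAYER), patch_resid)],
)
# If patching this site restores the " Mary" prediction on the corrupted prompt,
# the information identifying the indirect object flows through it.
```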

What this is good for

Mechanistic interpretability is not yet a deployed safety technique, but its potential applications are concrete enough to motivate the research investment. The most-discussed use is deception detection: if a model's plan to behave well during evaluation and badly afterwards is implemented in some circuit, an interpretability tool that can locate "the model believes it is being evaluated" features offers a layer of defence that no behavioural test can. Anthropic researchers have demonstrated proof-of-concept "I am being tested" features in toy models; whether such features generalise cleanly to frontier systems is open.

Beyond deception, interpretability supports circuit-level verification: proving that a model is not using a prohibited reasoning path, for instance that a medical-advice model is not relying on demographic features when triaging. It supports direct editing: ROME-style model editing uses causal tracing to locate a fact and a rank-one weight update to change it without retraining (Meng et al., 2022), and feature ablations can suppress whole concepts; one can imagine a clinical deployment in which a particular cluster of features known to drive racially biased pain assessments is suppressed before the model is shown to a junior doctor. It illuminates emergent capabilities: a sudden capability jump at a particular scale is less alarming, and easier to plan for, when the circuit responsible can be identified before deployment, and pre-deployment interpretability evaluation is one component of an emerging responsible scaling policy (covered in §16.17). Finally, interpretability serves as a cross-check on other safety measures, giving an independent reading on whether RLHF or Constitutional AI training has actually internalised the desired values rather than merely papered over them: a model that refuses harmful requests because the refusal feature has been amplified is in a very different epistemic state from one that has internalised the underlying ethics, and only a circuit-level look can tell the difference.

Limits

The summary, from the Anthropic interpretability team's own Towards Monosemanticity introduction, is that we can identify features, we can sometimes intervene on them, and we cannot yet read off what a model is going to do. Three limits deserve flagging. First, scale: frontier models contain billions of parameters and tens of millions of monosemantic features, and the labour of curating, labelling and verifying that many features remains substantial even with automated assistance; auto-interpretability pipelines that use other LLMs to label features help, but they introduce their own circularity. Second, generalisation: most published circuit results are for small models, narrow tasks or specific layers, and it is unclear how cleanly the techniques transfer across architectures (mixture-of-experts, state-space models, multimodal stacks). Third, the gap between post-hoc explanation and predictive guarantee: adversarial examples teach us that off-distribution behaviour is hard to extrapolate from on-distribution behaviour. Interpretability aspires to predictive guarantees (given the circuit, we can rule out behaviours) but as of 2026 has demonstrated only post-hoc explanation. Bridging that gap, on models that matter, is the field's central open problem.

What you should take away

  1. Mechanistic interpretability seeks causal explanations of what specific weights compute, going beyond saliency and feature-visualisation methods that are merely correlational.
  2. Induction heads (Olsson et al., 2022) are the canonical worked example: a two-head circuit implementing in-context pattern completion, identified universally across transformer scales and tied to a sharp loss-curve bend during training.
  3. Sparse autoencoders (Bricken, Templeton et al., 2023–2024) address the superposition problem by learning over-complete monosemantic dictionaries from intermediate activations; Scaling Monosemanticity recovered millions of such features from Claude 3 Sonnet.
  4. Circuit-level analysis combines features with activation patching to reverse-engineer specific computations (indirect-object identification in GPT-2, two-hop reasoning in Claude) and is the empirical bridge to predictive claims about model behaviour.
  5. The field has plausible long-run uses (deception detection, circuit-level verification, direct editing) but as of 2026 has not closed the gap between explaining what a model just did and guaranteeing what it will do; treat interpretability as a promising research direction, not a deployed safety technique.
