The residual stream is the running activation vector in a Transformer, the value of the residual connection that flows through every block. Each layer reads from the residual stream (via attention queries/keys/values or MLP input), computes, and writes back an additive update.
For a Transformer with hidden dimension $d$ and $L$ layers, the residual stream at position $t$ after layer $l$ is
$$h_t^{(l)} = x_t^{(l-1)} + \mathrm{Attn}_l\big(x^{(l-1)}\big)_t, \qquad x_t^{(l)} = h_t^{(l)} + \mathrm{MLP}_l\big(h_t^{(l)}\big)$$
(Pre-norm formulation; the layer norms at the input of Attn and MLP are omitted for clarity. The attention sublayer at position $t$ reads keys and values from all positions up to $t$, hence it takes the full $x^{(l-1)}$ rather than $x_t^{(l-1)}$ alone.)
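The update rule can be sketched numerically. The sketch below uses random placeholder weights and a single position, so the "attention" sublayer degenerates to a linear map; this is an assumption for illustration, not a trained model:

```python
# Toy pre-norm block acting on the residual stream.
# All weights are random placeholders (illustration only, not a trained model).
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden dimension

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-5)

# Stand-ins for the sublayers: at a single position, "attention"
# reduces to a linear map on the normalised residual stream.
W_attn = rng.normal(0, 0.02, (d, d))
W_in = rng.normal(0, 0.02, (d, 4 * d))
W_out = rng.normal(0, 0.02, (4 * d, d))

def attn(x):
    return layer_norm(x) @ W_attn

def mlp(x):
    h = np.maximum(0, layer_norm(x) @ W_in)  # ReLU
    return h @ W_out

x = rng.normal(size=d)        # residual stream entering the block
x_mid = x + attn(x)           # attention writes an additive update
x_out = x_mid + mlp(x_mid)    # MLP writes another additive update
```

Note that the stream itself is never overwritten: each sublayer's contribution is purely additive, so `x_out - x` is exactly the sum of the two updates.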
In mechanistic interpretability (Elhage et al. 2021, Anthropic), the residual stream is the central object of analysis. Properties:
Linear superposition: features are encoded as roughly linear directions in residual-stream space. Reading a feature: project onto the corresponding direction. Writing: add along the direction.
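The read/write operations can be sketched for a hypothetical unit feature direction $v$ (the direction and the residual vector below are random; only the projection/addition mechanics are the point):

```python
# Reading and writing a (hypothetical) feature direction in the residual stream.
import numpy as np

rng = np.random.default_rng(0)
d = 16

v = rng.normal(size=d)
v /= np.linalg.norm(v)     # unit-norm feature direction (assumed, not learned)

x = rng.normal(size=d)     # residual stream vector

# Read: project onto the direction -> scalar feature activation.
activation = x @ v

# Write: add along the direction to increase the feature by delta.
delta = 2.0
x_new = x + delta * v
```

Because $v$ has unit norm, the feature's activation after the write is exactly `activation + delta`, while components of `x` orthogonal to $v$ are untouched.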
High-bandwidth communication channel: every layer can write to and read from the stream; information flows freely between distant layers without going through any single computational bottleneck.
Polysemanticity: with $D$ features in $d$ residual-stream dimensions ($D \gg d$ in practice), individual basis directions are not features; each direction carries pieces of many features. Sparse autoencoders decompose the residual stream into sparse, monosemantic features.
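A minimal forward-pass sketch of such a decomposition. The weights here are random (a real sparse autoencoder is trained with a reconstruction loss plus a sparsity penalty on the feature activations), so only the shapes and the ReLU sparsification are meaningful:

```python
# Sparse-autoencoder forward pass over a residual-stream vector.
# Weights are random placeholders: an untrained sketch, not a real SAE.
import numpy as np

rng = np.random.default_rng(0)
d, D = 16, 128                 # residual dim d, feature dictionary size D >> d

W_enc = rng.normal(0, 0.1, (d, D))
W_dec = rng.normal(0, 0.1, (D, d))
b_enc = np.full(D, -0.2)       # negative bias pushes small activations to zero

x = rng.normal(size=d)                  # residual stream vector
f = np.maximum(0, x @ W_enc + b_enc)    # sparse, non-negative feature activations
x_hat = f @ W_dec                       # reconstruction of the residual stream

sparsity = (f > 0).mean()               # fraction of active features
```

Training drives `x_hat` toward `x` while keeping `sparsity` low, so that each of the $D$ learned directions tends to fire for one interpretable feature.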
Steering vectors: adding a vector $v$ to the residual stream at deployment biases behaviour along the corresponding feature direction; a practical lever for behaviour modification without retraining.
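A toy sketch of the mechanics, assuming a random unembedding matrix and steering along a single hypothetical token's unembed direction (real steering vectors are typically derived from activation differences, not from the unembedding):

```python
# Steering sketch: add a scaled vector to the residual stream before
# unembedding and observe the favoured token's logit rise.
# W_U and the target token are placeholders (toy model, not a real LM).
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 50

W_U = rng.normal(0, 0.1, (d, vocab))   # unembedding matrix
target = 7                             # hypothetical token the feature promotes
v = W_U[:, target].copy()              # steer along that token's unembed direction
v /= np.linalg.norm(v)

x = rng.normal(size=d)                 # residual stream at the final layer
alpha = 3.0                            # steering strength

logits_base = x @ W_U
logits_steered = (x + alpha * v) @ W_U  # same forward pass, shifted stream
```

The target logit increases by exactly `alpha * v @ W_U[:, target]`, which is positive by construction; `alpha` trades off steering strength against disruption of the model's other behaviour.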
Logit lens (nostalgebraist 2020): unembed the residual stream at intermediate layers to see what the model is "thinking", a useful visualisation showing predictions sharpening through the layers.
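A toy sketch of the procedure with random placeholder layers; only the mechanics (apply the final layer norm and unembedding to each intermediate residual) carry over to real models:

```python
# Logit-lens sketch: unembed the residual stream after every layer.
# Blocks and unembedding are random placeholders, so the "predictions"
# are meaningless; the point is the per-layer readout procedure.
import numpy as np

rng = np.random.default_rng(0)
d, vocab, L = 16, 50, 4

W_U = rng.normal(0, 0.1, (d, vocab))   # unembedding matrix

def final_ln(x):
    return (x - x.mean()) / (x.std() + 1e-5)

def block(x, W):
    # Stand-in for a Transformer block's additive update.
    return x + np.tanh(x @ W)

x = rng.normal(size=d)                 # residual stream after embedding
for l in range(L):
    W = rng.normal(0, 0.1, (d, d))
    x = block(x, W)
    logits = final_ln(x) @ W_U         # "what would the model predict now?"
    print(f"layer {l}: top token {logits.argmax()}")
```

In a trained model, plotting these per-layer distributions typically shows the prediction sharpening toward the final output as depth increases.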
The residual stream is now the unit of analysis for most interpretability work on Transformers.
Related terms: Residual Connection, Mechanistic Interpretability, Sparse Autoencoder (interpretability), Transformer
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety