The residual stream is the running activation vector in a Transformer, the value of the residual connection that flows through every block. Each layer reads from the residual stream (via attention queries/keys/values or MLP input), computes, and writes back an additive update.
For a Transformer with hidden dimension $d$ and $L$ layers, the residual stream at position $t$ after layer $l$ is
$$h_t^{(l)} = x_t^{(l-1)} + \mathrm{Attn}_l\big(x^{(l-1)}\big)_t, \qquad x_t^{(l)} = h_t^{(l)} + \mathrm{MLP}_l\big(h_t^{(l)}\big)$$
(Pre-norm formulation; the layer norms at the input of Attn and MLP are omitted for clarity. The attention sublayer at position $t$ reads keys and values from all positions up to $t$, hence it takes the full $x^{(l-1)}$ rather than $x_t^{(l-1)}$ alone.)
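The update rule can be sketched numerically. The sketch below uses random placeholder weights and a single position, so the "attention" sublayer degenerates to a linear map; this is an assumption for illustration, not a trained model:

```python
# Toy pre-norm block acting on the residual stream.
# All weights are random placeholders (illustration only, not a trained model).
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden dimension

def layer_norm(x):
    return (x - x.mean()) / (x.std() + 1e-5)

# Stand-ins for the sublayers: at a single position, "attention"
# reduces to a linear map on the normalised residual stream.
W_attn = rng.normal(0, 0.02, (d, d))
W_in = rng.normal(0, 0.02, (d, 4 * d))
W_out = rng.normal(0, 0.02, (4 * d, d))

def attn(x):
    return layer_norm(x) @ W_attn

def mlp(x):
    h = np.maximum(0, layer_norm(x) @ W_in)  # ReLU
    return h @ W_out

x = rng.normal(size=d)        # residual stream entering the block
x_mid = x + attn(x)           # attention writes an additive update
x_out = x_mid + mlp(x_mid)    # MLP writes another additive update
```

Note that the stream itself is never overwritten: each sublayer's contribution is purely additive, so `x_out - x` is exactly the sum of the two updates.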
In mechanistic interpretability (Elhage et al. 2021, Anthropic), the residual stream is the central object of analysis. Properties:
Linear superposition: features are encoded as roughly linear directions in residual-stream space. Reading a feature: project onto the corresponding direction. Writing: add along the direction.
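The read/write operations can be sketched for a hypothetical unit feature direction $v$ (the direction and the residual vector below are random; only the projection/addition mechanics are the point):

```python
# Reading and writing a (hypothetical) feature direction in the residual stream.
import numpy as np

rng = np.random.default_rng(0)
d = 16

v = rng.normal(size=d)
v /= np.linalg.norm(v)     # unit-norm feature direction (assumed, not learned)

x = rng.normal(size=d)     # residual stream vector

# Read: project onto the direction -> scalar feature activation.
activation = x @ v

# Write: add along the direction to increase the feature by delta.
delta = 2.0
x_new = x + delta * v
```

Because $v$ has unit norm, the feature's activation after the write is exactly `activation + delta`, while components of `x` orthogonal to $v$ are untouched.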
High-bandwidth communication channel: every layer can write to and read from the stream; information flows freely between distant layers without going through any single computational bottleneck.
Polysemanticity: with $D$ features in $d$ residual-stream dimensions ($D \gg d$ in practice), individual basis directions are not features; each direction carries pieces of many features. Sparse autoencoders decompose the residual stream into sparse, monosemantic features.
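A minimal forward-pass sketch of such a decomposition. The weights here are random (a real sparse autoencoder is trained with a reconstruction loss plus a sparsity penalty on the feature activations), so only the shapes and the ReLU sparsification are meaningful:

```python
# Sparse-autoencoder forward pass over a residual-stream vector.
# Weights are random placeholders: an untrained sketch, not a real SAE.
import numpy as np

rng = np.random.default_rng(0)
d, D = 16, 128                 # residual dim d, feature dictionary size D >> d

W_enc = rng.normal(0, 0.1, (d, D))
W_dec = rng.normal(0, 0.1, (D, d))
b_enc = np.full(D, -0.2)       # negative bias pushes small activations to zero

x = rng.normal(size=d)                  # residual stream vector
f = np.maximum(0, x @ W_enc + b_enc)    # sparse, non-negative feature activations
x_hat = f @ W_dec                       # reconstruction of the residual stream

sparsity = (f > 0).mean()               # fraction of active features
```

Training drives `x_hat` toward `x` while keeping `sparsity` low, so that each of the $D$ learned directions tends to fire for one interpretable feature.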
Steering vectors: adding a vector $v$ to the residual stream at deployment biases behaviour along the corresponding feature direction; a practical lever for behaviour modification without retraining.
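A toy sketch of the mechanics, assuming a random unembedding matrix and steering along a single hypothetical token's unembed direction (real steering vectors are typically derived from activation differences, not from the unembedding):

```python
# Steering sketch: add a scaled vector to the residual stream before
# unembedding and observe the favoured token's logit rise.
# W_U and the target token are placeholders (toy model, not a real LM).
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 16, 50

W_U = rng.normal(0, 0.1, (d, vocab))   # unembedding matrix
target = 7                             # hypothetical token the feature promotes
v = W_U[:, target].copy()              # steer along that token's unembed direction
v /= np.linalg.norm(v)

x = rng.normal(size=d)                 # residual stream at the final layer
alpha = 3.0                            # steering strength

logits_base = x @ W_U
logits_steered = (x + alpha * v) @ W_U  # same forward pass, shifted stream
```

The target logit increases by exactly `alpha * v @ W_U[:, target]`, which is positive by construction; `alpha` trades off steering strength against disruption of the model's other behaviour.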
Logit lens (nostalgebraist 2020): unembed the residual stream at intermediate layers to see what the model is "thinking", a useful visualisation showing predictions sharpening through the layers.
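A toy sketch of the procedure with random placeholder layers; only the mechanics (apply the final layer norm and unembedding to each intermediate residual) carry over to real models:

```python
# Logit-lens sketch: unembed the residual stream after every layer.
# Blocks and unembedding are random placeholders, so the "predictions"
# are meaningless; the point is the per-layer readout procedure.
import numpy as np

rng = np.random.default_rng(0)
d, vocab, L = 16, 50, 4

W_U = rng.normal(0, 0.1, (d, vocab))   # unembedding matrix

def final_ln(x):
    return (x - x.mean()) / (x.std() + 1e-5)

def block(x, W):
    # Stand-in for a Transformer block's additive update.
    return x + np.tanh(x @ W)

x = rng.normal(size=d)                 # residual stream after embedding
for l in range(L):
    W = rng.normal(0, 0.1, (d, d))
    x = block(x, W)
    logits = final_ln(x) @ W_U         # "what would the model predict now?"
    print(f"layer {l}: top token {logits.argmax()}")
```

In a trained model, plotting these per-layer distributions typically shows the prediction sharpening toward the final output as depth increases.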
The residual stream is now the unit of analysis for most interpretability work on Transformers.
Related terms: Residual Connection, Mechanistic Interpretability, Sparse Autoencoder (interpretability), Transformer
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety