13.4 Self-, cross-, and encoder–decoder attention

§13.2 introduced scaled dot-product attention as a single mathematical operation that takes three matrices, queries $\mathbf{Q}$, keys $\mathbf{K}$ and values $\mathbf{V}$, and produces a context-aware mixture. The arithmetic does not care where those three matrices come from. Attention is, in this sense, a piece of plumbing: three input ports and one output. The interesting question is which sequence we plug into each port.

That choice has names. When all three ports draw from the same sequence, we call the configuration self-attention. When the queries come from one sequence and the keys and values from another, we call it cross-attention. The classic Transformer decoder uses a particular flavour of cross-attention, where the decoder attends to the encoder's final hidden states; that pattern has its own name, encoder–decoder attention. Together, these three configurations cover nearly every modern Transformer use, from BERT and GPT through T5, BART, CLIP and Flamingo. Understanding the wiring is the first step in reading any new architecture diagram fluently.

Symbols Used Here
$\mathbf{Q}, \mathbf{K}, \mathbf{V}$: query, key, value matrices

Self-attention

Self-attention is the configuration in which $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ are all derived, by separate linear projections, from a single input sequence $\mathbf{X} \in \mathbb{R}^{n \times d}$. Concretely,

$$\mathbf{Q} = \mathbf{X} \mathbf{W}_Q, \qquad \mathbf{K} = \mathbf{X} \mathbf{W}_K, \qquad \mathbf{V} = \mathbf{X} \mathbf{W}_V,$$

with three learnt weight matrices. Each token in $\mathbf{X}$ generates a query about the rest of the sequence, a key advertising what it has to offer, and a value to be passed along if attended to. The output is a new sequence in which every token has been re-expressed as a weighted mixture of the others.
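
As a concrete illustration, here is a minimal single-head sketch in PyTorch. The toy dimensions and randomly initialised weight matrices are placeholders for learnt parameters, not a reference implementation:

```python
import math
import torch

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention: queries, keys and values all come from X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                        # each (n, d)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])  # (n, n) score matrix
    weights = torch.softmax(scores, dim=-1)                    # each row sums to 1
    return weights @ V                                         # (n, d): weighted mixtures of values

n, d = 5, 16                                            # toy sequence length and model dimension
X = torch.randn(n, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))  # stand-ins for learnt projections
out = self_attention(X, W_Q, W_K, W_V)                 # same length as the input sequence
```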

Self-attention is the workhorse of the Transformer. The encoder of the original 2017 model is built from stacked self-attention blocks; BERT and its descendants use bidirectional self-attention, in which any token can attend to any other; GPT and its descendants use causal self-attention, in which a token at position $t$ may attend only to positions $\le t$ (we discuss the masking trick later in this section). Vision Transformers apply self-attention over flattened image patches; AlphaFold's Evoformer applies it within and across multiple-sequence alignments; speech Transformers apply it over audio frames.

The conceptual point is that self-attention turns a sequence into a graph in which every node is connected to every other, with edge weights computed on the fly from content. There is no fixed locality, no sliding window, no recurrence. Long-range dependencies (agreement across a clause, coreference across a paragraph, the relationship between a question's first and last words) are no harder to model than adjacent ones, at least in principle. This is the property that made the Transformer so consequential in 2017: a flat, parallel architecture with the expressive reach previously available only through deep recurrent networks.

It is worth dwelling on the symmetry of the operation. Because $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ all derive from the same input, every token plays three roles at once: it asks a question of its neighbours through its query, advertises itself to others through its key, and contributes content through its value. The same vector is interrogator and interrogated, and the network learns weights that make the three roles cooperate. Multi-head attention, covered in §13.3, simply runs several such role-plays in parallel with smaller per-head dimensions, so that different heads can specialise in different relations: syntactic agreement, lexical similarity, positional offset, and so on.

The cost is that the attention matrix is $n \times n$ for a sequence of length $n$, which is the source of the quadratic memory and compute we revisit in §13.13. But the wiring itself, three projections of one sequence, is as simple as it sounds.
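
To make the quadratic cost concrete, a back-of-the-envelope estimate of the score matrix alone, assuming float16 storage (two bytes per score) and a single head:

```python
# Memory for one n-by-n attention score matrix per head, in float16.
for n in (1_024, 8_192, 65_536):
    mib = n * n * 2 / 2**20
    print(f"n = {n:>6}: {mib:>8,.0f} MiB")
# Doubling the sequence length quadruples the memory needed for the scores.
```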

Cross-attention

Cross-attention loosens the constraint that all three matrices share an origin. The queries come from one sequence, while the keys and values come from another:

$$\mathbf{Q} = \mathbf{X}_{\text{q}} \mathbf{W}_Q, \qquad \mathbf{K} = \mathbf{X}_{\text{kv}} \mathbf{W}_K, \qquad \mathbf{V} = \mathbf{X}_{\text{kv}} \mathbf{W}_V.$$

Each query asks "in this other sequence, what is relevant to me?". The output has the same length as the query sequence, but its content is drawn from the key–value sequence.
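
The change from the self-attention sketch above is small but telling: the two input sequences can have different lengths, and the output inherits the length of the query side. Names and dimensions below are illustrative:

```python
import math
import torch

def cross_attention(X_q, X_kv, W_Q, W_K, W_V):
    """Queries from one sequence, keys and values from another."""
    Q = X_q @ W_Q                                              # (n_q, d)
    K, V = X_kv @ W_K, X_kv @ W_V                              # (n_kv, d) each
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1])  # (n_q, n_kv)
    weights = torch.softmax(scores, dim=-1)
    return weights @ V                                         # (n_q, d): one output row per query

n_q, n_kv, d = 3, 7, 16                                 # the two sequences differ in length
X_q, X_kv = torch.randn(n_q, d), torch.randn(n_kv, d)
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))
out = cross_attention(X_q, X_kv, W_Q, W_K, W_V)         # shape (3, 16), aligned with the queries
```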

This is the wiring that lets a Transformer condition one modality on another. In image-grounded text encoders such as BLIP's, text queries attend to image keys to produce a representation aligned with visual content. In Flamingo, frozen language tokens cross-attend to vision tokens injected at carefully chosen layers, allowing a large pretrained language model to ground its generation in pictures without retraining its parameters. In Stable Diffusion's U-Net, latent image queries cross-attend to the keys and values produced by a CLIP text encoder, so that each denoising step is steered by the prompt. In speech-to-text models such as Whisper, the decoder cross-attends to encoded audio frames as it produces text tokens.

The same wiring also appears within a single modality. Retrieval-augmented systems cross-attend from a generator's working representation to a bank of retrieved passages. Memory-augmented Transformers cross-attend from the current sequence to an external store. The pattern is the same: queries on one side, keys and values on the other, attention weights as the bridge.

Cross-attention is so flexible because its only requirement is that $\mathbf{X}_{\text{q}}$ and $\mathbf{X}_{\text{kv}}$ produce vectors of the same model dimension after their projections. The two sequences may differ in length, in modality, in language, or in time. They may be batched together, masked separately, and updated at different rates. As long as the inner dimensions match, the dot products go through.

A subtle but useful observation is that in cross-attention the keys and values can be precomputed once and then queried many times. In a translation system, the encoder is run once per source sentence and its key–value tensors are cached; the decoder then issues queries at each generation step against this fixed memory. The same trick underlies efficient inference in retrieval-augmented systems, where document embeddings sit in a vector index and queries arrive at request time. Cross-attention is therefore not just an architectural choice but a place where the asymmetry between expensive context preparation and cheap repeated querying can be exploited.
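
A sketch of that asymmetry, with illustrative names; `H_enc` stands for the cached encoder output, and the caching is written out by hand rather than through any particular library's API:

```python
import math
import torch

d = 16
H_enc = torch.randn(7, d)                               # encoder output, computed once
W_Q, W_K, W_V = (torch.randn(d, d) for _ in range(3))

# Project the encoder output to keys and values a single time, then cache them.
K_cache, V_cache = H_enc @ W_K, H_enc @ W_V

def attend(decoder_states):
    """Issue fresh queries against the fixed key-value memory."""
    Q = decoder_states @ W_Q
    scores = Q @ K_cache.transpose(-2, -1) / math.sqrt(Q.shape[-1])
    return torch.softmax(scores, dim=-1) @ V_cache

# Each generation step projects only the new decoder states to queries;
# the encoder pass and its key-value projections are never repeated.
step_output = attend(torch.randn(1, d))                 # one decoding step, one query row
```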

Encoder–decoder attention

Encoder–decoder attention is a particular cross-attention pattern, named for its role in the original Transformer. The encoder runs self-attention end to end over the source sequence, producing a sequence of contextualised hidden states $\mathbf{H}_{\text{enc}}$. The decoder is itself a stack of blocks, and each block performs three operations in order:

  1. masked self-attention over its own partial output, so that each output position attends only to earlier output positions;
  2. encoder–decoder attention, in which the decoder's hidden states form the queries and the encoder's final hidden states $\mathbf{H}_{\text{enc}}$ form the keys and values;
  3. a position-wise feed-forward network that processes each position independently.

The encoder is run once; its output is reused at every decoder layer. This is what allows the decoder, when generating the $t$-th target token, to consult the entire source sequence freely while remaining causal with respect to its own output. In machine translation, it lets each French word being generated attend to the whole English source. In summarisation, it lets each summary token consult the full document. In speech recognition with a Transformer decoder, it lets each text token consult the full audio.
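
A compact sketch of one such decoder block, built on PyTorch's `nn.MultiheadAttention`. The layer sizes, the post-norm residual placement and the omission of dropout are simplifications rather than a faithful reimplementation of the original:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention, then encoder-decoder attention, then a feed-forward network."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, h_enc):
        t = y.shape[1]
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=y.device), diagonal=1)
        # 1. Masked self-attention over the decoder's own partial output.
        a, _ = self.self_attn(y, y, y, attn_mask=causal)
        y = self.norm1(y + a)
        # 2. Encoder-decoder attention: decoder states are the queries,
        #    the encoder's final hidden states are the keys and values.
        a, _ = self.cross_attn(y, h_enc, h_enc)
        y = self.norm2(y + a)
        # 3. Position-wise feed-forward network.
        return self.norm3(y + self.ffn(y))

block = DecoderBlock()
h_enc = torch.randn(2, 11, 512)      # batch of 2 source sequences, 11 encoder states each
y = torch.randn(2, 6, 512)           # 6 target positions generated so far
out = block(y, h_enc)                # shape (2, 6, 512)
```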

The original 2017 paper, Vaswani and colleagues' "Attention Is All You Need", established this pattern, and modern descendants follow it closely. T5 frames every NLP task as a sequence-to-sequence problem and uses encoder–decoder attention throughout. BART pretrains on denoising and uses the same wiring. Flan-T5, mT5 and ByT5 are direct continuations. Speech models such as Whisper and image-captioning models such as BLIP also adopt the encoder–decoder shape. When you see a diagram with two stacks of blocks and arrows running from the right-hand stack into the middle of the left-hand stack, you are looking at encoder–decoder attention.

The reason the pattern persists is that it cleanly separates the two roles a model must play in conditional generation. The encoder builds a rich, bidirectional representation of the input, free to look anywhere because nothing it sees needs to be predicted. The decoder, constrained to emit one token at a time, treats this representation as a fixed reference work and consults it repeatedly. Decoder-only models such as GPT collapse the two roles into a single causal stack, which is simpler and now more common, but the encoder–decoder split remains attractive whenever the input is large and the output is comparatively short, or whenever the input deserves a different inductive bias from the output.

Self-attention with masking

Masking is what lets self-attention play more than one role. The same projection layers and the same softmax produce very different behaviours depending on which entries of the score matrix are forced to $-\infty$ before the softmax, and therefore to zero in the attention weights. Two masks dominate practice.

The causal mask is a lower-triangular pattern: position $i$ may attend to positions $j \le i$, but not to positions $j > i$. In implementation, the entries above the diagonal of the score matrix are set to $-\infty$ before the softmax, so those entries become zero in the output weights. This is the mask used in GPT-style decoders and in the masked self-attention of any encoder–decoder Transformer's decoder. It is what makes autoregressive language modelling well defined: at training time we can compute all positions' outputs in parallel while guaranteeing that each prediction depends only on its own past.

The padding mask zeros out positions corresponding to padding tokens in batched sequences of unequal length. Without it, a query would happily attend to the meaningless [PAD] slots and pollute its output. The padding mask is broadcast across the attention matrix so that any score whose key is a pad token is set to $-\infty$. Padding masks combine with causal masks by intersection: a position is attended to only if it is both in the past and not padding.
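
A sketch of both masks and their intersection, with illustrative tensor names; the random scores stand in for the scaled dot products:

```python
import torch

n = 5
lengths = torch.tensor([5, 3])                           # two sequences; the second has 2 [PAD] slots

# Causal mask: True where attention is allowed (key position j <= query position i).
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # (n, n)

# Padding mask: True where the key position holds a real token rather than [PAD].
not_pad = torch.arange(n)[None, :] < lengths[:, None]    # (batch, n)

# Combine by intersection, then apply to the raw scores before the softmax.
allowed = causal[None, :, :] & not_pad[:, None, :]       # (batch, n, n)
scores = torch.randn(2, n, n)                            # stand-in for the scaled dot products
weights = torch.softmax(scores.masked_fill(~allowed, float("-inf")), dim=-1)
```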

Other masks appear in specialised settings (block-diagonal masks for packed sequences, sliding-window masks for local attention, prefix masks for prefix-LM training) but causal and padding masks cover the great majority of production systems.

What you should take away

  1. Attention is one operation; the names "self-attention", "cross-attention" and "encoder–decoder attention" describe where $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ come from, not what the operation does.
  2. Self-attention ties all three matrices to one sequence and is the workhorse of BERT, GPT, ViT and most modern Transformers.
  3. Cross-attention splits the queries from the keys and values, allowing one sequence (or modality) to condition on another, as in BLIP, Flamingo, Stable Diffusion and Whisper.
  4. Encoder–decoder attention is a specific cross-attention pattern in which a decoder attends to an encoder's final hidden states; it underpins the original Transformer, T5 and BART.
  5. Causal and padding masks turn the same self-attention block into an autoregressive decoder or a batched bidirectional encoder, simply by deciding which scores to set to $-\infty$ before the softmax.
