13.2 Scaled dot-product attention

Video: Self-attention as Q–K–V dot products. Query, key and value vectors produce an attention matrix over four tokens.
Scaled dot-product attention is the central operation of every modern transformer. Strip away the embeddings, the layer normalisation, the residual connections, the multi-head wrapper and the positional encodings, and what remains at the heart of GPT, BERT, T5, Llama, Claude, Gemini and DeepSeek is one short formula. Master that formula and you have the conceptual key to the whole family. Misunderstand it and every later detail floats free of any anchor.

The intuition, before the algebra, is the intuition of soft retrieval. Imagine a small library. Each book has a label on its spine, a short identifier that tells you what the book is about. Inside each book is content. Now suppose you walk in with a search request. A traditional library system would compare your request to every spine label, find the single best match, and hand you that one book. Soft retrieval is more permissive: it compares your request to every spine label, gives every book a relevance score, normalises those scores into a probability distribution and then hands you a blend of all the books, weighted by relevance. Books whose spines match your request well contribute heavily to the blend; books whose spines do not match contribute almost nothing. You leave with a single output that summarises the library through the lens of your particular question.

In the language of attention, the search request is the query, the spine labels are the keys, and the content of each book is the value. Attention computes a similarity between your query and every key, normalises those similarities into weights, and returns the weighted average of the values. Crucially the weights are differentiable: the whole operation is one big tensor expression, so gradients flow through it during training, which means the network can learn what makes a good query, what makes a good key, and what content is worth storing as a value.

This section formalises that picture. §13.1 motivated attention by showing why recurrent encoders run out of bandwidth on long sequences; §13.3 will stack many independent attention heads in parallel so the model can attend to several relationships at once; §13.4 will distinguish self-attention (where queries, keys and values all come from the same sequence) from cross-attention (where the query stream and the key/value stream are different). Here we do the arithmetic for one head, slowly, with every symbol named, every step justified and a small numerical example carried through to the final answer.

Symbols used here

$\mathbf{Q}$: query matrix, shape $(n, d_k)$, one row per query
$\mathbf{K}$: key matrix, shape $(m, d_k)$, one row per key
$\mathbf{V}$: value matrix, shape $(m, d_v)$, one row per value
$d_k$: query/key dimension
$d_v$: value dimension
$n$: number of queries
$m$: number of keys/values (often $m = n$ in self-attention)
$\sqrt{d_k}$: scaling factor that stabilises the softmax
$\mathrm{softmax}$: applied row-wise, turning each row into a probability distribution

The formula

The whole operation, in one line, is

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right) \mathbf{V}.$$

That single expression is the engine of every transformer in production. It is worth lingering over each piece because the compactness hides several conventions that beginners often trip on.

The inputs are three matrices, $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$. They share a row-major convention: each row of $\mathbf{Q}$ is one query vector, each row of $\mathbf{K}$ is one key vector, each row of $\mathbf{V}$ is one value vector. The number of queries, $n$, can differ from the number of keys, $m$, although in self-attention the two are equal because every token in the sequence acts as both a query and a key/value. The query and key dimension, $d_k$, must match, since you cannot take a dot product between vectors of different lengths; the value dimension, $d_v$, is independent and may be larger or smaller. In the original transformer paper Vaswani and colleagues set $d_k = d_v = 64$ per head, but no law of mathematics forces that choice.

The numerator $\mathbf{Q}\mathbf{K}^\top$ is the matrix of all pairwise dot products. Because $\mathbf{Q}$ is $(n, d_k)$ and $\mathbf{K}^\top$ is $(d_k, m)$, the product is $(n, m)$. Entry $(i, j)$ of this matrix is the dot product between query row $i$ and key row $j$, a scalar that measures how aligned those two vectors are. A large positive entry means query $i$ matches key $j$ well; a large negative entry means they actively disagree; a value near zero means they are roughly orthogonal and tell us nothing.

Dividing by $\sqrt{d_k}$ is the scaling that gives the operation its full name. We will spend a whole subsection below justifying this divisor; for now, treat it as a normalisation that keeps the softmax inputs in a sensible range regardless of how big $d_k$ is.

The softmax converts each row of the scaled score matrix into a probability distribution. Row $i$ becomes a vector of $m$ non-negative numbers that sum to one, the weights with which query $i$ should mix the $m$ values. Crucially the softmax is applied row-wise: each query gets its own independent distribution over keys. There is no normalisation across queries.

Finally we multiply by $\mathbf{V}$, which has shape $(m, d_v)$, to produce an output of shape $(n, d_v)$. Each output row is the weighted sum of value rows under that query's softmax weights. Geometrically, every output is a point inside the convex hull of the value vectors; attention can only interpolate between values, never extrapolate beyond them. This is a useful sanity check: if the values are bounded, the outputs are bounded by the same envelope.

Step by step

It helps to break the formula into five mechanical steps. If you know the shape and meaning of every intermediate tensor, the whole operation stops feeling magical and starts feeling like ordinary matrix algebra.

  1. Compute $\mathbf{Q}\mathbf{K}^\top$. This is a single batched matrix multiplication that produces an $(n, m)$ score matrix $\mathbf{S}$. Entry $S_{ij}$ is the unnormalised similarity between query $i$ and key $j$, defined as the dot product $\mathbf{q}_i \cdot \mathbf{k}_j = \sum_{\ell=1}^{d_k} q_{i\ell} k_{j\ell}$. Higher values indicate stronger alignment. The dot product is a particularly natural choice of similarity here because it is cheap on modern hardware: one matrix multiplication call, fully fusable into a tensor-core kernel, with no transcendental functions.

  2. Scale by $\sqrt{d_k}$. Divide every entry of $\mathbf{S}$ by $\sqrt{d_k}$ to obtain the scaled score matrix $\tilde{\mathbf{S}} = \mathbf{S}/\sqrt{d_k}$. This is a pointwise operation, $O(nm)$ in cost, and it does nothing to the ranking of scores within a row; it just shrinks their absolute magnitudes. The motivation, derived in detail below, is that the dot product of two random $d_k$-dimensional vectors with unit-variance entries has variance $d_k$, which means the raw scores grow with $d_k$ and would saturate the softmax. Dividing by $\sqrt{d_k}$ rescales the variance back to one, so the softmax sees inputs of order one regardless of head size.

  3. Apply softmax along the last dimension. Row $i$ of $\tilde{\mathbf{S}}$ becomes the probability distribution

    $$A_{ij} = \frac{\exp(\tilde{S}_{ij})}{\sum_{\ell=1}^{m} \exp(\tilde{S}_{i\ell})}.$$

    The softmax is a smooth surrogate for the argmax: as the gap between the largest score in a row and the rest grows, the largest weight tends to one and the rest tend to zero. With moderate gaps the row remains a genuine blend over multiple keys. After this step, $\mathbf{A}$ has the same shape as $\tilde{\mathbf{S}}$, namely $(n, m)$, but every row sums to exactly one. We sometimes call $\mathbf{A}$ the attention matrix or attention map and visualise it as a heatmap with queries on the vertical axis and keys on the horizontal.

  4. Multiply by $\mathbf{V}$. The product $\mathbf{O} = \mathbf{A}\mathbf{V}$ has shape $(n, d_v)$. Row $i$ of $\mathbf{O}$ equals $\sum_{j=1}^{m} A_{ij}\, \mathbf{v}_j$, a convex combination of the value vectors with the attention weights as coefficients. Because the weights are non-negative and sum to one, the output is guaranteed to lie inside the convex hull of the values, a useful property when reasoning about stability.

  5. Return the output. $\mathbf{O}$ is the attention layer's contribution: each input query's content-weighted summary of the value bank. Downstream, this output will typically be projected by another learned matrix, added to a residual stream and passed through layer normalisation, but those steps belong to §13.6 on the full transformer block.

Beginners often confuse the role of the three matrices. A useful sanity check: $\mathbf{Q}$ governs what we are looking for, $\mathbf{K}$ governs what each location advertises, $\mathbf{V}$ governs what each location actually delivers. Splitting keys from values lets the model use one set of features for matching and a different set for content, which is more expressive than a single similarity-and-content tensor would be.
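
The five steps above translate directly into a few lines of code. The sketch below is a minimal, unbatched NumPy reference implementation for a single head; the function name, the use of NumPy rather than a deep-learning framework, and the max-subtraction trick inside the softmax are choices made here for illustration, not part of the formula itself.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention. Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> (n, d_v)."""
    d_k = Q.shape[-1]

    # Step 1: all pairwise query-key dot products, shape (n, m).
    scores = Q @ K.T

    # Step 2: scale so the logits stay at order one regardless of d_k.
    scores = scores / np.sqrt(d_k)

    # Step 3: row-wise softmax; subtracting the row maximum is a standard
    # numerical-stability trick and does not change the result.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    A = weights / weights.sum(axis=-1, keepdims=True)   # (n, m), each row sums to 1

    # Steps 4-5: blend the value rows with the attention weights.
    return A @ V                                         # (n, d_v)
```

Note that the queries and keys may differ in number ($n \neq m$); only their dimension $d_k$ must match, exactly as the shape bookkeeping above requires.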

Worked numerical example

The fastest way to internalise the formula is to compute it by hand on a small example. Take $d_k = d_v = 2$, three queries and three keys ($n = m = 3$), with the simple matrices

$$\mathbf{Q} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}, \qquad \mathbf{K} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}, \qquad \mathbf{V} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}.$$

The first two queries point along the standard basis directions; the third is the diagonal. The keys are identical to the queries, which is convenient because every query has at least one perfect match in the key set. The values are the two standard basis vectors followed by the zero vector, a deliberately asymmetric choice, because if the values were also a symmetric pattern the answer would be unilluminating.

Step 1: dot products. Compute $\mathbf{Q}\mathbf{K}^\top$:

$$\mathbf{Q}\mathbf{K}^\top = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{pmatrix}.$$

Read off the structure. Entry $(1, 1) = 1$ because query 1 is identical to key 1. Entry $(1, 2) = 0$ because query 1 is orthogonal to key 2. Entry $(1, 3) = 1$ because query 1 has a unit overlap with the diagonal key. Entry $(3, 3) = 2$ because the diagonal query and the diagonal key align perfectly along both components; this is the largest entry in the matrix, as expected.

Step 2: scale by $\sqrt{d_k} = \sqrt{2} \approx 1.4142$. Dividing pointwise gives

$$\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{2}} \approx \begin{pmatrix} 0.7071 & 0 & 0.7071 \\ 0 & 0.7071 & 0.7071 \\ 0.7071 & 0.7071 & 1.4142 \end{pmatrix}.$$

The relative ordering inside each row is unchanged (scaling never reorders), but every entry has shrunk towards zero by a factor of $\sqrt{2}$.

Step 3: row-wise softmax. For row 1, the scaled scores are $(0.7071, 0, 0.7071)$. Exponentiating: $e^{0.7071} \approx 2.0281$ and $e^0 = 1$. The unnormalised vector is $(2.0281, 1, 2.0281)$ with sum $5.0562$. Dividing through:

$$\text{row 1 weights} \approx (0.4011, 0.1978, 0.4011).$$

By symmetry of the second row of the scaled scores, $(0, 0.7071, 0.7071)$, we obtain

$$\text{row 2 weights} \approx (0.1978, 0.4011, 0.4011).$$

For row 3, the scaled scores are $(0.7071, 0.7071, 1.4142)$. Exponentiating: $e^{0.7071} \approx 2.0281$ (twice) and $e^{1.4142} \approx 4.1133$. The unnormalised vector is $(2.0281, 2.0281, 4.1133)$ with sum $8.1695$. Dividing:

$$\text{row 3 weights} \approx (0.2483, 0.2483, 0.5035).$$

So the attention matrix is

$$\mathbf{A} \approx \begin{pmatrix} 0.4011 & 0.1978 & 0.4011 \\ 0.1978 & 0.4011 & 0.4011 \\ 0.2483 & 0.2483 & 0.5035 \end{pmatrix}.$$

Every row sums to one (within rounding), as it must. Note how query 3, whose vector overlaps both standard-basis keys, distributes more than half its mass on the diagonal key while splitting the rest evenly between keys 1 and 2, exactly what a soft-retrieval system should do when one match is dominant but not exclusive.

Step 4: multiply by $\mathbf{V}$. The output is $\mathbf{A}\mathbf{V}$, shape $(3, 2)$. Compute row by row.

Row 1: $0.4011 \cdot (1, 0) + 0.1978 \cdot (0, 1) + 0.4011 \cdot (0, 0) = (0.4011, 0.1978)$.

Row 2: $0.1978 \cdot (1, 0) + 0.4011 \cdot (0, 1) + 0.4011 \cdot (0, 0) = (0.1978, 0.4011)$.

Row 3: $0.2483 \cdot (1, 0) + 0.2483 \cdot (0, 1) + 0.5035 \cdot (0, 0) = (0.2483, 0.2483)$.

So the final attention output is

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) \approx \begin{pmatrix} 0.4011 & 0.1978 \\ 0.1978 & 0.4011 \\ 0.2483 & 0.2483 \end{pmatrix}.$$

Pause and read this answer. Query 1 retrieves a vector that is mostly aligned with $\mathbf{v}_1 = (1, 0)$ but tilted slightly towards $\mathbf{v}_2 = (0, 1)$ because some of its attention mass landed on the partial-match key. Query 2 is the mirror image: its output sits mostly on the $\mathbf{v}_2$ direction with a little contamination from $\mathbf{v}_1$. Query 3, balanced between both basis directions, retrieves a roughly equal blend of $\mathbf{v}_1$ and $\mathbf{v}_2$, scaled down by the half of its attention mass that landed on the zero-valued third key. The third value being zero acts as a gentle damping: query 3 spreads its attention thinly, so even though it matches three keys it still produces a smaller-norm output than a query that concentrated on a single non-zero value.

Three queries, three keys, three values, a hand calculation that fits on one page. Every transformer in production runs the same five steps, just with thousands of queries instead of three and a thousand-dimensional embedding instead of two.
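
The hand calculation is easy to check mechanically. The snippet below (a sketch in plain NumPy; it repeats the arithmetic rather than reusing the function above so it can run on its own) reproduces the attention matrix and the final output for the same $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$.

```python
import numpy as np

Q = np.array([[1., 0.], [0., 1.], [1., 1.]])
K = np.array([[1., 0.], [0., 1.], [1., 1.]])
V = np.array([[1., 0.], [0., 1.], [0., 0.]])

scores = Q @ K.T / np.sqrt(2)                 # scaled scores, shape (3, 3)
A = np.exp(scores)
A = A / A.sum(axis=-1, keepdims=True)         # row-wise softmax

print(np.round(A, 4))       # approx [[0.4011 0.1978 0.4011]
                            #         [0.1978 0.4011 0.4011]
                            #         [0.2483 0.2483 0.5035]]
print(np.round(A @ V, 4))   # approx [[0.4011 0.1978]
                            #         [0.1978 0.4011]
                            #         [0.2483 0.2483]]
```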

Why scale by $\sqrt{d_k}$?

The scaling factor is the most easily misunderstood part of the formula. The original paper justifies it in a single footnote, which is unfortunate because the argument is short, important and only takes a paragraph to derive properly.

Suppose the components of the query vector $\mathbf{q}$ and the key vector $\mathbf{k}$ are independent random variables, each with mean zero and variance one. This is a reasonable approximation early in training because the projection matrices $\mathbf{W}^Q, \mathbf{W}^K$ are typically initialised with small random Gaussian entries. The dot product is

$$\mathbf{q}^\top \mathbf{k} = \sum_{i=1}^{d_k} q_i k_i.$$

By independence and zero mean, each product term $q_i k_i$ has expectation zero, so the dot product itself has mean zero. Its variance is the sum of the variances of the product terms,

$$\operatorname{Var}(\mathbf{q}^\top \mathbf{k}) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = \sum_{i=1}^{d_k} \mathbb{E}[q_i^2]\, \mathbb{E}[k_i^2] = d_k.$$

So the standard deviation grows like $\sqrt{d_k}$. For $d_k = 64$, a typical per-head dimension, that is a standard deviation of $8$ in the raw scores. For $d_k = 128$ it is more than $11$.

Why does that matter? Because the softmax saturates. If one logit is much larger than the others, the softmax output approaches a one-hot vector: nearly all the probability mass concentrates on the argmax position, and the gradient with respect to every other logit decays exponentially. Concretely, if the largest logit is $z_\text{max}$ and the second-largest is $z_\text{max} - \Delta$, the gradient signal through the runner-up position decays like $e^{-\Delta}$. For $\Delta = 10$ that signal is already $4.5 \times 10^{-5}$, vanishingly small.
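
The collapse is easy to see numerically. The toy sketch below (the `softmax` helper and the particular gap values are choices made for illustration) puts just two logits through a softmax and watches how little weight the runner-up keeps as the gap grows.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Two logits separated by a gap delta: how much weight does the runner-up keep?
for delta in (1.0, 2.0, 5.0, 10.0):
    weights = softmax(np.array([delta, 0.0]))
    print(f"gap = {delta:4.1f}  runner-up weight = {weights[1]:.2e}")

# The runner-up weight falls roughly like exp(-delta); by a gap of 10 it is
# already around 4.5e-5, so almost no gradient flows through that position.
```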

We do not want that during training. We want the softmax to remain sensitive to the relative ordering of all keys, not just the top one. The fix is to keep the softmax inputs at order one regardless of $d_k$. Dividing the dot product by $\sqrt{d_k}$ does exactly that:

$$\operatorname{Var}\!\left(\frac{\mathbf{q}^\top \mathbf{k}}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1.$$

For $d_k = 64$ this means dividing by $8$. Without the divisor the network sees standard deviations of $8$ in the logits, the softmax saturates, gradients vanish through every non-argmax position and training stalls. The early transformer literature has many horror stories of training runs that diverged because someone forgot the scale factor. Frameworks like PyTorch and JAX now bake the scaling into their official attention primitives precisely so that beginners cannot omit it by accident.
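
The whole argument can be verified empirically in a few lines. The simulation below (the sample size and the particular head dimensions are arbitrary choices for this sketch) draws random zero-mean, unit-variance queries and keys and measures the spread of the raw and scaled dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs = 10_000

for d_k in (16, 64, 256):
    # Independent query/key pairs with zero-mean, unit-variance entries.
    q = rng.standard_normal((num_pairs, d_k))
    k = rng.standard_normal((num_pairs, d_k))
    raw = np.sum(q * k, axis=1)              # raw dot products
    scaled = raw / np.sqrt(d_k)              # scaled dot products
    print(f"d_k = {d_k:3d}   std(raw) = {raw.std():6.2f}   "
          f"std(scaled) = {scaled.std():4.2f}")

# Expected pattern: std(raw) grows like sqrt(d_k) (roughly 4, 8, 16),
# while std(scaled) stays close to 1 for every head size.
```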

Connection to soft retrieval

We opened with a library metaphor and the algebra has now made it precise. Treat the keys as labels in a database, the values as their stored content, and the query as the search request. Hard retrieval, the kind a SQL SELECT performs, would compute a similarity, return the top match and stop. Attention does something gentler: it computes similarities to every row, normalises them with softmax, and returns a weighted blend of all values. Nothing is rejected; nothing dominates absolutely; every row contributes in proportion to its match score.

This blend has three properties that make it ideal for deep learning. First, it is fully differentiable: small changes to any query, key or value produce small, computable changes in the output, so gradients propagate cleanly through the operation. A hard top-$k$ selector would be a step function with zero gradients almost everywhere. Second, the weights are shared across the value dimension: the same attention distribution applies uniformly to every column of $\mathbf{V}$, so the mixing step is a single cheap matrix multiplication however wide the values are. Third, the operation is content-addressable in the sense of associative memory: rather than indexing by position, attention indexes by similarity in a learned embedding space, which is much closer to how cognitive scientists describe human memory recall than a hash table is.

Two important consequences flow from the soft-retrieval reading. The first is that attention can only produce outputs inside the convex hull of the values: nothing is invented, only re-weighted. The second is that the expressivity of attention comes from the learned projections that produce $\mathbf{Q}, \mathbf{K}, \mathbf{V}$, not from the attention formula itself. The formula is fixed, parameter-free and embarrassingly simple. The intelligence sits in the projection matrices, which decide what counts as a useful query, a useful key and a useful piece of content. Training an attention layer is, in effect, training a learned soft hash function.

Computational cost

The arithmetic above is cheap on three vectors but expensive on a thousand. Counting operations: $\mathbf{Q}\mathbf{K}^\top$ is a multiplication of an $(n, d_k)$ matrix by a $(d_k, m)$ matrix, costing $O(n m d_k)$ time and producing an $n \times m$ score matrix. The softmax is $O(nm)$, one exponential per entry, plus a normalising sum per row. Multiplying the resulting attention matrix by $\mathbf{V}$ is $O(n m d_v)$. Total time is $O(nm(d_k + d_v))$ and total memory for the attention matrix alone is $O(nm)$.

The crucial observation is that for self-attention, where $n = m =$ sequence length, the cost is quadratic in sequence length. Doubling the sequence quadruples the work and quadruples the memory for the attention matrix. This is the infamous $O(n^2)$ bottleneck of transformer attention, and it is the single largest engineering constraint on long-context language models.

Some numbers to ground the intuition. For a sequence of $n = 1024$ tokens, the context length of GPT-2, the attention matrix has about a million entries, which fits comfortably in GPU memory at half precision. For $n = 8192$ tokens, the matrix has 67 million entries; for $n = 32768$, over a billion; and for $n = 100\,000$, ten billion entries, so even storing the matrix in fp16 costs roughly 20 GB per head, before any other state. Long-context models, and the engineering pressure to extend context windows from 2k to 200k tokens, are largely a story of getting around this wall.
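
Those memory figures follow from nothing more than counting entries, as the back-of-envelope sketch below shows (per head, fp16 at 2 bytes per entry, decimal gigabytes, and ignoring every other activation).

```python
# Memory needed to materialise one fp16 attention matrix of shape (n, n).
BYTES_PER_ENTRY = 2  # fp16

for n in (1_024, 8_192, 32_768, 100_000):
    entries = n * n
    gigabytes = entries * BYTES_PER_ENTRY / 1e9
    print(f"n = {n:7,d}: {entries:>14,d} entries, {gigabytes:8.3f} GB per head")
```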

The wall can be attacked in several directions. FlashAttention (§13.14) reorganises the computation so that the full $n \times n$ matrix is never materialised in high-bandwidth memory at once; instead, blocks of the matrix are computed inside the GPU's much faster on-chip SRAM and discarded after their contribution to the output is accumulated. The arithmetic is exactly the same (every floating-point operation matches the naive implementation), but the memory traffic is dramatically reduced. Linear attention variants (§13.15), including Linformer, Performer, RWKV and Mamba, change the algorithm itself, replacing the softmax-of-dot-products with kernels or state-space updates that scale as $O(n)$ rather than $O(n^2)$, at the cost of some expressivity. Sparse attention (§13.18) restricts each query to attend to a structured subset of keys (a local window, a strided pattern, a few global tokens), bringing the cost down to roughly $O(n \log n)$ or $O(n \sqrt{n})$.

For most of this chapter we work with the dense, exact attention formula. The quadratic cost is real but tolerable up to a few thousand tokens, which covers the overwhelming majority of practical workloads. The workarounds in §13.13–§13.15 are best understood as engineering responses to a specific bottleneck, not as replacements for the underlying soft-retrieval idea, which is identical across all of them.

What you should take away

  1. Attention is soft retrieval. A query is compared to every key by dot product; the resulting similarities are normalised by softmax into weights; the weights mix the values into a single output. Everything else is mechanism.

  2. Three matrices, one formula. $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k}) \mathbf{V}$. Memorise the shapes: $\mathbf{Q}$ is $(n, d_k)$, $\mathbf{K}$ is $(m, d_k)$, $\mathbf{V}$ is $(m, d_v)$, output is $(n, d_v)$.

  3. The $\sqrt{d_k}$ divisor is non-negotiable. Without it, dot products grow with $d_k$, the softmax saturates and gradients vanish. With it, the softmax inputs stay at order one regardless of head size.

  4. Outputs are convex combinations. Each output row is a weighted average of the value rows, so attention can interpolate between values but never extrapolate beyond them, a useful stability property.

  5. The cost is quadratic in sequence length. The $n \times n$ attention matrix is the engineering pain point of every long-context transformer, motivating FlashAttention, linear attention and sparse attention as the chapter progresses.
