13.13 The quadratic wall: attention complexity
The previous section closed with GPT, a stack of decoder-only blocks that, given enough parameters and enough text, learns in context, follows instructions, and writes code. Everything in §13.12 was about what such models can do once they fit on the hardware. This section is about the prior question: how much hardware do they need? The answer, for vanilla Transformers, is governed by a single inequality. Self-attention scales as the square of the sequence length. Double the context, quadruple the cost. That sentence has driven a decade of follow-up research, and it explains why the difference between an 8 000-token chatbot and a million-token reasoning system is not a tweak but a redesign.
The intuition is concrete. To attend to every token from every other token, the model builds an $n \times n$ matrix of query–key scores. For an 8 000-token context, that is sixty-four million scores per head per layer. For a million-token context, which Gemini 1.5 advertises and the latest Claude models now offer, it is one trillion scores per head per layer. In FP16 that single matrix would weigh in at 2 TB before you have stored a single value vector. Materialising it naively is physically impossible on any accelerator that exists today. The wall is real, it is reached quickly, and the next two sections (§13.14 FlashAttention and §13.15 linear / sub-quadratic alternatives) exist because of it. Treat this section as the motivation for everything that follows.
The cost
Self-attention in a single head computes three projections (queries, keys, values), each $n \times d$, and then forms the score matrix $\mathbf{S} = \mathbf{Q}\mathbf{K}^\top$, which is $n \times n$. Softmax along rows turns scores into weights, and the output is $\mathbf{O} = \operatorname{softmax}(\mathbf{S})\mathbf{V}$. The two matrix multiplications dominate: $\mathbf{Q}\mathbf{K}^\top$ multiplies an $n \times d$ matrix by a $d \times n$ one, and the weighted sum multiplies the $n \times n$ weights by the $n \times d$ values; each costs roughly $n^2 d$ floating-point operations. With $h$ heads operating in parallel over $d/h$ channels each, the total per layer is $O(n^2 d)$ in compute and $O(n^2)$ in memory for the score matrix itself. Multiply by $L$ layers and you have the headline number: a forward pass through an $L$-layer Transformer costs $O(L n^2 d)$ flops, with peak memory $O(n^2)$ per layer for the attention scores.
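A minimal NumPy sketch of one naive attention head makes the term concrete: the score matrix `S` has shape $(n, n)$ regardless of the model width, and both dominant matrix multiplications touch every one of its $n^2$ entries. This is an illustration of the cost, not production code.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Exact softmax attention for one head.

    Q, K, V: arrays of shape (n, d).
    Materialises the full (n, n) score matrix, so memory is O(n^2)
    and compute is O(n^2 * d).
    """
    n, d = Q.shape
    S = Q @ K.T / np.sqrt(d)                  # (n, n) scores: the quadratic object
    S = S - S.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)     # row-wise softmax -> attention weights
    return P @ V                              # (n, d) output, another O(n^2 * d) matmul

# Doubling n quadruples the number of scores: 4k tokens -> 16M, 8k -> 64M.
n, d = 4_096, 64
Q, K, V = (np.random.randn(n, d).astype(np.float32) for _ in range(3))
O = naive_attention(Q, K, V)
print(O.shape, f"score matrix holds {n * n / 1e6:.0f}M entries")
```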
Training is worse. Backpropagation through attention requires the score matrix again on the backward pass to compute gradients with respect to $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$. Naive implementations materialise it twice and store the softmax probabilities for the chain rule. A 32-layer model with $n = 32{,}768$ tokens therefore hoards roughly $32 \times 32768^2 \times 2$ bytes, about 70 GB, purely for the score activations of a single attention head on a single example; multiply by the head count for the full bill. This is why long-context training pre-FlashAttention required either drastic gradient checkpointing, model parallelism with expensive all-reduces, or simply giving up and clipping the context.
It is worth pausing on the constants. The big-O notation hides everything we actually pay for: the bytes per element, the number of heads, the number of layers, and the constant factor that comes from the softmax itself (which is read-modify-write rather than a single pass). A quick sanity check: an A100 80GB SXM with ~2.0 TB/s of HBM bandwidth will spend roughly 2 ms simply moving a single $32{,}768 \times 32{,}768$ FP16 score tensor (about 2 GB) in and out of memory once. Multiply that by layer count, by heads, by forward-and-backward, and by the steps in a batch, and you arrive at why pre-FlashAttention long-context training was widely considered unreasonable, not because the maths was hard, but because the bandwidth bill made the schedule untenable. The $O(n^2)$ number on the page is conservative; the reality, with constants, is even more demanding.
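The back-of-envelope arithmetic behind those two numbers fits in a few lines; the bandwidth figure is an assumption taken from the A100 80GB class hardware named above.

```python
# Back-of-envelope check of the constants quoted above.
# Assumed hardware number: ~2.0e12 bytes/s of HBM bandwidth (A100 80GB SXM class).
L_layers   = 32
n          = 32_768
bytes_fp16 = 2

score_bytes = n * n * bytes_fp16                              # one (n, n) FP16 score matrix
act_bytes   = L_layers * score_bytes                          # across layers, per head, per example
print(f"one score matrix   : {score_bytes / 1e9:5.1f} GB")    # ~2 GB
print(f"32 layers, one head: {act_bytes / 1e9:5.1f} GB")      # ~70 GB

hbm_bw = 2.0e12                                               # bytes per second
t_move = 2 * score_bytes / hbm_bw                             # write it once, read it once
print(f"one round trip     : {t_move * 1e3:5.1f} ms")         # ~2 ms, before layers, heads, backward
```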
Why this matters
Three concrete consequences flow from $O(n^2)$. First, every doubling of context costs four times as much. A model engineered comfortably for 4 K windows runs out of memory at 8 K and out of patience at 16 K. The early Transformers of 2017–2019 stayed at 512 tokens for exactly this reason: BERT-base fit, BERT-long did not. Second, latency at long context becomes dominated by attention rather than by the feed-forward sub-layer, even though FFNs typically have more parameters. The FFN is $O(n d^2)$, linear in $n$, so for short contexts it dwarfs attention, but the curves cross around $n = d$, and beyond that the quadratic term wins. For modern frontier models with $d \approx 12{,}000$ and $n$ pushing $10^6$, attention is the bottleneck by two orders of magnitude. Third, training dataset design changes. If you cannot afford long-context training, you cannot expose the model to the kind of multi-document, book-length, codebase-wide structure that long contexts enable in the first place, a chicken-and-egg problem that long-context teams broke only by combining all three escape routes below.
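Before turning to those escape routes, a rough flop count makes the second point, the attention/FFN crossover, concrete. The sketch below assumes a standard 4× FFN expansion and counts only the dominant matmuls; the exact crossover shifts with those constants, but the trend does not.

```python
# Rough per-layer flop counts, assuming a standard 4x FFN expansion.
def attention_flops(n, d):
    return 4 * n * d * d + 2 * n * n * d   # QKV + output projections, plus scores and weighted sum

def ffn_flops(n, d):
    return 8 * n * d * d                   # two matmuls through a 4*d hidden layer

d = 4_096
for n in (512, 4_096, 16_384, 131_072, 1_000_000):
    ratio = attention_flops(n, d) / ffn_flops(n, d)
    print(f"n = {n:>9,}   attention / FFN = {ratio:7.1f}x")
# Short contexts: the FFN dominates.  Once n exceeds a few times d, the n^2 term takes over.
```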
Those three escape routes correspond exactly to §13.14, §13.15, and §13.18:
- Compute the same thing more cleverly. FlashAttention (§13.14) does not change the maths, it computes exact softmax attention, but it never materialises the full $n \times n$ matrix in high-bandwidth memory. By tiling the computation and fusing the kernels, it brings memory down to $O(n)$ and wall-clock time down by a factor of two to four on real GPUs. The compute is still $O(n^2 d)$; the constants and the memory are dramatically better.
- Approximate, gaining a true asymptotic. Linear attention, kernel attention, low-rank attention (§13.15) factor the softmax in ways that make the cost $O(n d^2)$ rather than $O(n^2 d)$. For $n \gg d$ that is a genuine win, at the price of no longer computing exact softmax. Linformer projects keys and values down; Performer uses random feature maps to approximate the softmax kernel; RetNet and RWKV recast attention as a linear recurrence. State-space models such as S4 and Mamba leave attention behind altogether, mixing tokens with long convolutions or selective-scan recurrences in $O(n \log n)$ or $O(n)$. A minimal sketch of the linear-attention factorisation follows this list.
- Attend sparsely. Restrict each token to attend to a band, a stride, a sliding window, or a learned subset (§13.18). Longformer, BigBird, and learned-routing schemes such as the Routing Transformer all live here. These methods preserve a softmax core but reduce the score matrix to $O(n \log n)$ or even $O(n)$ entries.
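Here is the promised sketch of the kernel / linear-attention reordering, using the simple $\operatorname{elu}(x)+1$ feature map of Katharopoulos et al. (2020). It illustrates why the cost drops to $O(n d^2)$; it is not a drop-in replacement for exact softmax attention.

```python
import numpy as np

def feature_map(x):
    # A simple positive feature map, elu(x) + 1, as in Katharopoulos et al. (2020).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelised attention: softmax(Q K^T) V is replaced by
    phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1).

    Reordering the matmuls means no (n, n) matrix is ever formed:
    compute is O(n * d^2) and memory is O(n * d + d^2).
    """
    Qf, Kf = feature_map(Q), feature_map(K)      # (n, d) each
    KV = Kf.T @ V                                # (d, d): a summary of all keys and values
    Z = Qf @ Kf.sum(axis=0)                      # (n,): per-query normaliser
    return (Qf @ KV) / Z[:, None]                # (n, d) output

n, d = 100_000, 64
Q, K, V = (0.1 * np.random.randn(n, d).astype(np.float32) for _ in range(3))
print(linear_attention(Q, K, V).shape)           # runs at n = 100k with no n x n matrix in sight
```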
KV cache for inference
So far we have been discussing the training and prefill cost. Inference, the autoregressive decode loop where the model emits one token at a time, has a different shape, and a different problem. Naively, generating token $t+1$ requires running attention over all $t$ existing tokens, redoing every projection from scratch. That is $O(t^2 d)$ per step and $O(T^3 d)$ over $T$ tokens generated. Nobody does it that way. Instead, every Transformer inference engine keeps a KV cache: the keys and values for every past token at every layer are computed once and stored. To generate the next token, the model projects only the new token's $\mathbf{q}$, attends it against the cached $\mathbf{K}$ and $\mathbf{V}$, and appends the new $\mathbf{k}$ and $\mathbf{v}$ to the cache. The per-step cost drops from $O(t^2 d)$ to $O(t d)$, and the total decode cost from $O(T^3 d)$ to $O(T^2 d)$.
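A minimal single-head decode step with a KV cache, in the same NumPy style as the sketches above; the random inputs stand in for the rest of the model. The point is that each step projects only the newest token and appends exactly one key and one value to the cache.

```python
import numpy as np

def decode_step(x_t, Wq, Wk, Wv, cache_K, cache_V):
    """One autoregressive step of single-head attention with a KV cache.

    x_t: (d,) hidden state of the newest token.
    cache_K, cache_V: lists of (d,) vectors for all previous tokens.
    Only the new token is projected; past keys and values are reused,
    so the step costs O(t * d) instead of O(t^2 * d).
    """
    q, k, v = Wq @ x_t, Wk @ x_t, Wv @ x_t
    cache_K.append(k)
    cache_V.append(v)
    K = np.stack(cache_K)                        # (t, d)
    V = np.stack(cache_V)                        # (t, d)
    s = K @ q / np.sqrt(len(q))                  # (t,) scores against every cached key
    w = np.exp(s - s.max())
    w /= w.sum()                                 # softmax over the t cached positions
    return w @ V                                 # (d,) attention output for this step

d = 64
Wq, Wk, Wv = (np.random.randn(d, d) / np.sqrt(d) for _ in range(3))
cache_K, cache_V = [], []
for t in range(16):                              # toy decode loop
    x_t = np.random.randn(d)                     # stand-in for the current hidden state
    out = decode_step(x_t, Wq, Wk, Wv, cache_K, cache_V)
print(len(cache_K), out.shape)                   # the cache grows by one entry per generated token
```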
The cache pays a price in memory. For each layer, each head, each token, you store a key vector and a value vector. Total bytes are roughly $2 \times L \times n \times d \times \text{bytes per element}$. A 32-layer, $d = 4096$ model decoding 100 K tokens in FP16 holds about 50 GB of cache. This is now the dominant inference memory cost on long-context models, larger than the weights themselves on commodity GPUs, and it is what PagedAttention (§13.19) was invented to manage. Quantising the cache to FP8 or INT4, sharing keys and values across heads (multi-query and grouped-query attention), and offloading to CPU memory are all live techniques. The pattern to remember: training is bottlenecked by the score matrix; inference is bottlenecked by the cache.
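The sizing formula, and the reason grouped-query and multi-query attention help so much, is easy to tabulate; the head counts below are illustrative assumptions, not the configuration of any particular model.

```python
# KV-cache sizing, and the effect of sharing keys/values across heads (GQA / MQA).
def kv_cache_bytes(L, n, d, bytes_per_elt=2, n_heads=32, n_kv_heads=32):
    # Each layer stores one key and one value vector per token, but only for the
    # kv heads that are actually kept; GQA shares each kv head across several query heads.
    return 2 * L * n * (d * n_kv_heads // n_heads) * bytes_per_elt

L, n, d = 32, 100_000, 4_096                     # the example from the text, FP16
print(f"full MHA cache : {kv_cache_bytes(L, n, d) / 1e9:5.1f} GB")                 # ~52 GB
print(f"GQA, 8 kv heads: {kv_cache_bytes(L, n, d, n_kv_heads=8) / 1e9:5.1f} GB")   # 4x smaller
print(f"MQA, 1 kv head : {kv_cache_bytes(L, n, d, n_kv_heads=1) / 1e9:5.1f} GB")   # 32x smaller
```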
There is a second, subtler cost during decode: the prefill, where the model ingests the prompt before generating anything. Prefill is essentially a single forward pass at length $n$, and its cost is the full $O(n^2 d)$ described above. For long prompts this is what users perceive as "time to first token". A 200 K-token Claude prompt, on a single H100, may take tens of seconds to prefill before the first response token streams. Chunked prefill, FlashAttention, and aggressive kernel fusion all attack this bottleneck. The decode phase that follows enjoys the linear-in-$t$ savings of the KV cache, but the prefill is unavoidable, and it scales quadratically. This is a useful reminder that $O(n^2)$ is not abstract, it is exactly what you wait through when you paste a long document into a chat window.
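Chunked prefill is simple to sketch on top of the same single-head machinery: the prompt is processed in blocks, each block attends causally to itself plus everything already cached, and the largest score matrix ever materialised is chunk-sized rather than prompt-sized. The total flops are still $O(n^2 d)$; what improves is peak memory and the ability to interleave other work between chunks. A hedged illustration under those assumptions, not any engine's actual implementation:

```python
import numpy as np

def chunked_prefill(X, Wq, Wk, Wv, chunk=1_024):
    """Prefill a prompt in chunks with a single attention head.

    X: (n, d) prompt hidden states.  Each chunk attends to the cache plus
    itself under a causal mask, so the biggest score matrix is
    (chunk, t_so_far) rather than (n, n).
    """
    d = X.shape[1]
    cache_K, cache_V = np.empty((0, d)), np.empty((0, d))
    outputs = []
    for start in range(0, X.shape[0], chunk):
        Xc = X[start:start + chunk]
        Qc, Kc, Vc = Xc @ Wq.T, Xc @ Wk.T, Xc @ Wv.T
        cache_K = np.concatenate([cache_K, Kc])
        cache_V = np.concatenate([cache_V, Vc])
        S = Qc @ cache_K.T / np.sqrt(d)                        # (chunk, t_so_far)
        # Causal mask: position start+i may only see cached positions <= start+i.
        t = cache_K.shape[0]
        mask = np.arange(t)[None, :] > (start + np.arange(Xc.shape[0]))[:, None]
        S[mask] = -np.inf
        P = np.exp(S - S.max(axis=-1, keepdims=True))
        P /= P.sum(axis=-1, keepdims=True)
        outputs.append(P @ cache_V)
    return np.vstack(outputs)

n, d = 8_192, 64
X = np.random.randn(n, d).astype(np.float32)
Wq, Wk, Wv = (np.random.randn(d, d).astype(np.float32) / np.sqrt(d) for _ in range(3))
print(chunked_prefill(X, Wq, Wk, Wv).shape)                    # (8192, 64), never an 8192 x 8192 matrix
```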
Modern context lengths
Where does this leave production models in 2024–2026?
- GPT-3 (2020), 2 K tokens. Modest by today's standards; sufficient for the tasks of its era.
- GPT-4 (2023), 8 K and 32 K variants at launch; 128 K by GPT-4 Turbo.
- Claude 2 (2023), 100 K. The first widely-available long-context model.
- Claude (2024–2026), 200 K from Claude 3 / 3.5, with Opus 4.6 / 4.7 shipping 1 M context as standard and Sonnet 4.5 / 4.6 also at 1 M.
- Gemini (2024–2026), Gemini 1.5 introduced 1 M tokens; Gemini 2.5 Pro offers 1 M context with 2 M still in preview, and reports near-perfect recall on needle-in-a-haystack long-context retrieval benchmarks.
- Most open-weight models (2024–2026), 8 K to 32 K base, 128 K with RoPE scaling and continued pre-training.
- DeepSeek-V3 / R1 (2024–2025), 128 K with multi-head latent attention compressing the KV cache by an order of magnitude.
None of these were achievable with naive softmax attention on naive hardware. Every one of them blends FlashAttention or its successors, KV-cache compression, and (in Gemini's case especially) speculative architectural changes that almost certainly involve sub-quadratic ingredients. The headline numbers are advertising; the engineering underneath is the story of §13.14 to §13.19.
What you should take away
- Self-attention costs $O(n^2 d)$ in compute and $O(n^2)$ in memory per layer. Doubling the sequence length quadruples the attention cost.
- The crossover where attention overtakes the feed-forward sub-layer occurs around $n \approx d$. Beyond that, attention dominates training and prefill latency.
- Three families of solutions exist: IO-aware exact attention (FlashAttention, §13.14), sub-quadratic approximations and replacements (linear / kernel / state-space, §13.15), and sparse patterns (§13.18). They are complementary, not competing.
- Inference uses a KV cache to avoid recomputing past keys and values; the cache replaces the score matrix as the dominant memory cost and shapes the design of multi-query attention, grouped-query attention, and PagedAttention (§13.19).
- Frontier 2024–2026 context lengths, 200 K to 2 M tokens, are achieved by stacking these techniques. No single trick is sufficient; the quadratic wall is pushed back, not knocked down.