13.19 Inference economics: KV cache, prefill, decode, PagedAttention
Training a model is a one-time cost. Serving a model is the recurring cost. Inference economics dominate frontier-AI economics, and they revolve around a single object: the KV cache.
What the KV cache is
When a Transformer generates token $t$, it computes attention from the new token's query against the keys and values of every previous token. If you recomputed those keys and values from scratch at every step, each step would redo the projections for all previous tokens: $O(n)$ token-forward-passes of work per step and $O(n^2)$ to generate the full sequence, which is exactly what we are trying to avoid.
The KV cache stores the keys and values produced by every previous token, layer by layer, head by head. At step $t$, you only compute $\mathbf{Q}_t$, $\mathbf{K}_t$, $\mathbf{V}_t$ for the new token (one row's worth of work), append $\mathbf{K}_t$ and $\mathbf{V}_t$ to the cache, and run attention against the cached $\mathbf{K}_{1:t}$, $\mathbf{V}_{1:t}$.
Per-token inference cost: $O(t \cdot d)$ for attention, $O(d^2)$ for the projections, roughly $2N$ FLOPs total, as we noted in §13.9.
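To make the bookkeeping concrete, here is a minimal single-layer, single-head sketch in NumPy (the names and shapes are illustrative, not from any particular library): each decode step projects only the new token, appends its key and value to the cache, and attends against the cached history.

```python
import numpy as np

d_model, d_k = 512, 64
rng = np.random.default_rng(0)

# Illustrative projection matrices for one attention head.
W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

K_cache = np.zeros((0, d_k))  # grows by one row per generated token
V_cache = np.zeros((0, d_k))

def decode_step(x_t):
    """One decode step: project only the new token, reuse cached K/V."""
    global K_cache, V_cache
    q_t = x_t @ W_q                            # one row's worth of work
    K_cache = np.vstack([K_cache, x_t @ W_k])  # append K_t
    V_cache = np.vstack([V_cache, x_t @ W_v])  # append V_t
    scores = K_cache @ q_t / np.sqrt(d_k)      # attend over all t cached keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V_cache                   # attention output for token t

# Dummy hidden states stand in for the rest of the model.
for t in range(5):
    out = decode_step(rng.standard_normal(d_model))
print(K_cache.shape)  # (5, 64): one cached key per generated token
```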
KV cache memory
The cache size per token is
$$ 2 \cdot L \cdot h \cdot d_k \cdot \text{bytes-per-float}. $$
For a LLaMA-3-70B-scale model with full multi-head attention (fp16, $L = 80$, $h = 64$, $d_k = 128$): $2 \cdot 80 \cdot 64 \cdot 128 \cdot 2 = 2{,}621{,}440$ bytes $= 2.5$ MB per token. For a 32K-token context, that is 80 GB of KV cache, comparable to the parameters themselves. (LLaMA-3 70B as actually shipped uses grouped-query attention with 8 KV heads, which cuts the cache by 8×; the $h$ in the formula is the number of KV heads.)
This is why inference for long-context models is memory-bound. For each generated token, the GPU must read every byte of the KV cache from HBM to compute attention. At 3 TB/s of HBM bandwidth and 80 GB of cache, that is ~27 ms per token just to read the cache, before any compute. This is the practical floor on long-context generation latency. Multi-head Latent Attention (MLA), introduced in DeepSeek-V2 and carried into V3 and R1, compresses the KV cache by roughly 93 per cent relative to standard multi-head attention, and has emerged as the leading 2024–2026 alternative to GQA.
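A small helper makes it easy to redo this arithmetic for other configurations; the values below reproduce the full-MHA example above (set the KV-head count to 8 for LLaMA-3 70B's GQA), and the bandwidth figure is the assumed 3 TB/s.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_float=2):
    # 2x for keys and values, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_float

def decode_latency_floor_ms(cache_bytes, hbm_bytes_per_s=3e12):
    # Time just to stream the whole cache from HBM once per generated token.
    return cache_bytes / hbm_bytes_per_s * 1e3

per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, d_head=128)
cache_32k = per_token * 32_768

print(f"{per_token / 2**20:.2f} MiB per token")          # 2.50 MiB
print(f"{cache_32k / 2**30:.0f} GiB at a 32K context")   # 80 GiB
print(f"{decode_latency_floor_ms(cache_32k):.0f} ms/token to stream the cache")
# ~29 ms here; ~27 ms if you count 80 GB as decimal gigabytes.
```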
Prefill vs decode
Inference has two phases with very different characteristics:
Prefill: process the entire prompt to produce the first token. The model runs over all $n_\text{prompt}$ tokens in parallel, one big matrix multiply. Compute-bound (the matmul is $\sim 2 N \cdot n_\text{prompt}$ FLOPs), tensor cores at full utilisation, fast in absolute throughput.
Decode: generate one token at a time, appending to the KV cache. Per-token compute is small ($\sim 2 N$ FLOPs), but each step requires a full pass through the model parameters in HBM. Memory-bandwidth bound. Slow per token; tensor cores severely underutilised because the operation is matrix-vector instead of matrix-matrix.
A typical serving rule of thumb: prefill is ~10-100× faster per token than decode. For a chatbot serving 2K-token responses to 1K-token prompts, the bulk of the wall-clock time is decode, not prefill. Optimising decode (better batching, speculative decoding, smaller KV caches via GQA/MQA, KV quantisation) is where the throughput gains live.
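A crude roofline estimate shows where the asymmetry comes from. All numbers below are illustrative assumptions (a 70B dense model, H100-class peak throughput and bandwidth), not measurements, and the sketch ignores attention FLOPs and KV-cache reads.

```python
# Crude roofline: prefill is compute-bound, decode is bandwidth-bound.
N_params        = 70e9    # dense 70B model (assumed)
bytes_per_param = 2       # fp16/bf16 weights
peak_flops      = 1e15    # ~H100-class dense bf16 throughput (assumed)
hbm_bw          = 3e12    # ~H100-class HBM bandwidth, bytes/s (assumed)
batch           = 8       # concurrent sequences sharing each decode step

# Prefill: ~2*N FLOPs per prompt token, run as one large matmul.
prefill_tok_s = peak_flops / (2 * N_params)

# Decode: each step streams all weights from HBM once, but a batch of B
# sequences shares that read, so B tokens come out per step.
step_s = N_params * bytes_per_param / hbm_bw
decode_tok_s = batch / step_s

print(f"prefill ≈ {prefill_tok_s:,.0f} tok/s, decode ≈ {decode_tok_s:,.0f} tok/s "
      f"({prefill_tok_s / decode_tok_s:.0f}x gap)")  # lands in the 10-100x range
```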
Continuous batching
Naive batching: collect $B$ requests, pad them to the same length, run them together. Wasteful: short responses block on long ones. Continuous batching, used in vLLM (Kwon et al., 2023), TensorRT-LLM, and TGI, batches requests at the token level: as soon as one request finishes, a new request joins the batch on the next step. This keeps the GPU utilised even with heterogeneous request lengths, raising throughput by 2–5×.
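A toy simulation illustrates why token-level scheduling helps; the request lengths and slot count are made up, and real schedulers also juggle prefill, KV-cache memory, and priorities.

```python
import random

random.seed(0)
lengths = [random.randint(20, 400) for _ in range(64)]  # output tokens per request
B = 8  # concurrent slots on the GPU

# Static batching: each group of B requests runs until its longest member finishes.
static_steps = sum(max(lengths[i:i + B]) for i in range(0, len(lengths), B))

# Continuous batching: a finished request's slot is refilled on the next step.
pending, slots, cont_steps = list(lengths), [], 0
while pending or slots:
    while pending and len(slots) < B:        # admit new requests into free slots
        slots.append(pending.pop())
    cont_steps += 1                          # one decode step for the whole batch
    slots = [remaining - 1 for remaining in slots if remaining > 1]

print(f"static: {static_steps} steps, continuous: {cont_steps} steps, "
      f"speedup ≈ {static_steps / cont_steps:.2f}x")
```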
PagedAttention
vLLM's PagedAttention (Kwon et al., 2023) borrows virtual-memory paging from operating systems to manage the KV cache. The KV cache is partitioned into fixed-size blocks (e.g. 16 tokens per block); each request has a block table mapping logical positions to physical blocks. This eliminates external fragmentation and limits internal fragmentation to at most one partially filled block per request, allows dynamic growth without copies, and enables prefix caching (multiple requests sharing the KV cache of a common prefix). PagedAttention is the basis of most modern serving systems.
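A stripped-down block table, with hypothetical names, shows the core bookkeeping: logical token positions map to fixed-size physical blocks, and a shared prefix can point at the same blocks via reference counts. Real implementations (vLLM's block manager and its fused paged-attention kernel) are considerably more involved, including copy-on-write for partially filled blocks.

```python
BLOCK_SIZE = 16  # tokens per physical KV block

class BlockManager:
    """Toy allocator: physical KV blocks are handed out on demand."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, blocks):
        """Share an existing prefix: bump refcounts instead of copying."""
        for b in blocks:
            self.refcount[b] += 1
        return list(blocks)

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)

class Sequence:
    """Block table for one request: logical position -> (block, offset)."""
    def __init__(self, manager, shared_prefix=None):
        self.mgr = manager
        # In this sketch only full blocks are shared; real systems use
        # copy-on-write for the partial tail block.
        self.block_table = manager.fork(shared_prefix) if shared_prefix else []
        self.num_tokens = len(self.block_table) * BLOCK_SIZE

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block full: grab a new one
            self.block_table.append(self.mgr.alloc())
        self.num_tokens += 1

    def physical_slot(self, pos):
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

mgr = BlockManager(num_blocks=1024)
a = Sequence(mgr)
for _ in range(32):                               # 32 tokens -> exactly 2 full blocks
    a.append_token()
b = Sequence(mgr, shared_prefix=a.block_table)    # prefix cache: b reuses a's blocks
for _ in range(4):                                # b grows into its own third block
    b.append_token()
print(a.block_table, b.block_table, a.physical_slot(20))
```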
Speculative decoding
Speculative decoding (Leviathan et al., 2023) runs a small draft model to produce $k$ candidate tokens quickly, then runs the large target model in parallel on those $k$ tokens to verify them. Tokens the target model agrees with are accepted; at the first disagreement the target model supplies the corrected token, after which speculation resumes. Since verification is one big-model forward pass over $k$ tokens (compute-bound, tensor cores happy) instead of $k$ separate decode steps (memory-bound, tensor cores idle), speculative decoding is typically 2–3× faster in wall-clock time with no loss in output quality. Anthropic, OpenAI, and the open ecosystem all use it in production.
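The greedy-acceptance variant is easy to sketch. The `draft_model` and `target_model` callables below are stand-ins (each maps a token prefix to its next greedy token); the full lossless scheme replaces exact matching with rejection sampling against the two models' probability distributions, and a real system scores all $k$ drafted positions in one batched target forward pass.

```python
def speculative_decode(target_model, draft_model, prompt, k=4, max_new=32):
    """Greedy speculative decoding with stand-in models (illustrative only)."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft k candidate tokens cheaply with the small model.
        draft = []
        for _ in range(k):
            draft.append(draft_model(tokens + draft))
        # 2) Verify: accept the longest prefix the target agrees with.
        for i in range(k):
            target_tok = target_model(tokens + draft[:i])
            if target_tok != draft[i]:
                tokens.extend(draft[:i] + [target_tok])  # target's correction
                break
        else:
            tokens.extend(draft)                         # all k accepted
    return tokens

# Toy "models" over integer tokens: the draft mostly agrees with the target.
target = lambda toks: (toks[-1] + 1) % 100
draft  = lambda toks: (toks[-1] + 1) % 100 if len(toks) % 7 else 0  # occasional miss
print(speculative_decode(target, draft, prompt=[1, 2, 3]))
```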
Quantisation
Inference-time quantisation compresses the model's weights and/or activations from fp16 down to int8, int4, or even lower. The arithmetic is approximate but, with care, quality loss is minimal. Post-training quantisation methods like GPTQ and AWQ produce 4-bit weights for a 70B model that fit on a single 48 GB GPU and run roughly 2× faster than fp16: decode is still memory-bandwidth bound, the weights are a quarter of the bytes, and dequantisation overhead eats part of the theoretical gain. Quantisation-aware training can do even better.
Quantising the KV cache itself (e.g. to int8) further halves the cache size and, at long contexts where cache reads dominate, roughly halves decode latency. KV-cache quantisation is becoming standard in serving.
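As a minimal illustration of the idea (not GPTQ or AWQ, which choose the quantised values far more carefully), here is plain per-channel symmetric int8 weight quantisation in NumPy, with the resulting reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4096, 4096)).astype(np.float32)  # a toy weight matrix

# Per-output-channel symmetric quantisation: one scale per row.
scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_int8 = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
W_hat = W_int8.astype(np.float32) * scale                  # dequantised weights

rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"bytes: {W.nbytes/1e6:.0f} MB fp32 -> {W_int8.nbytes/1e6:.0f} MB int8, "
      f"relative error {rel_err:.4f}")
```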
A concrete inference cost calculation
Take Claude Sonnet 4.6: public pricing in 2026 is $3 per million input tokens and $15 per million output tokens; Opus 4.6 / 4.7 is $5 per million input and $25 per million output. The 5× ratio reflects the prefill–decode asymmetry: input tokens go through prefill (compute-bound, fast, cheap per token); output tokens go through decode (memory-bound, slow, expensive per token). For a 200B-class model on H100s, decode throughput is roughly 50–100 tokens/sec/GPU at batch size 1, with aggregate throughput growing sub-linearly as the batch grows. The headline economics of AI products (why a chat reply costs roughly half a cent and an image-generation API call a couple of cents) are dominated by these per-token decode costs.
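The per-request arithmetic is straightforward. The prices below are the ones quoted above and the request shape (1K prompt, 2K response) is the chatbot example from earlier; the provider-side line uses an assumed GPU rental rate and an assumed effective batched decode throughput purely for illustration, not disclosed figures.

```python
# Customer-side cost at the quoted prices.
in_price, out_price = 3.0, 15.0              # $ per million tokens (Sonnet-class)
prompt_tokens, output_tokens = 1_000, 2_000
request_cost = prompt_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
print(f"one request ≈ ${request_cost:.3f}")  # a few cents at this request shape

# Provider-side back-of-the-envelope (assumed numbers, for illustration only).
gpu_hour = 2.50              # assumed $ per GPU-hour
eff_decode_tok_s = 1_000     # assumed aggregate tok/s per GPU with batching
cost_per_m_output = gpu_hour / (eff_decode_tok_s * 3600) * 1e6
print(f"provider decode cost ≈ ${cost_per_m_output:.2f} per million output tokens")
```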
Optimising decode is therefore the central problem of frontier-AI commercialisation, and the techniques described in this section (GQA and MLA, KV-cache quantisation, continuous batching, PagedAttention, speculative decoding, MoE) are all attempts to push more useful generated tokens through the same hardware in the same time.