The KV cache is the dominant memory and bandwidth consumer in autoregressive transformer inference. During generation each new token attends to all previous tokens through the attention mechanism, and the keys $K$ and values $V$ for those previous tokens depend only on the tokens already generated, not on the new token. Caching them once and reusing them at every subsequent step turns the per-step attention cost of $O(n^2 d)$ (recomputing keys and values for the whole prefix) into $O(n d)$, where $n$ is the current sequence length and $d$ is the head dimension.
For each layer the cache stores
$$K_\mathrm{cache} \in \mathbb{R}^{n \times h \times d_h}, \qquad V_\mathrm{cache} \in \mathbb{R}^{n \times h \times d_h},$$
where $h$ is the number of heads and $d_h$ the per-head dimension. At generation step $t$ the new token's key $k_t$ and value $v_t$ are appended to the cache, and attention is computed over the full cached sequence:
$$\mathrm{Attn}(q_t, K_{1:t}, V_{1:t}) = \mathrm{softmax}\!\left(\frac{q_t K_{1:t}^\top}{\sqrt{d_h}}\right) V_{1:t}.$$
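A minimal single-head sketch of one cached decode step, following the formula above (NumPy; function and variable names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_step(q_t, k_t, v_t, k_cache, v_cache):
    """One generation step for a single attention head.

    q_t, k_t, v_t: (d_h,) projections of the new token.
    k_cache, v_cache: (t-1, d_h) keys/values of all previous tokens.
    Returns the attention output and the grown caches.
    """
    # Appending the new token's key/value is the only K/V work this step.
    k_cache = np.concatenate([k_cache, k_t[None, :]], axis=0)  # (t, d_h)
    v_cache = np.concatenate([v_cache, v_t[None, :]], axis=0)  # (t, d_h)

    d_h = q_t.shape[-1]
    scores = (k_cache @ q_t) / np.sqrt(d_h)   # (t,)  attention over the full cache
    weights = softmax(scores)                 # (t,)
    out = weights @ v_cache                   # (d_h,)
    return out, k_cache, v_cache
```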
Per layer per token the cache grows by $2 \cdot h \cdot d_h$ values. For a 70B-parameter model with 80 layers and hidden size 8192 stored in BF16, that is roughly $2.5$ MB of cache per token. A single 4096-token context costs $\sim 10$ GB of cache; serving 100 such contexts concurrently costs roughly a terabyte of HBM. This $O(L \cdot n \cdot d)$ scaling, where $L$ is the number of layers, is why inference systems devote enormous engineering effort to managing the cache.
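The arithmetic in this example, as a runnable check (full multi-head attention, 2 bytes per BF16 value):

```python
# Back-of-envelope KV cache sizing for the 70B-class example above.
layers, hidden, bytes_per_val = 80, 8192, 2
per_token = 2 * hidden * bytes_per_val * layers     # K and V, across all layers
print(per_token / 2**20)                            # ~2.5 MiB per token
print(per_token * 4096 / 2**30)                     # ~10 GiB per 4096-token context
print(per_token * 4096 * 100 / 2**40)               # ~1 TiB for 100 such contexts
```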
Several architectural and systems techniques attack this cost:
- Multi-Query Attention (MQA): all query heads share a single key head and a single value head, shrinking the cache by a factor of $h$.
- Grouped-Query Attention (GQA, used by Llama 2 70B+ and Mistral): query heads share keys and values within groups of, say, 8, giving a middle ground between MQA and full multi-head attention (see the sketch after this list).
- Sliding-window attention (Mistral): cap each token's attention range at $w$ tokens, bounding cache size at $O(L \cdot w \cdot d)$ regardless of context length.
- KV cache quantisation: store keys and values in INT8 or FP8 rather than BF16, halving cache size with minimal accuracy loss.
- PagedAttention: manage the cache in fixed-size blocks, the way an OS manages virtual memory pages, so that fragmentation is limited to at most one partially filled block per sequence when serving many sequences.
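A sketch of how a GQA/MQA cache is laid out and used at decode time (NumPy; shapes and names are illustrative, not any particular library's API):

```python
import numpy as np

def gqa_attention(q, k_cache, v_cache):
    """Grouped-query attention over a cached sequence for one new token.

    q:        (h_q, d_h)      queries for the new token
    k_cache:  (t, h_kv, d_h)  cached keys   -- only h_kv heads are stored
    v_cache:  (t, h_kv, d_h)  cached values
    The cache is h_q / h_kv times smaller than with full multi-head attention;
    h_kv = 1 recovers MQA, h_kv = h_q recovers standard MHA.
    """
    h_q, d_h = q.shape
    t, h_kv, _ = k_cache.shape
    group = h_q // h_kv

    # Expand each KV head to cover its group of query heads (a real kernel
    # indexes instead of copying; we materialise it here for clarity).
    k = np.repeat(k_cache, group, axis=1)                     # (t, h_q, d_h)
    v = np.repeat(v_cache, group, axis=1)                     # (t, h_q, d_h)

    scores = np.einsum('hd,thd->ht', q, k) / np.sqrt(d_h)     # (h_q, t)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('ht,thd->hd', w, v)                      # (h_q, d_h)
```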
Cache eviction is necessary when streaming generation exceeds available memory. Strategies include sliding-window eviction (drop the oldest tokens beyond $w$), H2O (Heavy-Hitter Oracle, keep the tokens that have historically attracted the most attention plus a recent window), and StreamingLLM (always keep the first few "attention sink" tokens plus a sliding window, exploiting the empirical observation that early tokens absorb disproportionate attention mass).
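A toy index-selection sketch of the StreamingLLM-style policy just described (the `n_sink` and `window` parameters are illustrative; a sliding-window policy is the special case `n_sink = 0`):

```python
def streaming_keep_indices(cache_len, n_sink=4, window=1024):
    """Return the cache positions to keep: the first n_sink "attention sink"
    tokens plus the most recent `window` tokens; everything else is evicted."""
    if cache_len <= n_sink + window:
        return list(range(cache_len))          # nothing to evict yet
    sinks = list(range(n_sink))
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent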
The KV cache also enables prefix caching: when many requests share an identical system prompt, the cache for those prefix tokens can be computed once and reused across requests. vLLM and other serving stacks implement this with reference-counted block tables, often delivering 5–10× throughput improvements on chat workloads where every conversation starts with the same long preamble.
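A toy sketch of the reference-counting idea, assuming a prefix is keyed by its token sequence and a "block" is an opaque ID; this is illustrative only, not vLLM's actual data structures or API:

```python
from collections import defaultdict

class PrefixBlockCache:
    """Reference-counted mapping from a shared prompt prefix to cache blocks."""

    def __init__(self):
        self.blocks = {}                       # prefix key -> list of block IDs
        self.refcount = defaultdict(int)

    def acquire(self, prefix_tokens, allocate):
        key = tuple(prefix_tokens)
        if key not in self.blocks:
            # The first request with this prefix pays for the prefill once.
            self.blocks[key] = allocate(prefix_tokens)
        self.refcount[key] += 1
        return self.blocks[key]                # later requests reuse these blocks

    def release(self, prefix_tokens):
        key = tuple(prefix_tokens)
        self.refcount[key] -= 1
        if self.refcount[key] == 0:            # last user gone: blocks can be freed
            del self.blocks[key], self.refcount[key]
```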
Related terms: PagedAttention, vLLM, Continuous Batching, Attention Mechanism, FlashAttention, Transformer
Discussed in:
- Chapter 15: Modern AI, Engineering at Scale