PagedAttention (Kwon et al., 2023) is the central innovation of vLLM. It addresses the memory wasted when many concurrent generations of unpredictable length are served from a fixed-size KV cache buffer. The insight is that the same problem (many processes of unknown size sharing one physical memory) was solved decades ago in operating systems by virtual memory paging, and the same solution applies to the attention cache.
The naive approach is to pre-allocate a contiguous buffer of size $L_\mathrm{max}$ per sequence, where $L_\mathrm{max}$ is the maximum context length. If a sequence finishes after 200 tokens but $L_\mathrm{max} = 32{,}768$, the remaining 32,568 token slots of cache are wasted. Worse, internal fragmentation within sequences and external fragmentation between sequences leave large pools of memory unusable even when total free memory is plentiful. Empirically, the vLLM paper showed that pre-allocating serving systems waste 60–80% of KV cache memory on real workloads.
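A rough back-of-the-envelope illustration of the waste, sketched in Python; the model dimensions below are assumptions chosen for the example, not taken from any particular model:

```python
# Memory wasted by contiguous pre-allocation (illustrative numbers only).
n_layers, n_kv_heads, d_head = 32, 8, 128            # assumed model dimensions
bytes_per_token = 2 * n_layers * n_kv_heads * d_head * 2   # K and V, fp16 (2 bytes)
L_max = 32_768          # pre-allocated context length
generated = 200         # tokens the sequence actually produced

allocated = L_max * bytes_per_token
used = generated * bytes_per_token
print(f"per-token KV cache: {bytes_per_token / 1024:.0f} KiB")
print(f"allocated: {allocated / 2**20:.0f} MiB, used: {used / 2**20:.1f} MiB")
print(f"wasted: {1 - generated / L_max:.1%}")        # ~99.4% of this sequence's buffer
```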
PagedAttention divides the KV cache into fixed-size blocks (typically 16 tokens per block), and each sequence is represented by a block table mapping logical block index $\to$ physical block address, directly analogous to a process page table. New blocks are allocated on demand as a sequence grows, drawn from a shared free pool. When a sequence finishes, its blocks return to the pool with no fragmentation. Only the last block of any sequence is partially full, capping waste at one block per sequence rather than $L_\mathrm{max}$ tokens.
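A minimal sketch of the bookkeeping this implies, assuming a shared free pool plus a per-sequence block table; the class and method names are hypothetical and do not mirror vLLM's actual block-manager API:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    """Shared pool of physical KV-cache blocks (a sketch, not vLLM's API)."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("no free KV-cache blocks; preempt or evict a sequence")
        return self.free_blocks.pop()

    def free(self, block: int) -> None:
        self.free_blocks.append(block)

class Sequence:
    """Per-sequence block table: logical block index -> physical block number."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is allocated only when the last one is full,
        # so at most one block per sequence is partially used.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # On completion, every block returns to the shared pool: no fragmentation.
        for block in self.block_table:
            self.allocator.free(block)
        self.block_table.clear()
```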
The attention kernel must be modified to operate on this block-strided layout. Rather than reading $K_{1:t}, V_{1:t}$ from a contiguous tensor, the kernel walks the block table and performs attention over the (non-contiguous) physical blocks:
$$\mathrm{softmax}\!\left(\frac{q_t [K_{b_1}, K_{b_2}, \dots]^\top}{\sqrt{d_h}}\right) [V_{b_1}, V_{b_2}, \dots],$$
where $b_1, b_2, \dots$ are the physical block addresses for the sequence's logical blocks. The indirection overhead is small because attention kernels already fetch keys and values with gather-style memory accesses, so following a block table adds little extra cost.
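A NumPy sketch of what the kernel computes for a single query token: gather keys and values by walking the block table, then apply standard scaled-dot-product attention. Shapes, names, and the block size are illustrative; the real implementation is a fused CUDA kernel.

```python
import numpy as np

BLOCK_SIZE = 16
d_h = 128

def paged_attention_one_query(q_t, k_cache, v_cache, block_table, seq_len):
    """Attention for one query over a non-contiguous, block-strided KV cache.

    q_t:         (d_h,)                        query for the current token
    k_cache:     (num_blocks, BLOCK_SIZE, d_h) physical key blocks
    v_cache:     (num_blocks, BLOCK_SIZE, d_h) physical value blocks
    block_table: physical block numbers for this sequence's logical blocks
    seq_len:     valid tokens (the last block may be partially full)
    """
    # Walk the block table and gather this sequence's keys and values.
    K = np.concatenate([k_cache[b] for b in block_table])[:seq_len]  # (seq_len, d_h)
    V = np.concatenate([v_cache[b] for b in block_table])[:seq_len]

    scores = K @ q_t / np.sqrt(d_h)           # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over the sequence
    return weights @ V                         # (d_h,) attention output
```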
PagedAttention enables several capabilities that contiguous caches cannot:
- Copy-on-write sharing for parallel sampling. When a sequence forks (e.g. beam search with $k$ beams, or generating $n$ samples from one prompt), the prefix blocks are shared by reference and only the divergent suffix blocks are copied. Memory cost scales with the number of unique tokens generated, not with $k$ or $n$ full copies of the prefix (see the sketch after this list).
- Prefix caching. Identical system prompts across requests are stored once, with each request's block table pointing to the shared prefix blocks. A 2048-token system prompt shared across 100 concurrent requests costs the cache of one prompt rather than 100.
- Preemption. When memory pressure spikes, a sequence's blocks can be evicted to host RAM (or recomputed from the original prompt) and restored later, supporting graceful degradation rather than out-of-memory crashes.
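The first two capabilities rest on per-block reference counting. A sketch of how copy-on-write forking could work under that assumption (names and structure are hypothetical, not vLLM's internals): forking copies only the block table, and a block's data is physically duplicated only when a writer no longer owns it exclusively.

```python
class SharedBlockPool:
    """Block pool with per-block reference counts (a sketch, names hypothetical)."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free_blocks.append(block)

def fork(parent_table: list[int], pool: SharedBlockPool) -> list[int]:
    """Fork for parallel sampling: share every prefix block by reference."""
    for block in parent_table:
        pool.share(block)
    return list(parent_table)  # only the table is copied, not the KV data

def write_block(table: list[int], i: int, pool: SharedBlockPool) -> int:
    """Copy-on-write: duplicate a block only if another sequence still shares it."""
    block = table[i]
    if pool.refcount[block] > 1:
        new_block = pool.allocate()
        # ...copy the physical KV data of `block` into `new_block` here...
        pool.release(block)
        table[i] = new_block
    return table[i]
```

Prefix caching follows the same pattern: the blocks holding a shared system prompt keep a reference count equal to the number of requests pointing at them and are freed only when the last request releases them.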
The combination of PagedAttention with continuous batching is what gives vLLM its characteristic 2–4× throughput improvement over naive serving stacks: the cache is densely packed, freed blocks are immediately reusable rather than stranded in oversized buffers, and admission control can pack more concurrent requests onto the same hardware.
Related terms: vLLM, KV Cache, Continuous Batching, Attention Mechanism, FlashAttention
Discussed in:
- Chapter 15: Modern AI, Engineering at Scale