15.20 Inference and serving

We have spent most of this chapter on training. Inference deserves a section of its own: by 2026 the inference stack, the machinery that turns trained weights into responses to user queries, is where a great deal of frontier engineering happens.

The KV cache

When a Transformer generates one token at a time, the keys and values for previously generated tokens are unchanged from one step to the next. Recomputing them would be wasteful, so the KV cache stores them. For a model with $L$ layers, $H$ heads of dimension $d_h$, and sequence length $T$, the KV cache occupies

$$ \text{KV size} = 2 \cdot L \cdot H \cdot d_h \cdot T \cdot \text{bytes-per-value}. $$

For a 70 B model with $L = 80$, $H = 64$, $d_h = 128$ in BF16: $2 \cdot 80 \cdot 64 \cdot 128 \cdot T \cdot 2 = 2.6 \cdot 10^6 \cdot T$ bytes, or 2.6 MB per token. A 32 K context fills 84 GB of KV cache, more than the full model weights at 8-bit quantisation. Grouped-query attention with 8 KV heads instead of 64 cuts this by 8×, and is the main reason GQA has become standard.
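To make the arithmetic concrete, here is a minimal sketch of the calculation (the function name, and reading "32 K" as 32,000 tokens, are our choices):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size: 2 (K and V) x layers x KV heads x head dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# 70 B-class model from the text: L = 80, H = 64, d_h = 128, BF16 (2 bytes).
per_token = kv_cache_bytes(80, 64, 128, 1)        # ~2.6 MB per token
full_mha  = kv_cache_bytes(80, 64, 128, 32_000)   # ~84 GB at 32 K context
gqa_8     = kv_cache_bytes(80, 8, 128, 32_000)    # ~10.5 GB with 8 KV heads

print(f"per token: {per_token / 1e6:.2f} MB")
print(f"32K, MHA : {full_mha / 1e9:.1f} GB")
print(f"32K, GQA8: {gqa_8 / 1e9:.1f} GB")
```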

vLLM and PagedAttention

The KV cache memory layout is a substantial engineering problem. Naive contiguous allocation wastes memory because batched requests have different sequence lengths. PagedAttention (Kwon et al., 2023, the system behind vLLM) borrows the page-table idea from operating systems: KV cache is allocated in fixed-size pages, with a per-request page table mapping logical positions to physical pages. This nearly eliminates fragmentation (waste is bounded by one partially filled page per request) and enables prefix sharing across requests. vLLM throughput is typically 2–4× higher than naive batched inference on the same hardware.
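A toy allocator conveys the idea. This is our illustration of the mechanism, not vLLM's actual block-manager API:

```python
class PagedKVAllocator:
    """Toy page-table allocator in the spirit of vLLM's block manager.

    The KV cache is carved into fixed-size pages; each request holds a
    page table mapping logical token positions to physical pages, so
    sequences of different lengths never pin down contiguous regions.
    """

    def __init__(self, num_pages, page_size=16):
        self.page_size = page_size                  # tokens per page
        self.free = list(range(num_pages))          # physical page ids
        self.tables = {}                            # request id -> page list
        self.lengths = {}                           # request id -> tokens so far

    def append_token(self, rid):
        """Return (page, slot) where the next token's K/V should be written."""
        n = self.lengths.get(rid, 0)
        table = self.tables.setdefault(rid, [])
        if n % self.page_size == 0:                 # current page full: grab a new one
            if not self.free:
                raise MemoryError("KV cache exhausted; preempt a request")
            table.append(self.free.pop())
        self.lengths[rid] = n + 1
        return table[n // self.page_size], n % self.page_size

    def release(self, rid):
        """Request finished: return all of its pages to the free pool."""
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)
```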

Speculative decoding

Generation latency is dominated by sequential dependency: token $t+1$ requires token $t$. Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) breaks this by running a small "draft" model to propose $k$ tokens ahead, then verifying them in a single forward pass of the large model. If $k' \le k$ are accepted, the verification pass also yields one token from the target model, so each iteration emits $k' + 1$ tokens. This trades a few percent more compute for a substantial latency reduction (typically 1.5–3× speed-up at no quality loss). EAGLE (Li et al., 2024) drops the separate draft model, instead attaching a lightweight draft head to the target model's own hidden features and verifying a tree of candidate continuations, achieving 3–5×.
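A schematic of one draft-and-verify iteration, using greedy acceptance for clarity (`draft` and `target` are placeholder callables; production systems use the stochastic acceptance rule of Leviathan et al. (2023), which provably matches the target distribution):

```python
def speculative_step(target, draft, prefix, k=4):
    """One greedy draft-and-verify iteration. Returns the extended prefix."""
    # 1. The draft model proposes k tokens autoregressively (cheap).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. One forward pass of the target scores all k positions at once.
    #    target_argmax[i] is the target's choice after prefix + proposed[:i],
    #    so the call returns k + 1 predictions.
    target_argmax = target(prefix, proposed)

    # 3. Accept the longest matching prefix of the proposals, then take
    #    one token from the target: a correction on mismatch, or a free
    #    bonus token if all k proposals were accepted.
    accepted = []
    for i, t in enumerate(proposed):
        if t != target_argmax[i]:
            break
        accepted.append(t)
    accepted.append(target_argmax[len(accepted)])
    return list(prefix) + accepted
```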

Continuous batching

A request arrives, the model serves a few hundred tokens, the request finishes, the slot is freed, and a new request joins the batch immediately rather than waiting for the whole batch to finish. This "continuous batching" (or "in-flight batching") is what allows production servers to maintain near-100% GPU utilisation under bursty traffic. Without it, batch boundaries become synchronisation barriers.
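The scheduler loop below sketches the idea; `model.step` is an assumed API that decodes one token for every active request and returns those that just finished:

```python
from collections import deque

def serve_loop(model, waiting: deque, max_batch=32):
    """Continuous batching: the batch is re-formed at every decode step."""
    active = []
    while waiting or active:
        # Admit waiting requests into any free slots before each step,
        # rather than only at batch boundaries.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())

        # Assumed API: decode one token per active request, return the
        # set of requests that hit EOS or their length limit this step.
        finished = model.step(active)

        # Retire finished requests immediately; their slots are reused
        # on the very next iteration.
        active = [r for r in active if r not in finished]
```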

Quantisation for serving

Training in BF16, serving in INT8 or INT4 is now standard. Two main approaches:

  • Post-training quantisation (PTQ): GPTQ and AWQ, typically served through fast mixed-precision kernels such as Marlin. Calibrate on a small held-out set and find the quantisation grid that minimises reconstruction error; a round-to-nearest baseline is sketched after this list.
  • Quantisation-aware training (QAT): include quantisation in the training loop. More expensive but yields better quality at very low bits.
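The sketch below is the naive per-channel round-to-nearest baseline that methods such as GPTQ and AWQ improve on by optimising the grid against calibration activations rather than the weights alone (function names are ours):

```python
import numpy as np

def quantize_rtn_int8(W):
    """Round-to-nearest per-output-channel INT8 weight quantisation."""
    # One scale per output row, chosen so the largest weight maps to 127.
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)              # guard against all-zero rows
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_rtn_int8(W)
err = np.abs(W - dequantize(q, s)).mean()
print(f"mean abs reconstruction error: {err:.5f}")
```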

By 2026, INT4 inference with 1–2% degradation is achievable for most frontier models, allowing a 70 B model to run on a single 80 GB GPU with healthy headroom for KV cache.
