Glossary

vLLM

vLLM is an open-source LLM serving system originating from UC Berkeley's Sky Computing Lab (Kwon et al., 2023). It has become the de facto standard for self-hosted LLM inference because it delivers 2–4× the throughput of earlier state-of-the-art serving systems on the same hardware (and an order of magnitude more than naive Hugging Face Transformers serving), while exposing an OpenAI-compatible HTTP API that drops into existing application code unchanged. The technical core is the combination of PagedAttention, continuous batching, and prefix caching, all unified under a single scheduler.

The vLLM architecture has four cooperating layers:

  1. API server, an asynchronous HTTP front-end (FastAPI) speaking the OpenAI Chat Completions and Completions schemas, with streaming support via server-sent events. This makes vLLM a drop-in replacement for the OpenAI API for any client that accepts a base_url argument (see the client sketch after this list).
  2. AsyncLLMEngine, an event loop that admits new requests, pulls completed tokens from the worker, and routes them back to the appropriate streaming response.
  3. Scheduler, which at every generation step decides which sequences to run (subject to memory and batch-size budgets), which to preempt, and which new prefill requests to admit. It implements continuous batching plus chunked prefill in vLLM v1.
  4. Worker(s), the GPU executors running the model itself. Each worker holds a tensor-parallel shard of the model weights and a paged KV cache. Multi-GPU deployments use tensor parallelism within a node and pipeline parallelism across nodes.
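
As a concrete illustration of the first layer, the sketch below points the official OpenAI Python client at a locally running vLLM server. It is a minimal example, not vLLM documentation: the port and model name are assumptions (use whatever model you launched the server with), and the api_key placeholder is required by the client library even though vLLM ignores it by default.

    from openai import OpenAI

    # Point the standard OpenAI client at a local vLLM server.
    # Port and model name are illustrative; api_key just needs to be non-empty.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    stream = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
        stream=True,  # tokens arrive incrementally via server-sent events
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)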

PagedAttention manages the KV cache in fixed-size blocks (16 tokens by default) drawn from a shared physical pool. Each sequence carries a block table mapping its logical positions to physical blocks, so memory waste from variable-length sequences is bounded by one block per sequence rather than by the maximum context length. Prefix caching builds on this by content-hashing blocks: when two requests share a prompt prefix, their block tables point to the same physical blocks, computed once. This is transparent to the application and can deliver gains on the order of 5–10× on chat workloads where a long shared system prompt dominates each request.
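
The block-table idea can be made concrete with a toy sketch. This is illustrative bookkeeping, not vLLM's actual classes: each sequence's logical blocks map to physical block IDs, and blocks are keyed by a hash of their content (and of the prefix before them), so two requests with the same system prompt resolve to the same physical blocks.

    BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default)

    class BlockAllocator:
        """Toy allocator: maps hashed block contents to shared physical block IDs."""
        def __init__(self):
            self.next_id = 0
            self.cache = {}        # (prefix_hash, tokens) -> physical block id
            self.ref_counts = {}   # physical block id -> sequences referencing it

        def get_block(self, prefix_hash, token_block):
            key = (prefix_hash, tuple(token_block))
            if key not in self.cache:          # cache miss: allocate a new block
                self.cache[key] = self.next_id
                self.next_id += 1
            block_id = self.cache[key]         # cache hit: reuse the shared block
            self.ref_counts[block_id] = self.ref_counts.get(block_id, 0) + 1
            return block_id

    def build_block_table(token_ids, allocator):
        """Map a sequence's logical positions to (possibly shared) physical blocks."""
        table, prefix_hash = [], 0
        for i in range(0, len(token_ids), BLOCK_SIZE):
            block = token_ids[i:i + BLOCK_SIZE]
            table.append(allocator.get_block(prefix_hash, block))
            prefix_hash = hash((prefix_hash, tuple(block)))
        return table

    alloc = BlockAllocator()
    shared_prompt = list(range(48))   # three full blocks of shared system prompt
    a = build_block_table(shared_prompt + [100, 101], alloc)
    b = build_block_table(shared_prompt + [200, 201], alloc)
    print(a[:3] == b[:3])   # True: the shared prefix occupies the same physical blocks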

Quantisation support is broad: vLLM ingests models in FP16, BF16, FP8, INT8, INT4 (GPTQ, AWQ, SqueezeLLM, BitsAndBytes), and AutoFP8. Quantised weights compose with PagedAttention transparently, since the cache layout is independent of weight precision.
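
For example, loading a pre-quantised checkpoint is a one-argument change in the offline Python API. The model name below is an assumption (any AWQ-quantised checkpoint on the Hugging Face Hub would do), and the quantization argument must match the format the checkpoint was produced in.

    from vllm import LLM, SamplingParams

    # Load an AWQ-quantised checkpoint; "quantization" must match the
    # checkpoint's format. Model name is illustrative.
    llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")

    outputs = llm.generate(
        ["Summarise PagedAttention in one sentence."],
        SamplingParams(temperature=0.0, max_tokens=64),
    )
    print(outputs[0].outputs[0].text)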

LoRA adapters can be hot-swapped at request time using the multi-LoRA scheduler, which loads multiple adapters into GPU memory and selects the right one per request. This is useful for serving many fine-tuned variants from a single shared base model.
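
A hedged sketch of per-request adapter selection with the offline Python API follows; the adapter name and path are placeholders. (The OpenAI-compatible server exposes the same capability through its --enable-lora and --lora-modules flags.)

    from vllm import LLM, SamplingParams
    from vllm.lora.request import LoRARequest

    # Base model with LoRA support enabled; adapter name and path are placeholders.
    llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", enable_lora=True)
    sql_adapter = LoRARequest("sql-adapter", 1, "/path/to/sql_lora_adapter")

    # lora_request selects the adapter for this request only; concurrent requests
    # can use different adapters (or none) against the same shared base weights.
    outputs = llm.generate(
        ["Write a SQL query listing users created this week."],
        SamplingParams(max_tokens=128),
        lora_request=sql_adapter,
    )
    print(outputs[0].outputs[0].text)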

Speculative decoding is supported for further latency reduction: a small draft model proposes k tokens, the target model verifies all k in one forward pass, and accepted tokens are committed. Reported speedups are around 2–3× when the draft model predicts the target well, though realised gains depend heavily on the acceptance rate.
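
The ceiling on that speedup follows from the standard speculative-decoding analysis: if each draft token is accepted with probability alpha (treated as independent for a rough estimate), the expected number of tokens committed per target-model forward pass is (1 - alpha^(k+1)) / (1 - alpha). A quick back-of-envelope calculation:

    def expected_tokens_per_step(alpha: float, k: int) -> float:
        """Expected tokens committed per target forward pass with k drafted
        tokens, assuming each draft token is accepted independently with
        probability alpha (a simplification; real acceptance is correlated)."""
        return (1 - alpha ** (k + 1)) / (1 - alpha)

    for alpha in (0.6, 0.8, 0.9):
        print(alpha, round(expected_tokens_per_step(alpha, k=5), 2))
    # 0.6 -> 2.38, 0.8 -> 3.69, 0.9 -> 4.69 tokens per target pass: an upper
    # bound on the speedup before draft-model overhead is subtracted.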

The performance characteristics matter for capacity planning. On an H100, vLLM serving Llama-3-70B in FP8 reaches roughly 10,000 output tokens per second at high concurrency, with a time-to-first-token of 100–200 ms and an inter-token latency of 20–40 ms. Throughput is bounded by HBM bandwidth in decode (memory-bound) and by tensor-core throughput in prefill (compute-bound), which is why chunked prefill is the dominant scheduling improvement in v1.
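
The memory-bound decode claim can be sanity-checked with arithmetic: each decode step streams the full weights through HBM once regardless of batch size, so the step rate is roughly bandwidth divided by model size, and aggregate throughput scales with the number of concurrent sequences until compute or KV-cache capacity binds. The figures below are rough assumptions (about 70 GB of FP8 weights, about 3.3 TB/s of H100 HBM bandwidth), not measurements.

    # Back-of-envelope decode ceiling (assumed figures, not measurements)
    weights_gb = 70.0            # Llama-3-70B in FP8, roughly 1 byte per parameter
    hbm_bandwidth_gbs = 3300.0   # H100 SXM HBM3 bandwidth, approximately

    # Every decode step reads all weights once, so steps/s is bandwidth-limited:
    steps_per_second = hbm_bandwidth_gbs / weights_gb   # ~47 steps/s

    # One token per sequence per step, so throughput grows with concurrency
    # until prefill compute or KV-cache memory becomes the limit:
    for batch_size in (1, 64, 256):
        print(batch_size, round(steps_per_second * batch_size), "tokens/s upper bound")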

Competitors include TensorRT-LLM (NVIDIA, faster on NVIDIA hardware, harder to deploy), TGI (Hugging Face, simpler but slower), and LMDeploy (InternLM team, optimised for InternLM models). vLLM's combination of openness, hardware portability, and OpenAI-API compatibility is why it dominates the open-source serving ecosystem.

Related terms: PagedAttention, Continuous Batching, KV Cache, Quantisation, GPTQ, Tensor Parallelism

