Continuous batching (also called iteration-level scheduling or in-flight batching) is an LLM serving technique that schedules requests at the granularity of single generation steps rather than complete sequences. The traditional approach, static batching, collects $B$ requests, pads them to the longest sequence in the batch, runs them all to completion, and only then admits a new batch. This wastes compute on padding and on already-finished sequences that are forced to wait for the slowest member.
Continuous batching, introduced in Orca (Yu et al., 2022) and popularised by vLLM, instead treats every token-generation iteration as an independent scheduling decision. At each step the scheduler asks: which active sequences need to advance one token? Are any new requests waiting? Has any request finished? It then assembles a fresh batch tensor from precisely the sequences that need to run this step. A request that finishes at step $t$ exits the batch before step $t+1$, and its slot is immediately filled by a newly admitted request, which may be in the prefill phase (processing its prompt) or already in decode (generating the next token).
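A minimal sketch of this loop in Python may help; `Request`, `serve`, and `model.step` are illustrative stand-ins, not the API of any particular engine:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: list[int]                      # prompt token ids
    max_new_tokens: int
    output: list[int] = field(default_factory=list)

    def finished(self, eos_id: int = 2) -> bool:
        return len(self.output) >= self.max_new_tokens or (
            len(self.output) > 0 and self.output[-1] == eos_id
        )

def serve(waiting: deque, model, max_batch_size: int = 16) -> None:
    """Iteration-level scheduling: the batch is rebuilt at every generation step."""
    running: list[Request] = []
    while waiting or running:
        # 1. Retire finished sequences so their slots free up immediately.
        running = [r for r in running if not r.finished()]
        # 2. Admit waiting requests into any free slots (they start in prefill).
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        if not running:
            break
        # 3. Advance every running sequence by one token; a hypothetical
        #    model.step prefills requests with empty output and does a single
        #    decode step for the rest.
        next_tokens = model.step(running)
        for req, tok in zip(running, next_tokens):
            req.output.append(tok)
```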
The mathematical effect on throughput is substantial. Define goodput as useful tokens generated per unit time. Static batching's goodput is bounded by the longest sequence in the batch:
$$\text{goodput}_\mathrm{static} = \frac{\sum_{i=1}^B n_i}{\max_i T_i},$$
where $n_i$ is request $i$'s output length and $T_i$ its total generation time. Continuous batching's goodput is
$$\text{goodput}_\mathrm{continuous} = \frac{\sum_i n_i}{\sum_i T_i / B_\mathrm{eff}},$$
where $B_\mathrm{eff}$ is the average concurrent batch size. Taking the ratio with $B_\mathrm{eff} \approx B$, the improvement factor is roughly $\max_i T_i$ divided by the mean $T_i$, and on workloads with heterogeneous sequence lengths it is typically 5–10×.
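A toy calculation with made-up numbers makes the gap concrete:

```python
# Four requests with heterogeneous output lengths; assume a hypothetical fixed
# 20 ms per decode step, so T_i is proportional to n_i.
n = [32, 64, 128, 512]                # output tokens per request
T = [n_i * 0.02 for n_i in n]         # total generation time per request, seconds

B = len(n)
goodput_static = sum(n) / max(T)      # the whole batch waits on the slowest request
B_eff = B                             # simplest case: freed slots are refilled instantly
goodput_continuous = sum(n) / (sum(T) / B_eff)

print(f"static:     {goodput_static:6.0f} tok/s")                 # ~72 tok/s
print(f"continuous: {goodput_continuous:6.0f} tok/s")              # ~200 tok/s
print(f"speed-up:   {goodput_continuous / goodput_static:.1f}x")   # ~2.8x
```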
Continuous batching introduces a complication: prefill and decode have very different compute profiles. Prefill processes all $n_\mathrm{prompt}$ tokens of a new request in parallel, with cost $O(n_\mathrm{prompt}^2)$ for attention, a large compute burst. Decode processes one token per step with cost $O(n_\mathrm{ctx})$ per step, a small but bandwidth-bound operation. Mixing the two in the same forward pass means a long prefill inflates the per-step latency of every decode request in the batch, while the bandwidth-bound decode work in turn drags out the prefill.
Two scheduling strategies address this:
- Disaggregated serving (DistServe, Splitwise): physically separate GPU pools for prefill and decode, with the KV cache transferred over the interconnect. This optimises each phase independently at the cost of cache transfer bandwidth.
- Chunked prefill (Sarathi-Serve, vLLM v1): break a long prompt into chunks of, say, 512 tokens and interleave them with ongoing decode steps. The prefill chunks fill any spare compute capacity left after decode, smoothing the throughput–latency trade-off (see the sketch after this list).
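The token-budget bookkeeping behind chunked prefill can be sketched as follows; the names, the 768-token budget, and the 512-token chunk size are illustrative, not any engine's actual scheduler:

```python
from dataclasses import dataclass

@dataclass
class Seq:
    prompt_len: int         # total prompt tokens
    prefilled_len: int = 0  # prompt tokens already processed

def plan_step(decode_seqs, prefill_queue, token_budget=768, chunk_size=512):
    """Assemble one forward pass under a fixed per-step token budget.

    Each decoding sequence contributes exactly one token; whatever budget is
    left is filled with chunks of waiting prompts, so a long prefill never
    monopolises a step on its own.
    """
    plan = [(seq, 1) for seq in decode_seqs]       # (sequence, tokens this step)
    budget = token_budget - len(decode_seqs)
    for seq in prefill_queue:
        if budget <= 0:
            break
        remaining = seq.prompt_len - seq.prefilled_len
        take = min(remaining, chunk_size, budget)
        if take > 0:
            plan.append((seq, take))
            budget -= take
    return plan

# Example: 100 decoding sequences plus a freshly arrived 2,000-token prompt.
plan = plan_step([Seq(50, 50) for _ in range(100)], [Seq(2000)])
# -> 100 decode entries of 1 token each, plus one 512-token prefill chunk
```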
The throughput–latency trade-off is central: more concurrent requests increase aggregate throughput but also increase each request's per-step latency, since each forward pass is wider. Operators tune the maximum batch size and queue depth to hit target time-to-first-token (TTFT) and inter-token latency (ITL) figures for their service level.
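A deliberately crude cost model shows the shape of this trade-off; the constants below are invented, standing in for the fixed cost of streaming the weights each step plus a per-sequence term for KV-cache reads:

```python
def step_latency_ms(batch_size: int, fixed_ms: float = 8.0,
                    per_seq_ms: float = 0.25) -> float:
    # One decode step: fixed weight-streaming cost + per-sequence KV/compute cost.
    return fixed_ms + per_seq_ms * batch_size

for b in (1, 8, 32, 128):
    itl = step_latency_ms(b)      # inter-token latency seen by each request
    tput = b * 1000.0 / itl       # aggregate tokens generated per second
    print(f"B={b:4d}  ITL={itl:5.1f} ms  throughput={tput:7.0f} tok/s")
```

Widening the batch from 1 to 128 raises aggregate throughput from roughly 120 to 3,200 tokens per second in this toy model, while each request's ITL grows from about 8 ms to 40 ms.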
Continuous batching is now the default in every major production LLM serving stack, including vLLM, TensorRT-LLM, TGI (Text Generation Inference), and LMDeploy. Combined with PagedAttention and prefix caching, it is a major reason LLM inference cost-per-token has fallen by an order of magnitude since 2022.
Related terms: vLLM, PagedAttention, KV Cache, Transformer
Discussed in:
- Chapter 15: Modern AI, Engineering at Scale