Speculative decoding (Leviathan et al., 2023; Chen et al., 2023) accelerates LLM inference by using a small draft model to propose tokens that the large target model verifies in parallel. When the draft proposals are accepted, multiple tokens are generated per forward pass of the target model, dramatically reducing latency.
Algorithm:
Given target model $p$ and draft model $q$ (typically a much smaller model trained to approximate $p$):
1. The draft model autoregressively generates $K$ candidate tokens $x_1', x_2', \ldots, x_K'$ from $q$.
2. The target model $p$ runs one forward pass on the full draft sequence, producing $p(x_t' | x_{\lt t}')$ for each position $t$.
3. For each draft position $t = 1, 2, \ldots, K$:
   a. Accept $x_t'$ with probability $\min(1, p(x_t' | x_{\lt t}') / q(x_t' | x_{\lt t}'))$.
   b. If rejected, sample a replacement from the residual distribution
   $$P_\mathrm{resample}(x) \propto \max(0, p(x | x_{\lt t}') - q(x | x_{\lt t}'))$$
   and stop: the accepted prefix plus this replacement become the next chunk.
4. If all $K$ drafts are accepted, sample one additional token from $p$ at position $K+1$. Repeat with the now-extended context.
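A minimal sketch of one round of this loop, assuming hypothetical `target` and `draft` callables that map a 1-D token tensor to a matrix of next-token probabilities (one row per position, i.e. softmaxed causal-LM logits):

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ctx: torch.Tensor, K: int) -> list[int]:
    """One round of speculative decoding (sketch under the assumptions above)."""
    # 1. Draft model proposes K tokens autoregressively.
    drafted, q_dists = [], []
    seq = ctx.clone()
    for _ in range(K):
        q = draft(seq)[-1]                        # q(. | seq)
        tok = torch.multinomial(q, 1)
        drafted.append(int(tok))
        q_dists.append(q)
        seq = torch.cat([seq, tok])

    # 2. Target scores the context plus all K draft tokens in a single pass.
    p_dists = target(seq)
    out = []
    for t, tok in enumerate(drafted):
        p = p_dists[len(ctx) - 1 + t]             # p(. | ctx + drafted[:t])
        q = q_dists[t]
        # 3a. Accept x_t' with probability min(1, p/q).
        if torch.rand(()) < min(1.0, float(p[tok] / q[tok])):
            out.append(tok)
        else:
            # 3b. On rejection, resample from the residual max(0, p - q) and stop.
            residual = torch.clamp(p - q, min=0.0)
            out.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return out
    # 4. All K drafts accepted: sample one bonus token from the target.
    out.append(int(torch.multinomial(p_dists[-1], 1)))
    return out
```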
Provable correctness: under this protocol, the distribution of emitted tokens (accepted drafts and resampled replacements alike) is exactly $p$, the target model's distribution. Speculative decoding gives mathematically identical samples to standard target-model decoding, with no quality loss.
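A one-line check: fix a position $t$ and suppress the conditioning on $x_{\lt t}'$. The probability that the protocol emits a particular token $x$ there is
$$q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right) + \left(1 - \sum_{x'} \min(p(x'), q(x'))\right) P_\mathrm{resample}(x) = \min(p(x), q(x)) + \max(0,\, p(x) - q(x)) = p(x),$$
since the rejection probability $1 - \sum_{x'} \min(p(x'), q(x'))$ equals the normaliser of $P_\mathrm{resample}$, namely $\sum_{x'} \max(0, p(x') - q(x'))$.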
Speedup: depends on the draft-target agreement rate $\alpha$, the probability that a draft token is accepted. Under the simplifying assumption that acceptances are independent, the expected number of tokens generated per target-model forward pass is $(1 - \alpha^{K+1}) / (1 - \alpha)$. For typical values $\alpha \approx 0.7$ and $K = 5$ this is roughly 3 tokens per pass, i.e. about a 3× reduction in target-model forward passes before accounting for draft-model overhead.
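A quick numerical check of that formula (the function name is illustrative):

```python
def expected_tokens_per_target_pass(alpha: float, K: int) -> float:
    """E[tokens emitted per target forward pass], assuming each draft token
    is accepted independently with probability alpha (geometric series)."""
    return (1 - alpha ** (K + 1)) / (1 - alpha)

print(expected_tokens_per_target_pass(0.7, 5))   # ~2.94
print(expected_tokens_per_target_pass(0.8, 5))   # ~3.69
```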
Draft model choices:
- A smaller version of the target architecture, e.g. a 7B model drafting for a 70B target: the standard setup.
- A tiny n-gram model or a lookup cache built from the prompt: simple and sometimes effective.
- Self-speculation (Medusa, EAGLE): lightweight heads on the target model's own hidden states predict multiple future tokens in parallel, so no separate draft model is needed.
Tree-based variants (Medusa, SpecInfer): draft model proposes a tree of candidate continuations rather than a single sequence; target model verifies many paths in parallel via shared attention computation.
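A minimal sketch of the shared-attention trick, assuming candidates are encoded by a parent-index array (an illustrative representation, not any particular library's API): each node may attend only to itself and its ancestors, so every root-to-leaf path is verified in one target pass while shared prefixes are computed once.

```python
import torch

def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Boolean attention mask for tree-structured draft verification.

    parents[i] is the index of draft node i's parent within the tree,
    or -1 if node i attaches directly to the committed context.
    Entry [i, j] is True iff node i may attend to node j.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask

# Two candidate continuations that share the first draft token:
#   path A: node 0 -> node 1
#   path B: node 0 -> node 2 -> node 3
print(tree_attention_mask([-1, 0, 0, 2]).int())
```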
Production deployment: vLLM, TGI, TensorRT-LLM, SGLang all support speculative decoding. The technique is now standard in commercial LLM serving for latency-sensitive applications.
Limitations:
- Requires a draft model whose distribution is reasonably close to the target's; otherwise the rejection rate is high and the overhead exceeds the savings.
- Memory overhead: both models must be in GPU memory.
- Diminishing returns at very high $K$: the chance that every earlier draft is accepted shrinks geometrically, so extra draft tokens are rarely used yet still cost draft-model compute.
Speculative decoding is one of the cleanest examples of inference-time optimisation that gives speedups without quality trade-offs.
Related terms: Language Model, Bayesian Inference
Discussed in:
- Chapter 15: Modern AI