Also known as: nucleus sampling
Top-$p$ sampling (also called nucleus sampling; Holtzman et al., 2019) is an adaptive alternative to top-$k$ sampling. At each decoding step:
1. Sort the vocabulary by predicted probability, descending.
2. Find the smallest set $\mathcal{T}_p$ whose cumulative probability mass meets or exceeds $p$:
   $$\sum_{w \in \mathcal{T}_p} P(w) \geq p$$
   This set $\mathcal{T}_p$ is the nucleus.
3. Restrict the distribution to the nucleus, renormalise, and sample from it.
Typical $p \in [0.85, 0.95]$.
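A minimal NumPy sketch of these steps (the function name `top_p_sample` and its signature are illustrative, not any particular library's API):

```python
import numpy as np

def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token id with nucleus (top-p) sampling.

    Illustrative sketch; production servers run the same logic as
    batched GPU kernels, but the steps are identical.
    """
    rng = rng or np.random.default_rng()

    # Temperature-scaled softmax over the full vocabulary.
    scaled = logits / temperature
    scaled -= np.max(scaled)            # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()

    # Sort descending and find the smallest prefix whose mass >= p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, p)) + 1

    # Restrict to the nucleus, renormalise, and sample.
    nucleus_ids = order[:nucleus_size]
    nucleus_probs = probs[nucleus_ids]
    nucleus_probs /= nucleus_probs.sum()
    return int(rng.choice(nucleus_ids, p=nucleus_probs))
```

For example, `top_p_sample(logits, p=0.9, temperature=0.8)` samples from the nucleus of the temperature-sharpened distribution.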
Why nucleus is preferable to top-$k$:
The shape of the next-token probability distribution varies with context. After "the United States of", the model assigns nearly all probability to "America", with a long, irrelevant tail. After "Write a haiku about", the distribution is much flatter: many continuations are valid.
Top-$k$ with fixed $k$:
- At "United States of": $k = 40$ includes 39 nonsensical alternatives.
- At "Write a haiku about": $k = 40$ may exclude valid creative options.
Top-$p$ adapts (illustrated in the sketch after this list):
- At "United States of": nucleus is just {"America"}, accurate, focused.
- At "Write a haiku about": nucleus is large, allowing creative variety.
Holtzman et al.'s evaluation showed that nucleus sampling produces text closer to human writing in vocabulary usage, n-gram statistics, and perplexity than top-$k$ sampling, beam search, or pure sampling.
Combined with temperature: $\mathrm{softmax}(z / T)$ first, then nucleus filter. Modern OpenAI and Anthropic APIs typically apply temperature, then top-$p$, then sample.
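For example, with the hypothetical `top_p_sample` sketch above, temperature rescales the logits before the nucleus is computed, so a higher temperature flattens the distribution and enlarges the nucleus:

```python
logits = np.array([5.0, 3.0, 2.5, 0.1, -1.0])

cold = top_p_sample(logits, p=0.9, temperature=0.7)  # sharper distribution, smaller nucleus
hot = top_p_sample(logits, p=0.9, temperature=1.5)   # flatter distribution, larger nucleus
```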
Combined with top-$k$: take the intersection of the top-$k$ and top-$p$ candidate sets. This is a common safety net: top-$k$ caps the maximum candidate-set size, while top-$p$ adapts within that cap.
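A minimal sketch of that combination (a hypothetical helper, not any library's API). Because both sets are prefixes of the same descending sort, their intersection is simply the shorter prefix:

```python
def top_k_top_p_candidates(probs, k=40, p=0.9):
    """Token ids in both the top-k set and the top-p nucleus.

    `probs` is a 1-D array of (already normalised) next-token probabilities.
    """
    order = np.argsort(-probs)                       # descending by probability
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, p)) + 1
    return order[:min(k, nucleus_size)]              # shorter prefix = intersection
```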
Why "nucleus": the chosen set is the high-probability "core" or "nucleus" of the distribution, excluding the tail.
In production LLM serving, nucleus sampling is the default strategy. OpenAI's top_p parameter, Anthropic's top_p, and vLLM's top-p sampling all implement it.
Related terms: Top-k Sampling, Language Model, Temperature (sampling)
Discussed in:
- Chapter 12: Sequence Models