Also known as: nucleus sampling
Top-$p$ sampling (also called nucleus sampling; Holtzman et al., 2019) is an adaptive alternative to top-$k$ sampling. At each decoding step:
1. Sort the vocabulary by predicted probability, descending.
2. Find the smallest set $\mathcal{T}_p$ whose cumulative probability mass meets or exceeds $p$:
   $$\sum_{w \in \mathcal{T}_p} P(w) \geq p$$
   This set $\mathcal{T}_p$ is the nucleus.
3. Restrict the distribution to the nucleus, renormalise, and sample from it.
Typical $p \in [0.85, 0.95]$.
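A minimal NumPy sketch of these steps (the function name `top_p_sample` and its signature are illustrative, not any particular library's API):

```python
import numpy as np

def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token id with nucleus (top-p) sampling.

    Illustrative sketch; production servers run the same logic as
    batched GPU kernels, but the steps are identical.
    """
    rng = rng or np.random.default_rng()

    # Temperature-scaled softmax over the full vocabulary.
    scaled = logits / temperature
    scaled -= np.max(scaled)            # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()

    # Sort descending and find the smallest prefix whose mass >= p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, p)) + 1

    # Restrict to the nucleus, renormalise, and sample.
    nucleus_ids = order[:nucleus_size]
    nucleus_probs = probs[nucleus_ids]
    nucleus_probs /= nucleus_probs.sum()
    return int(rng.choice(nucleus_ids, p=nucleus_probs))
```

For example, `top_p_sample(logits, p=0.9, temperature=0.8)` samples from the nucleus of the temperature-sharpened distribution.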
Why nucleus is preferable to top-$k$:
The shape of the next-token probability distribution varies with context. After "the United States of", the model assigns nearly all probability to "America", with a long, irrelevant tail. After "Write a haiku about", the distribution is much flatter: many continuations are valid.
Top-$k$ with fixed $k$:
- At "United States of": $k = 40$ includes 39 nonsensical alternatives.
- At "Write a haiku about": $k = 40$ may exclude valid creative options.
Top-$p$ adapts (illustrated in the sketch after this list):
- At "United States of": nucleus is just {"America"}, accurate, focused.
- At "Write a haiku about": nucleus is large, allowing creative variety.
Holtzman et al.'s evaluation showed that nucleus sampling produces text closer to human writing in vocabulary usage, n-gram statistics, and perplexity than top-$k$ sampling, beam search, or pure sampling.
Combined with temperature: $\mathrm{softmax}(z / T)$ first, then nucleus filter. Modern OpenAI and Anthropic APIs typically apply temperature, then top-$p$, then sample.
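For example, with the hypothetical `top_p_sample` sketch above, temperature rescales the logits before the nucleus is computed, so a higher temperature flattens the distribution and enlarges the nucleus:

```python
logits = np.array([5.0, 3.0, 2.5, 0.1, -1.0])

cold = top_p_sample(logits, p=0.9, temperature=0.7)  # sharper distribution, smaller nucleus
hot = top_p_sample(logits, p=0.9, temperature=1.5)   # flatter distribution, larger nucleus
```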
Combined with top-$k$: take the intersection of the top-$k$ and top-$p$ candidate sets. This is a common safety net: top-$k$ caps the maximum candidate-set size, while top-$p$ adapts within that cap.
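A minimal sketch of that combination (a hypothetical helper, not any library's API). Because both sets are prefixes of the same descending sort, their intersection is simply the shorter prefix:

```python
def top_k_top_p_candidates(probs, k=40, p=0.9):
    """Token ids in both the top-k set and the top-p nucleus.

    `probs` is a 1-D array of (already normalised) next-token probabilities.
    """
    order = np.argsort(-probs)                       # descending by probability
    cumulative = np.cumsum(probs[order])
    nucleus_size = int(np.searchsorted(cumulative, p)) + 1
    return order[:min(k, nucleus_size)]              # shorter prefix = intersection
```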
Why "nucleus": the chosen set is the high-probability "core" or "nucleus" of the distribution, excluding the tail.
In production LLM serving, nucleus sampling is the default strategy. OpenAI's top_p parameter, Anthropic's top_p, and vLLM's top-p sampling all implement it.
Related terms: Top-k Sampling, Language Model, Temperature (sampling)
Discussed in:
- Chapter 12: Sequence Models