Glossary

Top-k Sampling

Top-$k$ sampling for autoregressive language models: at each generation step, restrict the candidate set to the $k$ most probable tokens, renormalise their probabilities to sum to 1, and sample.

Algorithm: at step $t$ with logits $z = (z_1, \ldots, z_V)$:

  1. Find the indices $\mathcal{T}_k$ of the $k$ largest entries.
  2. Form modified logits $\tilde z_i = z_i$ for $i \in \mathcal{T}_k$, $\tilde z_i = -\infty$ otherwise.
  3. Sample $x_{t+1} \sim \mathrm{softmax}(\tilde z / T)$ where $T$ is temperature.
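The three steps above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation, not from any particular library; the function name and the use of `np.argpartition` are choices made here for the sketch.

```python
import numpy as np

def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample one token index via top-k sampling (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    # 1. Indices of the k largest logits (argpartition avoids a full sort).
    top = np.argpartition(logits, -k)[-k:]
    # 2. Modified logits: keep the top-k entries, set the rest to -inf.
    masked = np.full_like(logits, -np.inf)
    masked[top] = logits[top]
    # 3. Temperature-scaled softmax over the truncated set, then sample.
    scaled = masked / temperature
    scaled -= scaled.max()          # shift for numerical stability
    probs = np.exp(scaled)          # exp(-inf) = 0, so excluded tokens get 0
    probs /= probs.sum()            # renormalise to sum to 1
    return rng.choice(len(logits), p=probs)
```

Because the masked logits are `-inf` outside the top-$k$ set, the softmax assigns those tokens exactly zero probability, so renormalisation happens automatically.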

Strengths:

  • Eliminates the long tail of unlikely tokens. The model often assigns small but non-trivial probability to many irrelevant tokens; top-$k$ truncates these.
  • More natural-sounding output than greedy or beam search; more focused than pure sampling.

Weaknesses:

  • Fixed $k$ doesn't adapt to context. For low-entropy predictions (after "The United", the next token is almost certainly "States"), $k = 50$ wastes probability mass on implausible alternatives. For high-entropy predictions (the opening of a creative story), $k = 50$ may exclude valid options.
  • This motivates top-$p$ (nucleus) sampling.

Typical values: $k = 40$ to $100$ for general-purpose LLM generation.

Combined with temperature: apply top-$k$ truncation first, then a temperature-scaled softmax over the retained tokens. This combination is common in production deployments.

Top-$k$ filtering for inference efficiency: even when not needed for sampling quality, keeping only the top $k$ logits per step reduces the cost of the softmax and the sampling draw, which matters for very large vocabularies.
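A sketch of that efficiency idea, assuming NumPy: exponentiate and normalise only the $k$ selected logits rather than all $V$, after an $O(V)$ selection with `np.argpartition` (no full sort). The function name and return convention are illustrative assumptions.

```python
import numpy as np

def top_k_probs(logits, k):
    """Return (indices, probabilities) over only the k largest logits.

    Tokens outside the top-k implicitly have probability zero, so the
    exponentiation and normalisation touch k values instead of all V.
    """
    logits = np.asarray(logits, dtype=np.float64)
    idx = np.argpartition(logits, -k)[-k:]   # O(V) selection, no full sort
    z = logits[idx] - logits[idx].max()      # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()                             # renormalise over k entries only
    return idx, p
```

The renormalised probabilities over the truncated set are identical to what a full-vocabulary softmax would give after zeroing and renormalising, since the shared normalising constant cancels.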

Related terms: Softmax, Language Model, Top-p (Nucleus) Sampling, Temperature (sampling)

Discussed in:

Contact: Chris Paton


AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).