Top-$k$ sampling for autoregressive language models: at each generation step, restrict the candidate set to the $k$ most probable tokens, renormalise their probabilities to sum to 1, and sample.
Algorithm: at step $t$, given logits $z = (z_1, \ldots, z_V)$ over a vocabulary of size $V$:
- Find the indices $\mathcal{T}_k$ of the $k$ largest entries.
- Form modified logits $\tilde z_i = z_i$ for $i \in \mathcal{T}_k$, $\tilde z_i = -\infty$ otherwise.
- Sample $x_{t+1} \sim \mathrm{softmax}(\tilde z / T)$ where $T$ is temperature.
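A minimal NumPy sketch of the three steps above (the function name `top_k_sample`, the 32k vocabulary size, and the random logits are illustrative assumptions, not from the source):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, temperature: float = 1.0,
                 rng: np.random.Generator | None = None) -> int:
    """Sample a token id from the top-k truncated, temperature-scaled softmax."""
    if rng is None:
        rng = np.random.default_rng()
    # Step 1: indices of the k largest logits (argpartition is O(V), no full sort).
    top_idx = np.argpartition(logits, -k)[-k:]
    # Steps 2-3: temperature-scaled softmax over the survivors only; the other
    # V - k tokens are implicitly at -inf. Subtracting the max keeps exp stable.
    scaled = logits[top_idx] / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Sample within the truncated set and map back to a vocabulary index.
    return int(rng.choice(top_idx, p=probs))

# Example: one decoding step over a hypothetical 32k-token vocabulary.
logits = np.random.default_rng(0).normal(size=32_000)
token = top_k_sample(logits, k=50, temperature=0.8)
```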
Strengths:
- Eliminates the long tail of unlikely tokens. The model often spreads small per-token but cumulatively non-trivial probability mass across many implausible tokens; top-$k$ removes this tail from consideration.
- More natural-sounding output than greedy or beam search; more focused than pure sampling.
Weaknesses:
- Fixed $k$ doesn't adapt to context. For high-confidence steps (e.g., after "The United", nearly all mass falls on "States"), $k = 50$ wastes slots on implausible alternatives. For high-entropy steps (e.g., the first word of a creative story), $k = 50$ may exclude many equally valid options; see the numeric sketch after this list.
- This motivates top-$p$ (nucleus) sampling, which truncates at a fixed cumulative probability mass rather than a fixed token count.
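A small numeric illustration of this non-adaptivity (all logit values and token counts below are synthetic, chosen only to make the contrast visible):

```python
import numpy as np

V, k = 32_000, 50

# Peaked (high-confidence) step: one token dominates.
peaked = np.full(V, -10.0)
peaked[0] = 10.0
# Flat (high-entropy) step: ~500 near-equally plausible tokens.
flat = np.zeros(V)
flat[:500] = 6.0

for name, z in (("peaked", peaked), ("flat", flat)):
    p = np.exp(z - z.max())
    p /= p.sum()
    top = np.sort(p)[::-1][:k]  # the k largest renormalisation candidates
    print(f"{name}: top-{k} mass = {top.sum():.3f}, "
          f"mass in slots 2..{k} = {top[1:].sum():.2e}")
```

On the peaked step, 49 of the 50 kept slots carry roughly $10^{-7}$ of the total mass; on the flat step, the 50 kept tokens cover under 10% of the mass spread over the ~500 comparably plausible options.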
Typical values: $k = 40$ to $100$ for general-purpose LLM generation.
Combined with temperature: top-$k$ truncation first, then the temperature-scaled softmax over the truncated set, as in the sketch above. Because dividing by $T > 0$ preserves the ordering of the logits, applying temperature before or after truncation selects the same $k$ tokens and yields the same renormalised distribution. Common in production deployments.
Top-$k$ filtering for inference efficiency: even when not needed for sampling quality, keeping only the top $k$ logits per step means the exponentiation and normalisation run over $k$ values instead of $V$, which matters for very large vocabularies; a sketch follows.
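A sketch of that pattern in PyTorch (an assumed dependency; `topk_probs` is a hypothetical helper name): `torch.topk` does one selection pass over the $V$ logits, and the softmax then exponentiates only $k$ values per row.

```python
import torch

def topk_probs(logits: torch.Tensor, k: int, temperature: float = 1.0):
    """Renormalised top-k distribution without exponentiating all V logits."""
    vals, idx = torch.topk(logits, k, dim=-1)          # one selection pass over V
    probs = torch.softmax(vals / temperature, dim=-1)  # softmax over k values only
    return probs, idx

# One batched decoding step: sample inside the truncated set,
# then map the sampled positions back to vocabulary ids.
logits = torch.randn(4, 32_000)                        # (batch, vocab)
probs, idx = topk_probs(logits, k=50, temperature=0.8)
next_ids = idx.gather(-1, torch.multinomial(probs, num_samples=1))
```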
Related terms: Softmax, Language Model, Top-p (Nucleus) Sampling, Temperature (sampling)
Discussed in:
- Chapter 12: Sequence Models