Top-$k$ sampling for autoregressive language models: at each generation step, restrict the candidate set to the $k$ most probable tokens, renormalise their probabilities to sum to 1, and sample.
Algorithm: at step $t$, given logits $z = (z_1, \ldots, z_V)$ over a vocabulary of size $V$:
- Find the indices $\mathcal{T}_k$ of the $k$ largest entries.
- Form modified logits $\tilde z_i = z_i$ for $i \in \mathcal{T}_k$, $\tilde z_i = -\infty$ otherwise.
- Sample $x_{t+1} \sim \mathrm{softmax}(\tilde z / T)$ where $T$ is temperature.
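A minimal NumPy sketch of the three steps above (the function name `top_k_sample`, the 32k vocabulary size, and the random logits are illustrative assumptions, not from the source):

```python
import numpy as np

def top_k_sample(logits: np.ndarray, k: int, temperature: float = 1.0,
                 rng: np.random.Generator | None = None) -> int:
    """Sample a token id from the top-k truncated, temperature-scaled softmax."""
    if rng is None:
        rng = np.random.default_rng()
    # Step 1: indices of the k largest logits (argpartition is O(V), no full sort).
    top_idx = np.argpartition(logits, -k)[-k:]
    # Steps 2-3: temperature-scaled softmax over the survivors only; the other
    # V - k tokens are implicitly at -inf. Subtracting the max keeps exp stable.
    scaled = logits[top_idx] / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Sample within the truncated set and map back to a vocabulary index.
    return int(rng.choice(top_idx, p=probs))

# Example: one decoding step over a hypothetical 32k-token vocabulary.
logits = np.random.default_rng(0).normal(size=32_000)
token = top_k_sample(logits, k=50, temperature=0.8)
```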
Strengths:
- Eliminates the long tail of unlikely tokens. The model often spreads small per-token but cumulatively non-trivial probability mass across many implausible tokens; top-$k$ removes this tail from consideration.
- More natural-sounding output than greedy or beam search; more focused than pure sampling.
Weaknesses:
- Fixed $k$ doesn't adapt to context. For high-confidence steps (e.g., after "The United", nearly all mass falls on "States"), $k = 50$ wastes slots on implausible alternatives. For high-entropy steps (e.g., the first word of a creative story), $k = 50$ may exclude many equally valid options; see the numeric sketch after this list.
- This motivates top-$p$ (nucleus) sampling, which truncates at a fixed cumulative probability mass rather than a fixed token count.
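A small numeric illustration of this non-adaptivity (all logit values and token counts below are synthetic, chosen only to make the contrast visible):

```python
import numpy as np

V, k = 32_000, 50

# Peaked (high-confidence) step: one token dominates.
peaked = np.full(V, -10.0)
peaked[0] = 10.0
# Flat (high-entropy) step: ~500 near-equally plausible tokens.
flat = np.zeros(V)
flat[:500] = 6.0

for name, z in (("peaked", peaked), ("flat", flat)):
    p = np.exp(z - z.max())
    p /= p.sum()
    top = np.sort(p)[::-1][:k]  # the k largest renormalisation candidates
    print(f"{name}: top-{k} mass = {top.sum():.3f}, "
          f"mass in slots 2..{k} = {top[1:].sum():.2e}")
```

On the peaked step, 49 of the 50 kept slots carry roughly $10^{-7}$ of the total mass; on the flat step, the 50 kept tokens cover under 10% of the mass spread over the ~500 comparably plausible options.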
Typical values: $k = 40$ to $100$ for general-purpose LLM generation.
Combined with temperature: top-$k$ truncation first, then the temperature-scaled softmax over the truncated set, as in the sketch above. Because dividing by $T > 0$ preserves the ordering of the logits, applying temperature before or after truncation selects the same $k$ tokens and yields the same renormalised distribution. Common in production deployments.
Top-$k$ filtering for inference efficiency: even when not needed for sampling quality, keeping only the top $k$ logits per step means the exponentiation and normalisation run over $k$ values instead of $V$, which matters for very large vocabularies; a sketch follows.
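A sketch of that pattern in PyTorch (an assumed dependency; `topk_probs` is a hypothetical helper name): `torch.topk` does one selection pass over the $V$ logits, and the softmax then exponentiates only $k$ values per row.

```python
import torch

def topk_probs(logits: torch.Tensor, k: int, temperature: float = 1.0):
    """Renormalised top-k distribution without exponentiating all V logits."""
    vals, idx = torch.topk(logits, k, dim=-1)          # one selection pass over V
    probs = torch.softmax(vals / temperature, dim=-1)  # softmax over k values only
    return probs, idx

# One batched decoding step: sample inside the truncated set,
# then map the sampled positions back to vocabulary ids.
logits = torch.randn(4, 32_000)                        # (batch, vocab)
probs, idx = topk_probs(logits, k=50, temperature=0.8)
next_ids = idx.gather(-1, torch.multinomial(probs, num_samples=1))
```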
Related terms: Softmax, Language Model, Top-p (Nucleus) Sampling, Temperature (sampling)
Discussed in:
- Chapter 12: Sequence Models