Glossary

Greedy Decoding

Greedy decoding is the simplest decoding method for autoregressive sequence models. At each step:

$$x_{t+1} = \arg\max_w P(w | x_{1:t})$$

Take the most probable next token, append, repeat until end-of-sequence or maximum length.
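
As a concrete sketch in Python: the loop below assumes a Hugging Face-style causal LM whose forward pass returns an object with a `.logits` tensor of shape `(batch, seq_len, vocab)`. The function itself is illustrative, not taken from any particular library.

```python
import torch

def greedy_decode(model, input_ids, eos_token_id, max_new_tokens=50):
    """Greedy decoding: repeatedly append the argmax next token.

    input_ids: LongTensor of shape (1, t) holding the prompt tokens.
    """
    with torch.no_grad():
        for _ in range(max_new_tokens):
            logits = model(input_ids).logits         # (1, t, vocab)
            next_id = logits[0, -1].argmax()         # most probable next token
            input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)
            if next_id.item() == eos_token_id:       # stop at end-of-sequence
                break
    return input_ids
```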

Strengths:

  • Deterministic: same input → same output (modulo numerics).
  • Fast: one forward pass per token, no sampling overhead.
  • Simple to implement: a single argmax over the vocabulary at each step.

Weaknesses:

  • Locally optimal but globally sub-optimal: high-probability next tokens can lead the sequence into low-probability regions. The highest-probability sequence often requires accepting a slightly lower-probability token early on to avoid getting stuck (see the worked example after this list).
  • Repetition pathology: greedy decoding on language models often produces "the the the the" or "$x$ said $x$ said $x$ said" loops, because once a high-probability sequence pattern starts, every continuation reinforces it.
  • Boring outputs: high-probability text is by definition typical, lacking the variety humans expect.
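
Here is the sub-optimality failure in miniature, with made-up probabilities for a two-step generation: greedy commits to the locally best first token and ends up with the lower-probability sequence overall.

```python
# Hypothetical two-step distribution: P(first token), then the best
# available continuation probability after each first token.
step1 = {"A": 0.5, "B": 0.4}
best_continuation = {"A": 0.3, "B": 0.9}

greedy_first = max(step1, key=step1.get)   # greedy picks "A"
greedy_seq_prob = step1[greedy_first] * best_continuation[greedy_first]
best_seq_prob = max(p * best_continuation[t] for t, p in step1.items())

print(greedy_seq_prob)   # 0.15  (A, then its best continuation)
print(best_seq_prob)     # 0.36  (B, then its best continuation)
```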

Use cases:

  • Tasks with a single correct answer: classification, structured prediction, mathematical computation, code generation with verification.
  • Reproducibility: scientific evaluation requiring deterministic outputs.
  • As a special case: beam search with width 1 is exactly equivalent to greedy decoding.

Avoid for: open-ended creative generation, dialogue, and free-form text completion, where sampling-based methods (top-$p$, top-$k$) produce dramatically more natural and varied outputs.

In modern LLM APIs: setting the temperature to 0 (or near 0) yields greedy decoding. This is the standard convention for "deterministic" inference modes, though in practice batching and non-deterministic GPU kernels can still introduce small run-to-run variations.
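
A sketch using the OpenAI Python SDK (v1 interface); other providers expose an equivalent `temperature` parameter, and the model name here is only an example.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",                       # example model name
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
    temperature=0,                             # greedy / near-greedy decoding
)
print(response.choices[0].message.content)
```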

Repetition penalty (Keskar et al. 2019): a soft fix for repetition that keeps greedy decoding intact. Before the argmax, divide the logits of recently generated tokens by a penalty factor $\rho > 1$ (implementations typically multiply negative logits by $\rho$ instead, since dividing a negative logit would raise its probability). With a modest $\rho$ (the original paper suggests $\approx 1.2$), sustained repetition is suppressed without sacrificing the determinism of greedy.
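
A sketch of the sign-aware form in PyTorch; the function name is illustrative and mirrors common open-source implementations rather than a specific library API.

```python
import torch

def apply_repetition_penalty(logits, generated_ids, rho=1.2):
    """Penalize the logits of already-generated tokens before the argmax.

    logits: 1-D FloatTensor over the vocabulary for the next position.
    generated_ids: 1-D LongTensor of token ids produced so far.
    """
    scores = logits[generated_ids]
    # Divide positive logits, multiply negative ones: both lower probability.
    logits[generated_ids] = torch.where(scores > 0, scores / rho, scores * rho)
    return logits

# Usage inside a greedy loop: penalize, then take the argmax as usual.
# next_id = apply_repetition_penalty(logits[0, -1], input_ids[0]).argmax()
```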

Argmax of softmax = argmax of logits: softmax is strictly increasing, so it preserves the ordering of the logits. Greedy decoding therefore never needs to compute the softmax; the argmax of the raw logits is sufficient. This saves a small amount of compute and sidesteps floating-point overflow in the exponentials.
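
A quick numerical check with NumPy on arbitrary example logits:

```python
import numpy as np

logits = np.array([2.0, -1.0, 0.5, 3.2])
probs = np.exp(logits - logits.max())    # numerically stable softmax numerator
probs /= probs.sum()

# Softmax is strictly increasing, so the ordering (and the argmax) is preserved.
assert np.argmax(logits) == np.argmax(probs)
```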

Related terms: Beam Search, Top-k Sampling, Top-p (Nucleus) Sampling, Language Model
