Beam search is a heuristic search for decoding sequence models. At each step it maintains the top-$k$ partial sequences ("the beam") according to cumulative log-probability, expanding each by one token and keeping only the best $k$ overall. Width $k$ trades quality against compute.
Algorithm:
- Initialise the beam with the start token: $B_0 = \{(\langle\mathrm{start}\rangle, 0)\}$.
- At step $t$, for each $(s, \log p) \in B_{t-1}$ and each candidate token $w$, form the extended sequence $s \cdot w$ with cumulative score $\log p + \log P(w \mid s)$.
- Keep the top-$k$ extended sequences (across all beams) as $B_t$.
- Sequences that produce the end-of-sequence token are removed from the active beam and added to a finished pool.
- Stop when all beams are finished or maximum length reached.
- Return the highest-scoring finished sequence.
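The steps above can be sketched in plain Python. Here `toy_model` is a hypothetical three-token language model (conditioning only on the last token) used purely for illustration:

```python
import math

def beam_search(next_token_logprobs, k, start, end, max_len):
    """Keep the k best partial sequences by cumulative log-probability."""
    beam = [([start], 0.0)]            # B_0 = {(<start>, 0)}
    finished = []                      # pool of completed sequences
    for _ in range(max_len):
        candidates = []
        for seq, score in beam:
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for seq, score in candidates[:k]:      # top-k extended sequences
            if seq[-1] == end:
                finished.append((seq, score))  # move to finished pool
            else:
                beam.append((seq, score))
        if not beam:                   # every beam has finished
            break
    finished.extend(beam)              # beams cut off at max length
    return max(finished, key=lambda c: c[1])

# Hypothetical toy model: log-probs conditioned on the last token only.
def toy_model(seq):
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a":   {"b": 0.7, "</s>": 0.3},
        "b":   {"a": 0.2, "</s>": 0.8},
    }
    return {t: math.log(p) for t, p in table[seq[-1]].items()}
```

With $k = 2$, `beam_search(toy_model, 2, "<s>", "</s>", 5)` returns the sequence `["<s>", "a", "b", "</s>"]` with score $\log(0.6 \cdot 0.7 \cdot 0.8)$.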
Greedy decoding is the special case $k = 1$: at each step, pick the single most probable next token. It is fast, but an early locally-optimal choice can exclude higher-scoring continuations, and in open-ended generation it often yields repetitive output.
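As a minimal sketch of the $k = 1$ case (with the same kind of hypothetical toy model as above):

```python
import math

def greedy_decode(next_token_logprobs, start, end, max_len):
    """Greedy decoding = beam search with beam width k = 1."""
    seq = [start]
    for _ in range(max_len):
        logprobs = next_token_logprobs(seq)
        tok = max(logprobs, key=logprobs.get)  # single most probable token
        seq.append(tok)
        if tok == end:
            break
    return seq

# Hypothetical toy model over tokens "a", "b", "</s>".
def toy_model(seq):
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a":   {"b": 0.7, "</s>": 0.3},
        "b":   {"a": 0.2, "</s>": 0.8},
    }
    return {t: math.log(p) for t, p in table[seq[-1]].items()}
```

In this toy model greedy and beam search agree; in general, greedy can commit to a high-probability first token whose continuations score worse overall.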
Length normalisation: cumulative log-probability inherently favours short sequences, since every additional token contributes a negative term to the sum. Standard fixes:
- Average log-probability: divide by length.
- Length-normalised score $\log p \,/\, \mathrm{lp}(l)$ with penalty $\mathrm{lp}(l) = \left(\frac{5 + l}{6}\right)^{\alpha}$ (Wu et al. 2016, Google's NMT; $\alpha \approx 0.6$ in practice).
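A sketch of the Wu et al. penalty (the function names here are my own, not from the paper):

```python
def gnmt_length_penalty(length, alpha=0.6):
    # Wu et al. (2016): lp(l) = ((5 + l) / 6) ** alpha
    return ((5 + length) / 6) ** alpha

def length_normalised_score(log_prob, length, alpha=0.6):
    # Divide the cumulative log-probability by the length penalty.
    return log_prob / gnmt_length_penalty(length, alpha)
```

At $l = 1$ the penalty is exactly 1, and with $\alpha = 0$ it reduces to the raw cumulative score; larger $\alpha$ boosts longer sequences more strongly.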
Diverse beam search (Vijayakumar et al. 2018): partition the beam into groups decoded in sequence, penalising each group's token choices for similarity to the tokens already selected by earlier groups at the same step. Produces more varied outputs.
Constrained beam search: hard constraints (must include certain words) or soft constraints (penalise certain patterns) can be incorporated into the beam scoring.
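One way soft constraints can be folded into the beam score — an illustrative penalty/bonus adjustment with made-up weights, not a specific published method:

```python
def adjusted_score(score, seq, required=(), banned=(), weight=5.0):
    """Fold soft constraints into a beam candidate's score (illustrative weights)."""
    for w in required:
        if w not in seq:
            score -= weight  # penalise candidates still missing a required word
    for w in banned:
        if w in seq:
            score -= weight  # penalise candidates containing a banned pattern
    return score
```

Hard constraints would instead prune violating candidates outright before the top-$k$ selection.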
In machine translation, beam search with $k = 4$ to $10$ has been the standard decoding method for two decades.
In modern LLMs, beam search has fallen out of favour for open-ended generation. Reasons:
- Repetition / dullness: high-probability sequences are often boring.
- Mode collapse: large beams converge to similar outputs.
- Sampling-based methods (top-$k$, top-$p$, temperature) produce more diverse and human-like text.
Beam search remains the standard for constrained generation (e.g. machine translation, code generation, structured output, retrieval-augmented question answering) where high probability is the right objective.
Time complexity: $O(T \cdot k \cdot V)$ for $T$ steps, beam width $k$, and vocabulary size $V$: each step scores $V$ continuations for each of the $k$ beams. For $V$ in the tens of thousands and $k = 5$, this is comparable to several greedy passes. In practice only each beam's own top-$k$ logits need to be kept before the global selection, since no candidate outside a beam's local top $k$ can survive it.
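For instance, the global top-$k$ over the $k \cdot V$ pooled candidates can be taken with a heap instead of a full sort (stdlib `heapq`):

```python
import heapq

def select_top_k(candidates, k):
    # candidates: iterable of (sequence, cumulative_logprob) pairs.
    # nlargest runs in O(n log k) rather than the O(n log n) of a full sort.
    return heapq.nlargest(k, candidates, key=lambda c: c[1])
```

`heapq.nlargest` returns the winners already ordered best-first, which is convenient for forming the next beam $B_t$.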
Related terms: Language Model, Greedy Decoding, Top-k Sampling
Discussed in:
- Chapter 12: Sequence Models