Beam search is a heuristic search for decoding sequence models. At each step it maintains the top-$k$ partial sequences ("the beam") according to cumulative log-probability, expanding each by one token and keeping only the best $k$ overall. Width $k$ trades quality against compute.
Algorithm:
- Initialise the beam with the start token: $B_0 = \{(\langle\mathrm{start}\rangle, 0)\}$.
- At step $t$, for each $(s, \log p) \in B_{t-1}$ and each candidate token $w$, form the extended sequence $s \cdot w$ with cumulative score $\log p + \log P(w \mid s)$.
- Keep the top-$k$ extended sequences (across all beams) as $B_t$.
- Sequences that produce the end-of-sequence token are removed from the active beam and added to a finished pool.
- Stop when all beams are finished or maximum length reached.
- Return the highest-scoring finished sequence.
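The steps above can be sketched in plain Python. Here `toy_model` is a hypothetical three-token language model (conditioning only on the last token) used purely for illustration:

```python
import math

def beam_search(next_token_logprobs, k, start, end, max_len):
    """Keep the k best partial sequences by cumulative log-probability."""
    beam = [([start], 0.0)]            # B_0 = {(<start>, 0)}
    finished = []                      # pool of completed sequences
    for _ in range(max_len):
        candidates = []
        for seq, score in beam:
            for tok, lp in next_token_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for seq, score in candidates[:k]:      # top-k extended sequences
            if seq[-1] == end:
                finished.append((seq, score))  # move to finished pool
            else:
                beam.append((seq, score))
        if not beam:                   # every beam has finished
            break
    finished.extend(beam)              # beams cut off at max length
    return max(finished, key=lambda c: c[1])

# Hypothetical toy model: log-probs conditioned on the last token only.
def toy_model(seq):
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a":   {"b": 0.7, "</s>": 0.3},
        "b":   {"a": 0.2, "</s>": 0.8},
    }
    return {t: math.log(p) for t, p in table[seq[-1]].items()}
```

With $k = 2$, `beam_search(toy_model, 2, "<s>", "</s>", 5)` returns the sequence `["<s>", "a", "b", "</s>"]` with score $\log(0.6 \cdot 0.7 \cdot 0.8)$.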
Greedy decoding is the special case $k = 1$: at each step, pick the single most probable next token. It is fast, but an early locally-optimal choice can exclude higher-scoring continuations, and in open-ended generation it often yields repetitive output.
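As a minimal sketch of the $k = 1$ case (with the same kind of hypothetical toy model as above):

```python
import math

def greedy_decode(next_token_logprobs, start, end, max_len):
    """Greedy decoding = beam search with beam width k = 1."""
    seq = [start]
    for _ in range(max_len):
        logprobs = next_token_logprobs(seq)
        tok = max(logprobs, key=logprobs.get)  # single most probable token
        seq.append(tok)
        if tok == end:
            break
    return seq

# Hypothetical toy model over tokens "a", "b", "</s>".
def toy_model(seq):
    table = {
        "<s>": {"a": 0.6, "b": 0.4},
        "a":   {"b": 0.7, "</s>": 0.3},
        "b":   {"a": 0.2, "</s>": 0.8},
    }
    return {t: math.log(p) for t, p in table[seq[-1]].items()}
```

In this toy model greedy and beam search agree; in general, greedy can commit to a high-probability first token whose continuations score worse overall.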
Length normalisation: cumulative log-probability inherently favours short sequences, since every additional token contributes a negative term to the sum. Standard fixes:
- Average log-probability: divide by length.
- Length-normalised score $\log p \,/\, \mathrm{lp}(l)$ with penalty $\mathrm{lp}(l) = \left(\frac{5 + l}{6}\right)^{\alpha}$ (Wu et al. 2016, Google's NMT; $\alpha \approx 0.6$ in practice).
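A sketch of the Wu et al. penalty (the function names here are my own, not from the paper):

```python
def gnmt_length_penalty(length, alpha=0.6):
    # Wu et al. (2016): lp(l) = ((5 + l) / 6) ** alpha
    return ((5 + length) / 6) ** alpha

def length_normalised_score(log_prob, length, alpha=0.6):
    # Divide the cumulative log-probability by the length penalty.
    return log_prob / gnmt_length_penalty(length, alpha)
```

At $l = 1$ the penalty is exactly 1, and with $\alpha = 0$ it reduces to the raw cumulative score; larger $\alpha$ boosts longer sequences more strongly.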
Diverse beam search (Vijayakumar et al. 2018): partition the beam into groups decoded in sequence, penalising each group's token choices for similarity to the tokens already selected by earlier groups at the same step. Produces more varied outputs.
Constrained beam search: hard constraints (must include certain words) or soft constraints (penalise certain patterns) can be incorporated into the beam scoring.
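One way soft constraints can be folded into the beam score — an illustrative penalty/bonus adjustment with made-up weights, not a specific published method:

```python
def adjusted_score(score, seq, required=(), banned=(), weight=5.0):
    """Fold soft constraints into a beam candidate's score (illustrative weights)."""
    for w in required:
        if w not in seq:
            score -= weight  # penalise candidates still missing a required word
    for w in banned:
        if w in seq:
            score -= weight  # penalise candidates containing a banned pattern
    return score
```

Hard constraints would instead prune violating candidates outright before the top-$k$ selection.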
In machine translation, beam search with $k = 4$ to $10$ has been the standard decoding method for two decades.
In modern LLMs, beam search has fallen out of favour for open-ended generation. Reasons:
- Repetition / dullness: high-probability sequences are often boring.
- Mode collapse: large beams converge to similar outputs.
- Sampling-based methods (top-$k$, top-$p$, temperature) produce more diverse and human-like text.
Beam search remains the standard for constrained generation (e.g. machine translation, code generation, structured output, retrieval-augmented question answering) where high probability is the right objective.
Time complexity: $O(T \cdot k \cdot V)$ for $T$ steps, beam width $k$, and vocabulary size $V$: each step scores $V$ continuations for each of the $k$ beams. For $V$ in the tens of thousands and $k = 5$, this is comparable to several greedy passes. In practice only each beam's own top-$k$ logits need to be kept before the global selection, since no candidate outside a beam's local top $k$ can survive it.
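For instance, the global top-$k$ over the $k \cdot V$ pooled candidates can be taken with a heap instead of a full sort (stdlib `heapq`):

```python
import heapq

def select_top_k(candidates, k):
    # candidates: iterable of (sequence, cumulative_logprob) pairs.
    # nlargest runs in O(n log k) rather than the O(n log n) of a full sort.
    return heapq.nlargest(k, candidates, key=lambda c: c[1])
```

`heapq.nlargest` returns the winners already ordered best-first, which is convenient for forming the next beam $B_t$.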
Related terms: Language Model, Greedy Decoding, Top-k Sampling
Discussed in:
- Chapter 12: Sequence Models