Glossary

Beam Search

Beam search is a heuristic search algorithm for decoding sequence models. At each step it maintains the top-$k$ partial sequences ("the beam") ranked by cumulative log-probability, extends each by one token, and keeps only the best $k$ overall. The beam width $k$ trades quality against compute.

Algorithm:

  1. Initialise the beam with the start token: $B_0 = \{(\langle\mathrm{start}\rangle, 0)\}$.
  2. At step $t$, for each $(s, \log p) \in B_{t-1}$ and each candidate token $w$:
    • Form the extended sequence $s \cdot w$ with cumulative score $\log p + \log P(w | s)$.
  3. Keep the top-$k$ extended sequences as $B_t$.
  4. Sequences that produce the end-of-sequence token are removed from the active beam and added to a finished pool.
  5. Stop when all beams are finished or the maximum length is reached.
  6. Return the highest-scoring finished sequence.
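
The numbered steps map onto a minimal Python sketch, assuming a `log_probs(seq)` callable that returns a dict of log-probabilities over candidate next tokens (the model interface, the `start`/`eos` markers, and the defaults are illustrative, not a fixed API):

```python
import heapq

def beam_search(log_probs, start, eos, k=5, max_len=50):
    """log_probs(seq) -> {token: log P(token | seq)} over candidate next tokens."""
    beam = [(0.0, [start])]                  # step 1: (cumulative log-prob, sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beam:              # step 2: extend each partial sequence
            for tok, lp in log_probs(seq).items():
                candidates.append((score + lp, seq + [tok]))
        # step 3: keep only the k best extensions overall
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])
        active = []                          # step 4: retire finished sequences
        for score, seq in beam:
            (finished if seq[-1] == eos else active).append((score, seq))
        beam = active
        if not beam:                         # step 5: every beam has finished
            break
    finished.extend(beam)                    # sequences cut off at max_len still count
    return max(finished, key=lambda c: c[0]) # step 6: best finished sequence
```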

Greedy decoding is the special case $k = 1$: at each step, pick the most probable next token. Fast but typically produces repetitive or sub-optimal output.
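
Written out directly, greedy decoding is the sketch above with the beam loop removed (same hypothetical `log_probs` interface):

```python
def greedy_decode(log_probs, start, eos, max_len=50):
    # The k = 1 special case: always take the single most probable next token.
    seq = [start]
    for _ in range(max_len):
        dist = log_probs(seq)
        seq.append(max(dist, key=dist.get))
        if seq[-1] == eos:
            break
    return seq
```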

Length normalisation: cumulative log-probability inherently penalises long sequences, since each added log-probability term is negative. Standard fixes:

  • Average log-probability: divide by length.
  • Length-normalised score $\frac{6^\alpha}{(5 + l)^\alpha} \log p$ for sequence length $l$ (Wu et al. 2016, Google's NMT).
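
Both fixes as a sketch ($\alpha = 0.6$ is a commonly reported setting; treat it as illustrative):

```python
def average_log_prob(log_p, length):
    # Fix 1: mean per-token log-probability.
    return log_p / length

def gnmt_score(log_p, length, alpha=0.6):
    # Fix 2: Wu et al. (2016) length penalty; score = log_p / lp(length),
    # where lp(l) = (5 + l)^alpha / 6^alpha.
    return log_p / (((5 + length) ** alpha) / (6 ** alpha))
```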

Diverse beam search (Vijayakumar et al. 2018): partition the beam into groups, with each group's selection penalised by similarity to the others. Produces more varied outputs.
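
The paper's Hamming diversity is one such similarity measure: a candidate is penalised for repeating tokens already chosen by earlier groups at the same step. A minimal sketch (the `strength` weight is a tunable hyperparameter, not a value from the paper):

```python
def hamming_diversity_penalty(candidate_token, tokens_chosen_by_earlier_groups, strength=0.5):
    # Subtract from a candidate's score in proportion to how many earlier
    # groups already picked this token at the current step.
    return strength * tokens_chosen_by_earlier_groups.count(candidate_token)
```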

Constrained beam search: hard constraints (must include certain words) or soft constraints (penalise certain patterns) can be incorporated into the beam scoring.
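
A soft constraint can be folded into the score directly; a sketch, with an illustrative `banned` set and penalty weight:

```python
def soft_constrained_score(base_log_p, seq, banned=frozenset(), penalty=5.0):
    # Subtract a fixed penalty per banned token in the sequence. Hard
    # constraints would instead prune violating candidates outright, or track
    # which required words each hypothesis has covered before it may finish.
    return base_log_p - penalty * sum(1 for tok in seq if tok in banned)
```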

In machine translation, beam search with $k = 4$ to $10$ has been the standard decoding method for two decades.

In modern LLMs, beam search has fallen out of favour for open-ended generation. Reasons:

  • Repetition / dullness: high-probability sequences are often boring.
  • Mode collapse: large beams converge to similar outputs.
  • Sampling-based methods (top-$k$, top-$p$, temperature) produce more diverse and human-like text.

Beam search remains the standard for constrained generation (e.g. machine translation, code generation, structured output, retrieval-augmented question answering) where high probability is the right objective.

Time complexity: $O(T \cdot k \cdot V)$ for $T$ steps, beam width $k$, and vocabulary size $V$. For $V$ in the tens of thousands and $k = 5$, this is comparable to several greedy passes. In practice, a partial top-$k$ selection over each beam's logits reduces the global ranking to $k^2$ candidates rather than $k \cdot V$.
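
A sketch of that pruning step with NumPy (array shapes and names are illustrative):

```python
import numpy as np

def prune_candidates(beam_scores, logprobs, k):
    """beam_scores: (k,) cumulative scores; logprobs: (k, V) next-token log-probs."""
    # Partial sort: keep each beam's k best next tokens in O(V) per beam.
    top_idx = np.argpartition(logprobs, -k, axis=1)[:, -k:]    # (k, k) token ids
    top_lp = np.take_along_axis(logprobs, top_idx, axis=1)     # (k, k) log-probs
    cand = beam_scores[:, None] + top_lp                       # candidate scores
    # Global ranking over k*k candidates instead of k*V.
    flat = np.argsort(cand, axis=None)[-k:]
    beams, slots = np.unravel_index(flat, cand.shape)
    return beams, top_idx[beams, slots], cand[beams, slots]
```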

Related terms: Language Model, Greedy Decoding, Top-k Sampling
