BLEU (Bilingual Evaluation Understudy, Papineni et al. 2002) is the standard automatic evaluation metric for machine translation. It compares a candidate translation to one or more reference translations by counting matching n-grams.
Modified n-gram precision: for n-gram length $n$, count how many of the candidate's n-grams also appear in a reference, with clipping: each candidate n-gram is credited at most the maximum number of times it occurs in any single reference, so repeating a matched n-gram earns no extra reward.
$$p_n = \frac{\sum_{g \,\in\, \mathrm{cand}} \min\bigl(\mathrm{count}_{\mathrm{cand}}(g),\; \max_{r \,\in\, \mathrm{refs}} \mathrm{count}_{r}(g)\bigr)}{\sum_{g \,\in\, \mathrm{cand}} \mathrm{count}_{\mathrm{cand}}(g)}$$
where the sums run over the distinct n-grams $g$ of the candidate.
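A minimal sketch of this clipped precision (pure Python; tokenized inputs assumed, identifier names illustrative):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped n-gram precision for one tokenized candidate sentence."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    if not cand_counts:
        return 0.0
    # Clip each candidate n-gram count at its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```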
BLEU score:
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^N w_n \log p_n\right)$$
where typically $N = 4$ (1-grams through 4-grams), $w_n = 1/N$, and BP is the brevity penalty:
$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{if } c \leq r \end{cases}$$
with $c$ the candidate length and $r$ the effective reference length (with multiple references, the reference length closest to $c$). The brevity penalty discourages overly short translations, which would otherwise inflate precision.
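Putting the formulas together, a self-contained sketch of single-segment BLEU-4 (uniform weights, no smoothing; production implementations pool counts over the whole test set):

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Single-segment BLEU with uniform weights w_n = 1/max_n; a sketch only."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for g, c in ngrams_union(references, n, ngrams, max_ref):
            pass  # populated below
    return None

def ngrams_union(references, n, ngrams, max_ref):
    return []
```

Correction to keep the sketch straightforward, in one piece:

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Single-segment BLEU with uniform weights w_n = 1/max_n; no smoothing."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_p_sum += math.log(clipped / total) / max_n

    # Brevity penalty: r is the reference length closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p_sum)
```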
BLEU ranges from 0 to 1 and is commonly reported scaled to a percentage: a BLEU-4 of 0.25 is reported as "BLEU 25". Higher is better.
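In practice the 0-100 figure is usually computed with a standard tool such as sacrebleu; a minimal usage example (assuming the package is installed):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]  # one reference stream; add more lists for multi-reference

result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)  # already on the 0-100 scale: "BLEU 25" means 25.0
```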
Properties:
- Corpus-level metric: n-gram counts are pooled over the whole test set before precisions are computed, which gives statistically reliable system comparisons; this is not the same as averaging per-sentence scores (see the sketch after this list).
- Sentence-level BLEU is noisy; meaningful comparisons require corpus-level aggregation over hundreds of sentences.
- Correlates strongly with human judgments at the system level; this was the original 2002 paper's main empirical claim and has been validated repeatedly since.
- Correlates poorly at the sentence level; BLEU is a weak signal for ranking individual translations.
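The corpus/sentence distinction is easy to see with NLTK, which exposes both aggregations (the smoothing choice here is an illustrative assumption):

```python
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

refs = [[["the", "cat", "is", "on", "the", "mat"]],
        [["there", "is", "a", "dog", "outside"]]]
hyps = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "is", "outside"]]

smooth = SmoothingFunction().method1
# Corpus BLEU pools n-gram counts over all segments before computing precisions...
pooled = corpus_bleu(refs, hyps, smoothing_function=smooth)
# ...which generally differs from the mean of per-sentence scores.
averaged = sum(sentence_bleu(r, h, smoothing_function=smooth)
               for r, h in zip(refs, hyps)) / len(hyps)
print(pooled, averaged)
```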
Limitations:
- Surface-form matching only: paraphrases, synonyms, and legitimate word-order variation get no credit (demonstrated after this list).
- No semantic understanding: a single word substitution can sharply lower BLEU even when the meaning is preserved.
- Reference-set dependence: with a single reference, perfectly valid alternative translations can score near zero.
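The surface-form problem in miniature, using sacrebleu's sentence-level score purely for illustration:

```python
import sacrebleu

reference = ["the economy grew quickly last year"]

# Exact match: perfect score.
print(sacrebleu.sentence_bleu("the economy grew quickly last year", reference).score)  # 100.0
# One synonym swap preserves the meaning but sharply lowers the score.
print(sacrebleu.sentence_bleu("the economy expanded rapidly last year", reference).score)
```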
Modern alternatives:
- METEOR (Banerjee & Lavie 2005): incorporates stemming and synonym matching.
- BERTScore (Zhang et al. 2019): uses BERT embeddings to compute soft token similarity rather than exact n-gram match; widely preferred in modern evaluation (usage sketch after this list).
- BLEURT (Sellam, Das & Parikh 2020): learned metric trained on human judgments.
- COMET (Rei, Stewart, Farinha & Lavie 2020): learned metric specific to MT quality estimation.
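For contrast with n-gram matching, a minimal BERTScore call via the `bert-score` package (the default English model is downloaded on first use; treat this as a sketch):

```python
from bert_score import score

candidates = ["the economy expanded rapidly last year"]
references = ["the economy grew quickly last year"]

# Embedding-based soft similarity: the synonym swap above barely moves F1.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```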
For text generation in general:
- ROUGE (Lin 2004): a recall-oriented counterpart used mainly for summarisation.
- chrF (Popović 2015): character-level n-gram F-score.
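chrF is also available in sacrebleu with the same calling convention (a sketch):

```python
import sacrebleu

hyps = ["the cat sat on the mat"]
refs = [["the cat is on the mat"]]
print(sacrebleu.corpus_chrf(hyps, refs).score)  # character n-gram F-score, 0-100 scale
```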
BLEU is now considered a baseline rather than a state-of-the-art metric, but it remains widely reported because of its stability and decades of established benchmarks.
Related terms: Cross-Entropy Loss, Machine Translation
Discussed in:
- Chapter 12: Sequence Models