BLEU (Bilingual Evaluation Understudy, Papineni et al. 2002) is the standard automatic evaluation metric for machine translation. It compares a candidate translation to one or more reference translations by counting matching n-grams.
Modified n-gram precision: for n-gram length $n$, count how many of the candidate's n-grams also appear in a reference, with clipping: each candidate n-gram is credited at most the maximum number of times it occurs in any single reference, so repeating a matched n-gram earns no extra reward.
$$p_n = \frac{\sum_{g \,\in\, \mathrm{cand}} \min\bigl(\mathrm{count}_{\mathrm{cand}}(g),\; \max_{r \,\in\, \mathrm{refs}} \mathrm{count}_{r}(g)\bigr)}{\sum_{g \,\in\, \mathrm{cand}} \mathrm{count}_{\mathrm{cand}}(g)}$$
where the sums run over the distinct n-grams $g$ of the candidate.
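A minimal sketch of this clipped precision (pure Python; tokenized inputs assumed, identifier names illustrative):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped n-gram precision for one tokenized candidate sentence."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand_counts = ngrams(candidate)
    if not cand_counts:
        return 0.0
    # Clip each candidate n-gram count at its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())
```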
BLEU score:
$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\sum_{n=1}^N w_n \log p_n\right)$$
where typically $N = 4$ (1-grams through 4-grams), $w_n = 1/N$, and BP is the brevity penalty:
$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp(1 - r/c) & \text{if } c \leq r \end{cases}$$
with $c$ the candidate length and $r$ the effective reference length (with multiple references, the reference length closest to $c$). The brevity penalty discourages overly short translations, which would otherwise inflate precision.
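Putting the formulas together, a self-contained sketch of single-segment BLEU-4 (uniform weights, no smoothing; production implementations pool counts over the whole test set):

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Single-segment BLEU with uniform weights w_n = 1/max_n; a sketch only."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for g, c in ngrams_union(references, n, ngrams, max_ref):
            pass  # populated below
    return None

def ngrams_union(references, n, ngrams, max_ref):
    return []
```

Correction to keep the sketch straightforward, in one piece:

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Single-segment BLEU with uniform weights w_n = 1/max_n; no smoothing."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero precision drives the geometric mean to zero
        log_p_sum += math.log(clipped / total) / max_n

    # Brevity penalty: r is the reference length closest to the candidate's.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p_sum)
```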
BLEU ranges from 0 to 1 and is commonly reported scaled to a percentage: a BLEU-4 of 0.25 is reported as "BLEU 25". Higher is better.
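In practice the 0-100 figure is usually computed with a standard tool such as sacrebleu; a minimal usage example (assuming the package is installed):

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is on the mat"]]  # one reference stream; add more lists for multi-reference

result = sacrebleu.corpus_bleu(hypotheses, references)
print(result.score)  # already on the 0-100 scale: "BLEU 25" means 25.0
```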
Properties:
- Corpus-level metric: n-gram counts are pooled over the whole test set before precisions are computed, which gives statistically reliable system comparisons; this is not the same as averaging per-sentence scores (see the sketch after this list).
- Sentence-level BLEU is noisy; meaningful comparisons require corpus-level aggregation over hundreds of sentences.
- Correlates strongly with human judgments at the system level; this was the original 2002 paper's main empirical claim and has been validated repeatedly since.
- Correlates poorly at the sentence level; BLEU is a weak signal for ranking individual translations.
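The corpus/sentence distinction is easy to see with NLTK, which exposes both aggregations (the smoothing choice here is an illustrative assumption):

```python
from nltk.translate.bleu_score import corpus_bleu, sentence_bleu, SmoothingFunction

refs = [[["the", "cat", "is", "on", "the", "mat"]],
        [["there", "is", "a", "dog", "outside"]]]
hyps = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "is", "outside"]]

smooth = SmoothingFunction().method1
# Corpus BLEU pools n-gram counts over all segments before computing precisions...
pooled = corpus_bleu(refs, hyps, smoothing_function=smooth)
# ...which generally differs from the mean of per-sentence scores.
averaged = sum(sentence_bleu(r, h, smoothing_function=smooth)
               for r, h in zip(refs, hyps)) / len(hyps)
print(pooled, averaged)
```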
Limitations:
- Surface-form matching only: paraphrases, synonyms, and legitimate word-order variation get no credit (demonstrated after this list).
- No semantic understanding: a single word substitution can sharply lower BLEU even when the meaning is preserved.
- Reference-set dependence: with a single reference, perfectly valid alternative translations can score near zero.
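The surface-form problem in miniature, using sacrebleu's sentence-level score purely for illustration:

```python
import sacrebleu

reference = ["the economy grew quickly last year"]

# Exact match: perfect score.
print(sacrebleu.sentence_bleu("the economy grew quickly last year", reference).score)  # 100.0
# One synonym swap preserves the meaning but sharply lowers the score.
print(sacrebleu.sentence_bleu("the economy expanded rapidly last year", reference).score)
```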
Modern alternatives:
- METEOR (Banerjee & Lavie 2005): incorporates stemming and synonym matching.
- BERTScore (Zhang et al. 2019): uses BERT embeddings to compute soft token similarity rather than exact n-gram match; widely preferred in modern evaluation (usage sketch after this list).
- BLEURT (Sellam, Das & Parikh 2020): learned metric trained on human judgments.
- COMET (Rei, Stewart, Farinha & Lavie 2020): learned metric specific to MT quality estimation.
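For contrast with n-gram matching, a minimal BERTScore call via the `bert-score` package (the default English model is downloaded on first use; treat this as a sketch):

```python
from bert_score import score

candidates = ["the economy expanded rapidly last year"]
references = ["the economy grew quickly last year"]

# Embedding-based soft similarity: the synonym swap above barely moves F1.
P, R, F1 = score(candidates, references, lang="en")
print(F1.mean().item())
```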
For text generation in general:
- ROUGE (Lin 2004): a recall-oriented counterpart used mainly for summarisation.
- chrF (Popović 2015): character-level n-gram F-score.
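chrF is also available in sacrebleu with the same calling convention (a sketch):

```python
import sacrebleu

hyps = ["the cat sat on the mat"]
refs = [["the cat is on the mat"]]
print(sacrebleu.corpus_chrf(hyps, refs).score)  # character n-gram F-score, 0-100 scale
```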
BLEU is now considered a baseline rather than a state-of-the-art metric, but it remains widely reported because of its stability and decades of established benchmarks.
Related terms: Cross-Entropy Loss, Machine Translation
Discussed in:
- Chapter 12: Sequence Models