12.3 N-gram language models
The Markov assumption truncates the full conditional to a fixed window of length $n - 1$:
$$P(w_t \mid w_1, \ldots, w_{t-1}) \approx P(w_t \mid w_{t-n+1}, \ldots, w_{t-1}).$$
This is the n-gram approximation. With $n = 1$ (the unigram model) the next word is independent of context. With $n = 2$ (bigrams) the next word depends only on the previous word; with $n = 3$ (trigrams) on the previous two; and so on. The maximum-likelihood estimator is the empirical relative frequency,
$$\hat P_{\mathrm{MLE}}(w_t \mid w_{t-n+1:t-1}) = \frac{c(w_{t-n+1:t})}{c(w_{t-n+1:t-1})},$$
where $c(\cdot)$ denotes the count of the given n-gram in the training corpus.
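For the bigram case, this estimator is a few lines of code. The following is a minimal sketch (Python; the function and variable names are illustrative, not from any particular toolkit) that computes $\hat P_{\mathrm{MLE}}(w \mid w_{\text{prev}})$ directly from corpus counts:

```python
from collections import Counter

def mle_bigram_model(tokens):
    """MLE bigram estimator: P(w | w_prev) = c(w_prev, w) / c(w_prev)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])  # each token counted as a left context

    def prob(w_prev, w):
        # Unseen context: the MLE is 0/0; by convention we return 0.0 here.
        if context_counts[w_prev] == 0:
            return 0.0
        return bigram_counts[(w_prev, w)] / context_counts[w_prev]

    return prob
```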
This estimator has two well-known pathologies. First, sparsity: the number of possible n-grams is $V^n$, and even a corpus of billions of tokens contains only a vanishing fraction of them. For any unseen n-gram, the MLE assigns probability zero, which makes the joint probability of any sentence containing it zero, which makes its log-likelihood $-\infty$, which breaks training and evaluation. Second, brittleness: even when an n-gram has been seen, its count may be low and its empirical probability poorly estimated.
The remedy is smoothing: redistribute probability mass from observed events to unseen ones. The simplest form, Laplace or add-one smoothing, replaces $c(w_{t-n+1:t})$ with $c(w_{t-n+1:t}) + 1$ and adds $V$ to the denominator. This works but over-smooths heavily; production systems use more refined estimators. Kneser–Ney smoothing in particular, in its modified or interpolated forms, was the state of the art for n-gram models for decades. The intuition behind Kneser–Ney is to estimate the probability of a word in a backed-off lower-order context not by its raw count but by the number of distinct contexts in which it appears: "Francisco" has a high raw count because it almost always follows "San", but it appears in very few distinct contexts and so is a poor unigram bet.
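In code, add-one smoothing is a one-line change to the MLE sketch above (again with illustrative names; Kneser–Ney is considerably more involved and omitted here):

```python
def add_one_bigram_prob(bigram_counts, context_counts, vocab_size, w_prev, w):
    """Laplace-smoothed bigram probability: (c(w_prev, w) + 1) / (c(w_prev) + V)."""
    return (bigram_counts[(w_prev, w)] + 1) / (context_counts[w_prev] + vocab_size)
```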
Worked example. Suppose the training corpus is the single sentence "the cat sat on the mat". After adding $\langle\text{BOS}\rangle$ and $\langle\text{EOS}\rangle$ tokens, the bigram counts are: $c(\langle\text{BOS}\rangle, \text{the}) = 1$, $c(\text{the}, \text{cat}) = 1$, $c(\text{cat}, \text{sat}) = 1$, $c(\text{sat}, \text{on}) = 1$, $c(\text{on}, \text{the}) = 1$, $c(\text{the}, \text{mat}) = 1$, $c(\text{mat}, \langle\text{EOS}\rangle) = 1$, with $c(\text{the}) = 2$ and all other unigrams equal to $1$ (counted as left-contexts of bigrams). Under MLE, $\hat P(\text{cat} \mid \text{the}) = c(\text{the}, \text{cat}) / c(\text{the}) = 1/2$, and $\hat P(\text{mat} \mid \text{the}) = 1/2$, while $\hat P(\text{sat} \mid \text{the}) = 0$. To score the test sentence "the cat sat", we compute
$$\hat P(\text{the cat sat}) = \hat P(\text{the} \mid \langle\text{BOS}\rangle) \cdot \hat P(\text{cat} \mid \text{the}) \cdot \hat P(\text{sat} \mid \text{cat}) \cdot \hat P(\langle\text{EOS}\rangle \mid \text{sat}) = 1 \cdot \tfrac{1}{2} \cdot 1 \cdot 0 = 0,$$
because $\hat P(\langle\text{EOS}\rangle \mid \text{sat}) = 0$: "sat" was always followed by "on" in training. Add-one smoothing with vocabulary size $V = 7$ (counting BOS and EOS) gives $\hat P_{+1}(\langle\text{EOS}\rangle \mid \text{sat}) = (0 + 1) / (1 + 7) = 1/8$, restoring a non-zero score.
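These numbers can be checked mechanically with the sketches above, writing `<s>` and `</s>` for the $\langle\text{BOS}\rangle$ and $\langle\text{EOS}\rangle$ tokens:

```python
from collections import Counter

tokens = "<s> the cat sat on the mat </s>".split()

prob = mle_bigram_model(tokens)
print(prob("the", "cat"))    # 0.5
print(prob("sat", "</s>"))   # 0.0 -- and so the whole test sentence scores 0

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])
V = 7  # the, cat, sat, on, mat, <s>, </s>
print(add_one_bigram_prob(bigram_counts, context_counts, V, "sat", "</s>"))  # 0.125
```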
Three further limitations of n-gram models, which no amount of smoothing can fix, motivate everything that follows.
Fixed and short context. Empirically, going from trigrams to 5-grams or 6-grams gives diminishing returns for exponentially increasing data and storage costs. Yet many linguistically obvious dependencies (subject–verb agreement across a relative clause, anaphora resolution) span more than five tokens.
No generalisation across similar words. The model treats every word as an atomic symbol. Having seen "the cat sat" gives it no preferential probability for "the dog sat"; those two contexts are as unrelated as "the cat sat" and "the airplane sat".
Storage and computation. Storing all 5-gram counts for a vocabulary of $10^5$ words, a space of $(10^5)^5 = 10^{25}$ possible 5-grams, exceeds practical memory unless one uses sophisticated pruning and compression. Modern n-gram toolkits (KenLM and similar) get around this with elegant data structures, but the asymptotic problem is unavoidable.
These three problems (context length, generalisation across similar words, and parameter explosion) are exactly what neural language models address.
12.3.1 Evaluating a language model: held-out perplexity in practice
When evaluating an n-gram (or any) language model on a held-out test corpus, three subtleties matter.
Out-of-vocabulary handling. The training vocabulary is finite. Any test-set word never seen in training has zero probability under MLE, and even smoothed n-gram models typically map all such words to a single $\langle\text{UNK}\rangle$ symbol. Reported perplexities therefore depend on the choice of vocabulary, and comparing perplexities across systems requires that they share the same vocabulary and tokenisation.
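A minimal version of that convention, assuming a fixed training vocabulary, rewrites the test stream before scoring (a sketch; the symbol name is illustrative):

```python
def map_oov(tokens, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with the shared UNK symbol."""
    return [t if t in vocab else unk for t in tokens]

# map_oov("the cat sat on the zyzzyva".split(), {"the", "cat", "sat", "on"})
# -> ['the', 'cat', 'sat', 'on', 'the', '<unk>']
```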
Per-word vs per-token perplexity. A subword model (BPE, WordPiece) emits more tokens than the underlying word count. Its per-token perplexity is therefore lower than its per-word perplexity. Standard practice for comparing modern subword language models to historical word-level baselines is to convert: divide total negative log-likelihood by the number of space-delimited words, not subword tokens, and exponentiate.
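The conversion itself is mechanical, as the following sketch shows; the numbers are made up for illustration, not measurements from any real system:

```python
import math

def perplexity(total_nll_nats, n_units):
    """exp of the average negative log-likelihood per unit (token or word)."""
    return math.exp(total_nll_nats / n_units)

total_nll = 3000.0   # total NLL of a test set in nats (illustrative figure)
n_subwords = 1300    # subword tokens emitted by the model
n_words = 1000       # space-delimited words in the same test set

print(perplexity(total_nll, n_subwords))  # per-token perplexity, ~10.0
print(perplexity(total_nll, n_words))     # per-word perplexity,  ~20.1
```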
Sentence boundaries. The cost of predicting the end-of-sentence token is non-negligible, about 1 bit on average for natural English. Whether $\langle\text{EOS}\rangle$ is included in the perplexity computation matters for short documents and is a common source of incomparable numbers in the literature.
These caveats are not abstract: a perplexity of 50 reported with one set of conventions can correspond to a perplexity of 80 under another. Always read the methodology section.