12.2 Discrete sequences and the language-modelling problem
Throughout this chapter we will repeatedly invoke the language-modelling problem as a concrete instance of the more general problem of probabilistic sequence modelling. Fix a finite vocabulary $\mathcal{V} = \{w_1, \ldots, w_V\}$; for now, take the words as given (we revisit how the vocabulary is constructed in §12.5). A language model is a probability distribution $P$ over all finite sequences of vocabulary elements:
$$P : \mathcal{V}^* \to [0, 1], \qquad \sum_{x \in \mathcal{V}^*} P(x) = 1.$$
By the chain rule of probability, any joint distribution over a sequence factorises into a product of conditionals,
$$P(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}).$$
This decomposition is exact and assumption-free; it merely rearranges the joint. Every concrete approach (n-gram models, RNN language models, Transformer language models) then proposes a particular parametric family for each conditional $P(w_t \mid w_{1:t-1})$. The whole game of language modelling is to find a flexible, tractable family for this next-token distribution.
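To make the factorisation concrete, here is a minimal sketch in Python: a sequence is scored by summing per-token conditional log-probabilities, where `cond_logprob` is a hypothetical stand-in for whatever family we choose.

```python
import math

# A minimal sketch of chain-rule scoring. `cond_logprob(target, prefix)`
# is a hypothetical stand-in for any model's next-token distribution:
# an n-gram table, an RNN, or a Transformer would all slot in here.

def sequence_logprob(tokens, cond_logprob):
    """log P(w_1, ..., w_T) = sum_t log P(w_t | w_1, ..., w_{t-1})."""
    total = 0.0
    for t in range(len(tokens)):
        total += cond_logprob(tokens[t], tokens[:t])
    return total

# Example: under a uniform conditional over a 4-word vocabulary, every
# length-T sequence scores T * log(1/4).
V = 4
uniform = lambda target, prefix: math.log(1.0 / V)
print(sequence_logprob(["a", "b", "b", "a"], uniform))  # == 4 * log(1/4)
```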
Two technical conventions matter. First, sequences usually carry explicit beginning-of-sequence ($\langle\text{BOS}\rangle$) and end-of-sequence ($\langle\text{EOS}\rangle$) tokens: $\langle\text{BOS}\rangle$ gives the first conditional something to condition on, and $\langle\text{EOS}\rangle$ makes "the sequence ends here" an event with well-defined probability, so the model can learn to terminate and the probabilities of finite sequences can sum to one. Second, causality: the conditional at position $t$ may depend only on positions $1, \ldots, t-1$. Architectures that violate causality (for example, bidirectional RNNs) compute representations useful for tagging or classification, but they do not define a generative language model in the chain-rule sense.
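As a small illustration of the $\langle\text{EOS}\rangle$ convention, the ancestral-sampling loop below terminates exactly when the model emits $\langle\text{EOS}\rangle$; the conditional `next_token_dist` is a toy placeholder, not any particular model.

```python
import random

# Ancestral sampling: the model controls its own sequence length by
# emitting <EOS>. `next_token_dist(prefix)` is a hypothetical callable
# returning a dict mapping each candidate token to its conditional
# probability given the prefix.

def sample_sequence(next_token_dist, max_len=100):
    tokens = ["<BOS>"]
    while len(tokens) < max_len:
        dist = next_token_dist(tokens)
        nxt = random.choices(list(dist), weights=list(dist.values()))[0]
        if nxt == "<EOS>":
            break
        tokens.append(nxt)
    return tokens[1:]  # strip <BOS>

# Toy conditional: after any prefix, emit "la" w.p. 0.7 or stop w.p. 0.3,
# giving geometrically distributed sequence lengths.
toy = lambda prefix: {"la": 0.7, "<EOS>": 0.3}
print(sample_sequence(toy))
```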
Given a corpus $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ of $N$ training sequences (sentences, documents), the standard training objective is maximum likelihood: maximise
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log P_\theta\left(x^{(i)}\right) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \log P_\theta\left(x^{(i)}_t \mid x^{(i)}_{<t}\right).$$
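In code, this objective is usually computed as an average cross-entropy over token positions. The sketch below assumes PyTorch and uses random logits and targets purely to illustrate the shapes; in practice the logits come from the model being trained.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: V is the vocabulary size, T the number of
# token positions in a batch of training text.
V, T = 1000, 16
logits = torch.randn(T, V)            # model scores for positions 1..T
targets = torch.randint(V, (T,))      # observed next tokens x_1, ..., x_T

# F.cross_entropy averages -log softmax(logits)_t[targets_t] over t,
# i.e. the per-token negative log-likelihood; minimising it maximises
# the objective above.
nll = F.cross_entropy(logits, targets)
print(nll.item())
```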
Maximising $\mathcal{L}(\theta)$ is, equivalently, minimising the average cross-entropy between the empirical token distribution and the model. The standard evaluation metric is perplexity:
$$\mathrm{PPL}(P_\theta; \mathcal{D}) = \exp\!\left( -\frac{1}{M} \sum_{i=1}^{N} \sum_{t=1}^{T_i} \log P_\theta\left(x^{(i)}_t \mid x^{(i)}_{<t}\right) \right),$$
where $M = \sum_i T_i$ is the total number of tokens. Perplexity is the exponential of the average per-token negative log-likelihood; that average is a cross-entropy, measured in nats for natural logs and bits for base-2 logs, and the base of the exponential must match the base of the logarithm (so $e^{\text{nats}} = 2^{\text{bits}}$ gives the same perplexity). A model with perplexity $K$ is, in an information-theoretic sense, "as confused as if it had to choose uniformly among $K$ alternatives at every step". Lower is better; the lower bound is the exponentiated entropy of the data-generating process. As a sanity check, a uniform distribution over $V$ tokens has perplexity exactly $V$.
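A few lines of Python make the sanity check concrete: feeding the definition a uniform model over $V$ tokens returns a perplexity of exactly $V$ (up to floating-point rounding). The helper below is a generic sketch, not tied to any particular model.

```python
import math

def perplexity(token_logprobs):
    """exp of the average per-token negative log-likelihood (natural logs)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Sanity check from the text: a uniform model over V tokens assigns
# log(1/V) to every token, so its perplexity is exactly V.
V, M = 50, 10_000
print(perplexity([math.log(1.0 / V)] * M))  # ~50.0
```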