A Language Model assigns a probability to a sequence of tokens. Given a sequence $w_1, w_2, \ldots, w_T$, the model estimates $P(w_1, \ldots, w_T) = \prod_{t=1}^T P(w_t \mid w_1, \ldots, w_{t-1})$, factorising via the chain rule of probability. This autoregressive factorisation is the foundation: the model is trained to predict the next token given all preceding tokens, and the total sequence probability is the product of conditionals.
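The chain-rule factorisation can be sketched with a toy model whose conditional distributions are stored in a lookup table (the tokens and probability values below are hypothetical, chosen only to make the product concrete):

```python
import math

def sequence_log_prob(cond_prob, tokens):
    """Chain rule: log P(w_1..w_T) = sum_t log P(w_t | w_1..w_{t-1}).

    `cond_prob(context, token)` is any callable returning
    P(token | context); a real LM would be a neural network here.
    """
    total = 0.0
    for t, tok in enumerate(tokens):
        total += math.log(cond_prob(tuple(tokens[:t]), tok))
    return total

# Hypothetical conditional distributions, purely illustrative.
TABLE = {
    (): {"the": 0.6, "cat": 0.4},
    ("the",): {"cat": 0.7, "the": 0.3},
    ("the", "cat"): {"sat": 1.0},
}

def toy_cond_prob(context, token):
    return TABLE[context][token]

lp = sequence_log_prob(toy_cond_prob, ["the", "cat", "sat"])
# Product of conditionals: 0.6 * 0.7 * 1.0 = 0.42
print(round(math.exp(lp), 2))  # 0.42
```

Working in log space, as above, is standard practice: multiplying many small probabilities underflows floating-point arithmetic, whereas summing log-probabilities does not.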
Early language models used n-gram statistics: the probability of a word given the previous $n-1$ words, estimated from corpus counts with smoothing techniques like Kneser-Ney. Neural language models (Bengio et al., 2003) replaced count tables with neural networks that exploit the similarity structure of word embeddings, generalising to unseen contexts. RNN language models removed the fixed-context limitation by using a hidden state to maintain an unbounded history. Transformer language models (GPT family) scale to billions of parameters and achieve dramatic improvements through parallel training and attention.
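Count-based estimation can be sketched with a bigram model; add-one (Laplace) smoothing stands in here for Kneser-Ney, which is more involved but follows the same count-and-normalise pattern (the corpus is a hypothetical toy example):

```python
from collections import Counter

# Toy corpus; a real n-gram model would be estimated from billions of tokens.
corpus = "the cat sat on the mat the cat ran".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus)

def p_bigram(word, prev):
    # Add-one smoothing: P(word | prev) = (count(prev, word) + 1)
    #                                     / (count(prev) + |V|)
    # Smoothing reserves probability mass for unseen bigrams.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(vocab))

# count("the cat") = 2, count("the") = 3, |V| = 6  →  (2+1)/(3+6) = 1/3
print(p_bigram("cat", "the"))
```

The fixed-context limitation is visible here: the model conditions on exactly one previous word, and any longer-range dependency is invisible to it.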
Modern Large Language Models (LLMs) trained on trillions of tokens have exhibited emergent capabilities: in-context learning, chain-of-thought reasoning, few-shot adaptation. The scaling laws of Kaplan et al. (2020) and Hoffmann et al. (2022, the "Chinchilla" paper) showed that test loss falls as a power law in model size, data, and compute. Instruction tuning and RLHF transform raw language models into conversational assistants. The remarkable fact is that the core task—predict the next token—has remained unchanged from Bengio's 2003 model, yet scaling has transformed language modelling from a specialised technique into a general reasoning engine.
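One way to write the parametric loss form fitted in the Chinchilla paper, for a model with $N$ parameters trained on $D$ tokens (the symbol names follow Hoffmann et al.; $E$, $A$, $B$, $\alpha$, $\beta$ are fitted constants):

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

Here $E$ is the irreducible loss of the data distribution, and the two power-law terms capture the deficit from finite model size and finite data respectively. Minimising $L$ under a fixed compute budget is what yields the paper's prescription that parameters and training tokens should be scaled together.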
Related terms: Large Language Model, GPT, BERT, Recurrent Neural Network, Transformer
Discussed in:
- Chapter 12: Sequence Models — Language Models
Also defined in: Textbook of AI