Also known as: MLE
Maximum likelihood estimation (MLE) chooses model parameters $\theta$ to maximise the probability of observed data $\mathcal{D}$:
$$\hat\theta_\mathrm{MLE} = \arg\max_\theta P(\mathcal{D} \mid \theta).$$
For independent and identically distributed (i.i.d.) observations $\mathcal{D} = \{x_n\}_{n=1}^N$, the likelihood factorises: $P(\mathcal{D} \mid \theta) = \prod_n P(x_n \mid \theta)$. Working in log-space avoids numerical underflow:
$$\hat\theta_\mathrm{MLE} = \arg\max_\theta \sum_{n=1}^N \log P(x_n \mid \theta).$$
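For instance, a minimal NumPy sketch (with assumed synthetic Gaussian data; the names and sizes here are illustrative, not from the text) shows why the log turns an underflowing product into a well-behaved sum:

```python
import numpy as np

# Sketch with assumed synthetic data: the raw likelihood of many i.i.d.
# observations underflows to 0.0 in floating point, while the
# log-likelihood remains an ordinary finite number.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

likelihood = np.prod(gaussian_pdf(x, mu=2.0, sigma=1.0))             # 0.0 (underflow)
log_likelihood = np.sum(np.log(gaussian_pdf(x, mu=2.0, sigma=1.0)))  # roughly -1.4e4

print(likelihood, log_likelihood)
```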
The negative log-likelihood (NLL) loss $-\sum_n \log P(x_n \mid \theta)$ is the objective most supervised models minimise. Under standard regularity conditions, MLE is consistent (it converges to the true parameter as $N \to \infty$), asymptotically normal, and asymptotically efficient (it attains the Cramér–Rao lower bound on variance in the large-sample limit).
Examples:
- Gaussian with known variance: MLE of $\mu$ is the sample mean.
- Bernoulli/Binomial: MLE of $p$ is the sample proportion.
- Linear regression with Gaussian noise: MLE is ordinary least squares.
- Logistic regression: the MLE of the weights has no closed form and is found by gradient descent (see the sketch after this list).
- Language models: training with the next-token cross-entropy loss is exactly MLE on the token distribution.
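As a rough illustration of the closed-form and gradient-based cases above, here is a short sketch on synthetic data (all variable names, learning rate, and iteration counts are assumptions chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: the MLE of p is the sample proportion.
flips = (rng.random(500) < 0.3).astype(float)
p_hat = flips.mean()                        # closed-form MLE, close to 0.3

# Logistic regression: no closed form, so minimise the NLL by gradient descent.
N, D = 1_000, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(D)
lr = 0.1
for _ in range(2_000):
    p = 1 / (1 + np.exp(-X @ w))            # predicted P(y = 1 | x, w)
    grad = X.T @ (p - y) / N                # gradient of the mean NLL
    w -= lr * grad

print("Bernoulli MLE:", p_hat)
print("Logistic-regression MLE:", w)        # roughly recovers w_true
```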
MLE has well-known weaknesses:
- It can overfit with small samples; Bayesian methods with informative priors often generalise better.
- It yields only a point estimate with no uncertainty quantification, which a Bayesian posterior distribution provides.
- For some models (mixtures and other latent-variable models) the likelihood has multiple local maxima, so EM or other iterative techniques are needed.
Nearly every modern neural network training run is MLE in disguise: minimising cross-entropy on labels or the NLL of the next token is maximising the likelihood of the training data under the model.
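A tiny sketch of that equivalence, with made-up logits for a single three-class example:

```python
import numpy as np

# Made-up logits for one example of a 3-class classifier.
logits = np.array([2.0, -1.0, 0.5])
probs = np.exp(logits) / np.exp(logits).sum()     # softmax
label = 0                                         # observed class

one_hot = np.zeros(3)
one_hot[label] = 1.0
cross_entropy = -(one_hot * np.log(probs)).sum()  # cross-entropy with one-hot target
nll = -np.log(probs[label])                       # NLL of the observed class

assert np.isclose(cross_entropy, nll)             # the two losses coincide
```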
Related terms: Bayes' Theorem, Cross-Entropy Loss, Logistic Regression, MAP Estimation
Discussed in:
- Chapter 5: Statistics, Maximum Likelihood Estimation
- Chapter 5: Statistics, Statistics