Also known as: MLE
Maximum likelihood estimation (MLE) chooses model parameters $\theta$ to maximise the probability of observed data $\mathcal{D}$:
$$\hat\theta_\mathrm{MLE} = \arg\max_\theta P(\mathcal{D} \mid \theta).$$
For independent and identically distributed (i.i.d.) observations $\mathcal{D} = \{x_n\}_{n=1}^N$, the likelihood factorises: $P(\mathcal{D} \mid \theta) = \prod_n P(x_n \mid \theta)$. Working in log-space avoids numerical underflow:
$$\hat\theta_\mathrm{MLE} = \arg\max_\theta \sum_{n=1}^N \log P(x_n \mid \theta).$$
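For instance, a minimal NumPy sketch (with assumed synthetic Gaussian data; the names and sizes here are illustrative, not from the text) shows why the log turns an underflowing product into a well-behaved sum:

```python
import numpy as np

# Sketch with assumed synthetic data: the raw likelihood of many i.i.d.
# observations underflows to 0.0 in floating point, while the
# log-likelihood remains an ordinary finite number.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=10_000)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

likelihood = np.prod(gaussian_pdf(x, mu=2.0, sigma=1.0))             # 0.0 (underflow)
log_likelihood = np.sum(np.log(gaussian_pdf(x, mu=2.0, sigma=1.0)))  # roughly -1.4e4

print(likelihood, log_likelihood)
```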
The negative log-likelihood (NLL) loss $-\sum_n \log P(x_n \mid \theta)$ is the objective most supervised models minimise. Under standard regularity conditions, MLE is consistent (it converges to the true parameter as $N \to \infty$), asymptotically normal, and asymptotically efficient (it attains the Cramér–Rao lower bound on variance in the large-sample limit).
Examples:
- Gaussian with known variance: MLE of $\mu$ is the sample mean.
- Bernoulli/Binomial: MLE of $p$ is the sample proportion.
- Linear regression with Gaussian noise: MLE is ordinary least squares.
- Logistic regression: the MLE of the weights has no closed form and is found by gradient descent (see the sketch after this list).
- Language models: training with the next-token cross-entropy loss is exactly MLE on the token distribution.
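As a rough illustration of the closed-form and gradient-based cases above, here is a short sketch on synthetic data (all variable names, learning rate, and iteration counts are assumptions chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: the MLE of p is the sample proportion.
flips = (rng.random(500) < 0.3).astype(float)
p_hat = flips.mean()                        # closed-form MLE, close to 0.3

# Logistic regression: no closed form, so minimise the NLL by gradient descent.
N, D = 1_000, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

w = np.zeros(D)
lr = 0.1
for _ in range(2_000):
    p = 1 / (1 + np.exp(-X @ w))            # predicted P(y = 1 | x, w)
    grad = X.T @ (p - y) / N                # gradient of the mean NLL
    w -= lr * grad

print("Bernoulli MLE:", p_hat)
print("Logistic-regression MLE:", w)        # roughly recovers w_true
```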
MLE has well-known weaknesses:
- It can overfit with small samples; Bayesian methods with informative priors often generalise better.
- It yields only a point estimate with no uncertainty quantification, which a Bayesian posterior distribution provides.
- For some models (mixtures and other latent-variable models) the likelihood has multiple local maxima, so EM or other iterative techniques are needed.
Nearly every modern neural network training run is MLE in disguise: minimising cross-entropy on labels or the NLL of the next token is maximising the likelihood of the training data under the model.
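A tiny sketch of that equivalence, with made-up logits for a single three-class example:

```python
import numpy as np

# Made-up logits for one example of a 3-class classifier.
logits = np.array([2.0, -1.0, 0.5])
probs = np.exp(logits) / np.exp(logits).sum()     # softmax
label = 0                                         # observed class

one_hot = np.zeros(3)
one_hot[label] = 1.0
cross_entropy = -(one_hot * np.log(probs)).sum()  # cross-entropy with one-hot target
nll = -np.log(probs[label])                       # NLL of the observed class

assert np.isclose(cross_entropy, nll)             # the two losses coincide
```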
Related terms: Bayes' Theorem, Cross-Entropy Loss, Logistic Regression, MAP Estimation
Discussed in:
- Chapter 5: Statistics, Maximum Likelihood Estimation
- Chapter 5: Statistics, Statistics