5.5 Maximum Likelihood Estimation
Maximum likelihood estimation, almost universally abbreviated to MLE, is the workhorse of modern statistics and machine learning. The idea is simple enough to state in one sentence: given a parametric model and some observed data, the MLE picks the parameter values that make the observed data most probable. That single sentence quietly underwrites a great deal of what we do when we train a model. Nearly every standard machine-learning loss function (mean squared error for regression, cross-entropy for classification, the Kullback–Leibler divergence used in variational inference and reinforcement learning, even the noise-prediction loss in diffusion models) turns out to be a negative log-likelihood under some assumed probabilistic model. Once you see MLE clearly, a long list of seemingly unrelated algorithms collapses into one principle.
This section zooms in on the most important specific estimator from §5.4. §5.6 then turns to the Bayesian alternative.
We start with the principle, work through concrete examples (Bernoulli, Gaussian, linear regression, logistic regression) until the pattern is clear, then turn to MLE's asymptotic properties and its failure modes (small samples, misspecification, non-convex landscapes).
The principle
Suppose we have a parametric family of distributions indexed by $\theta$, and a dataset $\mathcal{D} = \{x_1, \ldots, x_n\}$ assumed to be drawn independently from one member of that family. The likelihood of $\theta$ is the probability (or density) the model assigns to the observed data, viewed as a function of $\theta$: $$L(\theta) = p(\mathcal{D} \mid \theta) = \prod_{i=1}^n p(x_i \mid \theta).$$
Note the change of perspective. The same expression $p(x \mid \theta)$ can be read two ways. As a function of $x$ with $\theta$ fixed, it is a probability distribution that integrates (or sums) to one. As a function of $\theta$ with $x$ fixed, it is the likelihood, and it does not generally integrate to one over $\theta$, because likelihood is not a probability distribution over parameters. It is simply a measure of how well each candidate $\theta$ explains what we saw.
The MLE is the $\theta$ that makes the data as probable as possible: $$\hat\theta_{\text{MLE}} = \arg\max_\theta \log p(\mathcal{D} \mid \theta) = \arg\max_\theta \sum_i \log p(x_i \mid \theta).$$
We work with the log-likelihood $\ell(\theta) = \log L(\theta)$ for three reasons. First, products of many small probabilities underflow to zero in floating-point arithmetic; sums of log-probabilities behave gracefully. Second, sums are easier to differentiate than products, so the score equations $\partial \ell/\partial \theta = 0$ are tractable. Third, because the logarithm is strictly increasing, the maximiser of $L$ is also the maximiser of $\ell$, so we lose nothing.
A small but important rephrasing: minimising the negative log-likelihood (NLL) is the same as maximising the likelihood. Optimisation libraries minimise by convention, so in practice you will almost always see code that computes an NLL and hands it to a minimiser. Whenever you write loss = -log_prob(...) in a deep-learning framework, you are doing MLE.
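To make "compute an NLL and hand it to a minimiser" concrete, here is a minimal sketch in NumPy and SciPy. The model (an exponential waiting-time distribution) and all constants are illustrative choices, not part of the worked examples below; the point is only that the numerical minimiser recovers the closed-form MLE, here $1/\bar X$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Illustrative data: waiting times modelled as Exponential(rate), true rate 2.5.
rng = np.random.default_rng(0)
data = rng.exponential(scale=1 / 2.5, size=1_000)

def nll(rate):
    # Negative log-likelihood of the Exponential(rate) model:
    # log p(x | rate) = log(rate) - rate * x, summed over the data and negated.
    return -(len(data) * np.log(rate) - rate * data.sum())

result = minimize_scalar(nll, bounds=(1e-6, 100.0), method="bounded")
print(result.x)          # numerical MLE of the rate
print(1 / data.mean())   # closed-form MLE, 1 / (sample mean): the two agree
```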
It is worth pausing to notice what the principle does not assume. It does not assume the data are Gaussian, or even continuous. It does not assume the parameter is a single number. It does not assume the model is correct, only that we have decided to fit one specific family of distributions. And it does not require any prior beliefs about $\theta$, which is precisely the philosophical line dividing frequentist MLE from the Bayesian approach we meet in §5.6. All MLE asks for is a model and a dataset, and it returns a single point estimate.
Worked example: Bernoulli MLE
Suppose we toss a coin $n$ times and observe $k$ heads. We model each toss as $X_i \sim \text{Bernoulli}(\theta)$, where $\theta$ is the unknown probability of heads. The probability mass function is $p(x \mid \theta) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$, so the log-likelihood is $$\ell(\theta) = \sum_i \big[ x_i \log\theta + (1-x_i)\log(1-\theta) \big] = k\log\theta + (n-k)\log(1-\theta).$$
To find the maximum we differentiate, set the derivative to zero, and solve: $$\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0.$$
Multiplying through by $\theta(1-\theta)$ gives $k(1-\theta) - (n-k)\theta = 0$, which simplifies to $k - n\theta = 0$, so $$\hat\theta_{\text{MLE}} = \frac{k}{n}.$$
The MLE is exactly the empirical fraction of successes. That is reassuring: the principled answer agrees with the answer your intuition would have given you. To check it really is a maximum (and not a saddle or minimum), compute the second derivative: $$\frac{d^2\ell}{d\theta^2} = -\frac{k}{\theta^2} - \frac{n-k}{(1-\theta)^2} < 0.$$
The second derivative is strictly negative whenever $0 < \theta < 1$, so the log-likelihood is concave and the critical point is a unique global maximum. Concavity is a recurring blessing in MLE problems for exponential-family distributions: when it holds, optimisation is easy and there is exactly one answer.
The Fisher information for a single Bernoulli observation works out to $I(\theta) = 1/[\theta(1-\theta)]$, which means the asymptotic variance of $\hat\theta$ is $\theta(1-\theta)/n$, exactly what you would calculate by treating $k/n$ as a sample mean of $n$ Bernoulli random variables. The likelihood machinery has reproduced the answer you would have obtained by elementary means, but it generalises far beyond cases where elementary methods would suffice.
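A quick simulation makes the variance claim tangible. This is a sketch assuming NumPy; the true $\theta$, the sample size and the number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, n, reps = 0.3, 200, 10_000

# Simulate many datasets of n tosses, compute the MLE k/n for each,
# and compare its empirical variance to the asymptotic value theta*(1-theta)/n.
k = rng.binomial(n, theta_true, size=reps)
theta_hat = k / n

print(theta_hat.var())                    # empirical variance of the MLE
print(theta_true * (1 - theta_true) / n)  # 1 / (n * I(theta)): should be close
```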
This same calculation, scaled up to many parameters, is the engine behind n-gram language models without smoothing, naive Bayes classifiers and the categorical components of latent Dirichlet allocation: count occurrences, divide by the total, done.
It is also worth flagging what happens at the boundaries. If $k = 0$ the MLE is $\hat\theta = 0$, which says future successes are impossible, not a claim ten coin tosses can really license. If $k = n$ the MLE is $\hat\theta = 1$, similarly extreme. These pathologies foreshadow the discussion of small-sample failure modes later in this section, and motivate the Bayesian alternative in §5.6, where placing a Beta prior on $\theta$ pulls the estimate gently away from the boundary.
Worked example: Gaussian MLE
Now consider $X_1, \ldots, X_n \sim \mathcal{N}(\mu, \sigma^2)$ with both the mean and the variance unknown. The Gaussian density is $p(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp(-(x-\mu)^2/(2\sigma^2))$, so the log-likelihood is $$\ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2.$$
We have two parameters, so we need two partial derivatives. For the mean, $$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_i (x_i - \mu) = 0 \implies \hat\mu = \frac{1}{n}\sum_i x_i = \bar X.$$
The MLE for the mean is the sample mean: once again, the obvious answer.
For the variance, treating $\sigma^2$ (not $\sigma$) as the parameter, $$\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (x_i - \hat\mu)^2 = 0,$$ which rearranges to $$\hat\sigma^2_{\text{MLE}} = \frac{1}{n}\sum_i (x_i - \hat\mu)^2.$$
This is the biased sample variance. The unbiased estimator that you may have met first divides by $n-1$ instead of $n$, the so-called Bessel correction. The MLE knows nothing about Bessel; the $1/n$ falls out of the algebra. The bias arises because we used $\hat\mu$ rather than the true $\mu$ in the sum of squared deviations, and replacing $\mu$ with its sample estimate slightly underestimates the spread on average. For large $n$ the difference between $1/n$ and $1/(n-1)$ vanishes, so the MLE is asymptotically unbiased, but for small samples it is a real effect.
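The bias is easy to see in simulation. A sketch (NumPy only; the constants are arbitrary): with $n = 10$ the MLE underestimates $\sigma^2$ by roughly the factor $(n-1)/n$, while the Bessel-corrected estimator is unbiased.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 0.0, 4.0, 10, 100_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
var_mle = x.var(axis=1, ddof=0)      # divide by n   (the MLE)
var_bessel = x.var(axis=1, ddof=1)   # divide by n-1 (Bessel correction)

print(var_mle.mean())     # roughly sigma2 * (n-1)/n = 3.6, biased low
print(var_bessel.mean())  # roughly sigma2 = 4.0, unbiased
```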
This Gaussian calculation hides a clue that becomes loud once you see it. Because $\ell$ depends on $(x_i - \mu)^2$ summed across observations, maximising the Gaussian log-likelihood with $\sigma^2$ fixed is identical to minimising the sum of squared deviations. Squared error and Gaussian likelihood are two views of the same object. That equivalence is the bridge to the next two examples.
A practical note: in code we usually parameterise the model by $\log \sigma$ rather than $\sigma$ or $\sigma^2$, so that the optimiser can move freely across the real line without ever proposing a negative or zero variance. This trick, reparameterising a constrained parameter to live on $\mathbb{R}$, is ubiquitous in modern machine learning, and it grows directly out of likelihood thinking. The same logic underlies the softmax (which reparameterises probabilities as unconstrained logits), the Cholesky parameterisation of covariance matrices and the bounded-action squashing used in continuous-control reinforcement learning.
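Here is a sketch of that reparameterisation for the Gaussian example, using NumPy and SciPy on synthetic data. The optimiser works on unconstrained $(\mu, \log\sigma)$, and the answer matches the closed-form MLEs: the sample mean and the (biased) sample standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(loc=1.5, scale=0.7, size=500)

def nll(params):
    # Parameterise by (mu, log_sigma) so the optimiser roams all of R^2;
    # exponentiating guarantees sigma > 0 without explicit constraints.
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu) ** 2 / (2 * sigma**2))

result = minimize(nll, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)          # numerical MLE
print(x.mean(), x.std(ddof=0))    # closed form: sample mean, biased sample sd
```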
Worked example: linear regression as MLE
Linear regression is usually introduced as "minimise the sum of squared residuals". Likelihood thinking lets us derive that loss rather than just postulate it. Suppose we observe pairs $(\mathbf{x}_i, y_i)$ and model $$y_i = \mathbf{w}^\top \mathbf{x}_i + \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2),$$ with the noise terms independent across observations. Conditional on $\mathbf{x}_i$, the response $y_i$ is therefore $\mathcal{N}(\mathbf{w}^\top \mathbf{x}_i, \sigma^2)$. The likelihood is the product of Gaussian densities, $$L(\mathbf{w}, \sigma^2) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2} \right),$$ and the negative log-likelihood, dropping additive constants that do not depend on $\mathbf{w}$, is $$-\ell(\mathbf{w}) = \frac{1}{2\sigma^2}\sum_i (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 + \text{const}.$$
Minimising this in $\mathbf{w}$ is ordinary least-squares regression. The factor $1/(2\sigma^2)$ is just a positive scalar; it does not affect the location of the minimum. So OLS is exactly MLE under the assumption that errors are independent, mean-zero and Gaussian with constant variance. The square in "least squares" comes from the square in the exponent of the Gaussian density, and the Gaussian itself earns its privileged status as the maximum-entropy distribution with a given mean and variance. The chain runs: Gaussian noise $\Rightarrow$ squared loss $\Rightarrow$ OLS.
The same trick goes further. If you suspect heavy-tailed noise, swap the Gaussian likelihood for a Student-$t$ or Laplace likelihood; you obtain robust regression and the absolute-error (median) regression respectively. If you want quantile regression, use an asymmetric Laplace likelihood. If you want heteroscedastic regression where the noise variance depends on $\mathbf{x}$, predict $\log\sigma(\mathbf{x})$ alongside the mean and use the full Gaussian NLL. Every regression loss you have ever seen is a likelihood under some assumption about the noise.
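To see the swap in action, here is a sketch (NumPy and SciPy, synthetic data with a few injected outliers; not a production robust-regression recipe). Changing one line, the assumed noise model, turns squared error into absolute error and makes the fit far less sensitive to the outliers.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)   # true slope 2, intercept 1
y[:5] += 20.0                                      # a few gross outliers

X = np.column_stack([x, np.ones_like(x)])          # [slope, intercept] design

def gaussian_nll(w):
    # Squared error: the NLL under Gaussian noise with fixed variance.
    return np.sum((y - X @ w) ** 2)

def laplace_nll(w):
    # Absolute error: the NLL under Laplace noise (median regression).
    return np.sum(np.abs(y - X @ w))

w_gauss = minimize(gaussian_nll, x0=np.zeros(2)).x
w_laplace = minimize(laplace_nll, x0=np.zeros(2), method="Powell").x  # derivative-free
print(w_gauss)    # pulled noticeably toward the outliers
print(w_laplace)  # should stay close to the true slope 2 and intercept 1
```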
For the homoscedastic Gaussian case, the OLS objective has a closed-form solution: $\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$, where $X$ is the design matrix. That this minimiser exists in closed form is a happy accident of the Gaussian. For most other likelihoods we must solve the optimisation numerically. The closed form is also a numerical hazard: forming $X^\top X$ explicitly squares the condition number of $X$, so practitioners typically prefer to solve the system via QR decomposition or, for very large or sparse problems, by gradient descent, which is itself just iteratively reducing the negative log-likelihood.
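A sketch of the two solution routes, in NumPy with a synthetic, well-conditioned design matrix (so both succeed here; the point is the difference in how they get there).

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(0, 0.1, size=200)

# Normal equations: the closed form, but it forms X^T X explicitly,
# squaring the condition number of X.
w_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)

# QR route: factor X = QR and solve the triangular system R w = Q^T y.
Q, R = np.linalg.qr(X)
w_qr = np.linalg.solve(R, Q.T @ y)

print(w_normal_eq)
print(w_qr)   # agrees to floating-point precision; better conditioned in general
```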
Worked example: logistic regression as MLE
The classification analogue uses a Bernoulli rather than a Gaussian likelihood. Each $y_i \in \{0, 1\}$ is modelled as Bernoulli with success probability $$P(y_i = 1 \mid \mathbf{x}_i) = \sigma(\mathbf{w}^\top \mathbf{x}_i), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.$$
The sigmoid squashes the real-valued score $\mathbf{w}^\top \mathbf{x}_i$ into a probability. The Bernoulli log-likelihood for a single observation is $y_i \log p_i + (1-y_i)\log(1-p_i)$, where $p_i = \sigma(\mathbf{w}^\top \mathbf{x}_i)$. Summing over the dataset and negating gives $$-\ell(\mathbf{w}) = -\sum_i \big[ y_i \log \sigma(z_i) + (1-y_i) \log(1 - \sigma(z_i)) \big],$$ with $z_i = \mathbf{w}^\top \mathbf{x}_i$. This expression is binary cross-entropy. Logistic regression and binary cross-entropy minimisation are not "similar approaches": they are exactly the same computation. The same equivalence extends to multi-class problems by replacing the sigmoid with the softmax and the Bernoulli with a categorical distribution; the resulting NLL is multi-class cross-entropy, and the resulting model is multinomial logistic regression. Inside every classifier with a softmax output and a cross-entropy loss, an MLE for a categorical likelihood is being computed.
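A sketch of logistic regression fitted directly as an MLE (NumPy and SciPy, synthetic data): the objective written out below is the Bernoulli NLL, which is character for character the binary cross-entropy.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit   # numerically stable sigmoid

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 2))
w_true = np.array([1.5, -2.0])
y = rng.binomial(1, expit(X @ w_true))

def bernoulli_nll(w):
    # Binary cross-entropy, i.e. the negative Bernoulli log-likelihood.
    p = expit(X @ w)
    eps = 1e-12   # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

w_hat = minimize(bernoulli_nll, x0=np.zeros(2)).x
print(w_hat)   # should land near the true weights [1.5, -2.0]
```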
Unlike the Gaussian case, there is no closed-form solution for the logistic-regression MLE. The log-likelihood is concave (good: there is a unique maximum) but the optimum has no analytic expression. We solve numerically, typically by Newton–Raphson, which for GLMs takes the form known as iteratively reweighted least squares, or, when the parameter count or dataset size is large, by stochastic gradient descent and its variants.
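For a concrete picture of the Newton–Raphson route, here is a bare-bones IRLS sketch (NumPy and SciPy only): no regularisation, no convergence test, and no safeguard against separated data, under which the MLE diverges. On data like the synthetic example above, logistic_irls(X, y) should reach the same $\hat{\mathbf{w}}$ as the generic minimiser in a handful of iterations.

```python
import numpy as np
from scipy.special import expit

def logistic_irls(X, y, n_iter=25):
    """Newton-Raphson / IRLS for logistic regression (illustrative sketch)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(X @ w)                        # current predicted probabilities
        weights = p * (1 - p)                   # diagonal of the weight matrix
        gradient = X.T @ (y - p)                # score: gradient of the log-likelihood
        hessian = X.T @ (X * weights[:, None])  # X^T diag(weights) X
        w = w + np.linalg.solve(hessian, gradient)   # Newton step
    return w
```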
The connection generalises further still. Replace the linear scorer $\mathbf{w}^\top \mathbf{x}$ with a deep neural network $f(\mathbf{x}; \theta)$ and you obtain a deep classifier trained by cross-entropy. Replace the Gaussian conditional mean with a deep network and you obtain a neural regression model trained by MSE. Replace the categorical with a Gaussian over the noise injected at each timestep of a forward diffusion process and you obtain the simplified DDPM training loss. Modern model design is, to an unsettling degree, just likelihood design.
A useful diagnostic, when you meet a new loss in a paper, is to ask: what is the implied probability model? If the loss is a sum of squared errors, the model is Gaussian with constant variance. If the loss is a sum of absolute errors, the model is Laplace. If the loss is a hinge, you are no longer doing MLE in the strict sense (the hinge is not a log-likelihood for any distribution) but a closely related M-estimation problem. Reading losses through this lens makes a great deal of the deep-learning literature less mysterious.
Properties of MLE
Why is MLE the default? Because, under mild regularity conditions, it has a list of large-sample properties no other estimator can beat:
- Consistent: $\hat\theta_{\text{MLE}} \to \theta_0$ in probability as $n \to \infty$. Given enough data, the MLE converges to the truth.
- Asymptotically efficient: the variance of the MLE attains, in the large-sample limit, the Cramér–Rao lower bound, the smallest variance any unbiased estimator can achieve. No other well-behaved estimator squeezes more information out of the data in the limit.
- Asymptotically normal: $\sqrt{n}(\hat\theta_{\text{MLE}} - \theta_0) \to \mathcal{N}(0, I(\theta_0)^{-1})$, where $I(\theta_0)$ is the Fisher information. This gives us standard errors and confidence intervals essentially for free.
- Invariant under reparameterisation: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$ for any function $g$. Want the MLE of the variance? Square the MLE of the standard deviation. Want the MLE of the odds? Apply $\theta/(1-\theta)$ to the MLE of the probability.
These guarantees explain why MLE is the gold standard for large samples. The asymptotic normality result is the basis of the Wald confidence intervals and Wald tests reported by virtually every statistical software package. The efficiency property is what makes deep-learning practitioners reach for likelihood-based losses without thinking twice.
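Here is the asymptotic-normality claim in action, as a simulation sketch (NumPy; the Bernoulli setup mirrors the earlier coin-tossing example). Wald intervals built from the inverse Fisher information cover the true parameter at close to the nominal 95% rate when $n$ is reasonably large.

```python
import numpy as np

rng = np.random.default_rng(7)
theta_true, n, reps = 0.4, 500, 20_000

k = rng.binomial(n, theta_true, size=reps)
theta_hat = k / n

# Wald 95% interval: MLE +/- 1.96 standard errors, with the standard error
# taken from the inverse Fisher information evaluated at the MLE.
se = np.sqrt(theta_hat * (1 - theta_hat) / n)
covered = (theta_hat - 1.96 * se <= theta_true) & (theta_true <= theta_hat + 1.96 * se)
print(covered.mean())   # close to 0.95 here; degrades for small n or extreme theta
```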
A caveat: these are asymptotic properties. They describe what happens as $n$ grows. For small samples and high-dimensional models, the regime in which much of modern machine learning operates, the asymptotics may not have kicked in, and other approaches (regularisation, Bayesian inference, cross-validated model selection) may do better in practice. We will see this caveat sharpen in §5.6 and again in §5.16 on the bias–variance tradeoff.
Limitations
For all its strengths, MLE has well-known failure modes worth flagging.
- Overfitting in small samples. Bernoulli with zero successes in ten trials gives $\hat\theta = 0$, predicting that future successes are impossible. Multinomial counts with a zero in some cell give an MLE that assigns zero probability to that outcome, catastrophic for an n-gram language model, which would assign probability zero to any sentence containing an unseen word pair. Smoothing (add-one, add-$\alpha$, Kneser–Ney) is the practical fix; from a likelihood perspective, the additive schemes are just MAP estimates under a Dirichlet prior.
- Model misspecification. If the assumed family does not contain the true distribution, the MLE converges not to the truth (which lives outside the family) but to the member of the family closest in Kullback–Leibler divergence to the truth. The estimator may be consistent for a parameter, but the parameter no longer means what we want it to. Every parametric model is an approximation, so this caveat is universal. The sandwich estimator of the asymptotic variance is the standard correction.
- Non-convex log-likelihoods. For neural networks and mixture models the log-likelihood is not concave; it has many local maxima, saddle points and plateaus. We never find the global MLE; we find a local stationary point and hope it generalises. Modern deep learning lives entirely in this regime, and a great deal of empirical lore concerns which local optima are reachable and useful.
- No uncertainty quantification. The MLE is a single point, not a distribution. The Fisher-information-based standard errors give one-sigma error bars under asymptotic normality, but they tell us nothing about parameters that are weakly identified or about the joint shape of plausible parameter regions for small samples. §5.6 turns to the Bayesian framework, which addresses this directly by treating the parameter itself as a random variable.
What you should take away
- MLE picks the parameters under which the observed data are most probable. That single principle generates the workhorse estimator of modern machine learning.
- We always optimise the log-likelihood, equivalently the negative log-likelihood. Logs prevent underflow, simplify derivatives and preserve the optimum; minimising NLL and maximising likelihood are the same problem.
- Almost every standard ML loss is a negative log-likelihood. MSE is Gaussian NLL with fixed variance; binary cross-entropy is Bernoulli NLL; categorical cross-entropy is multinomial NLL. Choosing a loss is choosing a noise model.
- OLS is MLE under Gaussian errors; logistic regression is MLE under a Bernoulli model. These are not loose analogies but exact identities, and they generalise from linear models to deep networks unchanged.
- MLE is asymptotically optimal, not unconditionally optimal. It is consistent, efficient and normally distributed in large samples, but small-sample bias, model misspecification, non-convex landscapes and the absence of uncertainty quantification motivate the regularised, Bayesian and resampling-based alternatives covered in the rest of the chapter.