4.12 Maximum likelihood and Bayesian inference (preview)
Almost every model in modern AI begins with the same question: given the data we have observed, what should we believe about the parameters of our model? A neural network has millions or billions of weights; a logistic regression has a handful of coefficients; a Gaussian has a mean and a variance. In each case the data are fixed (we have already collected them) and the parameters are unknown (we want to estimate them). Inference is the discipline of turning data into beliefs about parameters.
There are two dominant ways of doing this, and they answer the question in different idioms. Maximum likelihood estimation (MLE), the workhorse of frequentist statistics, asks: which parameter value would make the data we observed most plausible? It returns a single best guess. Bayesian inference asks: given a prior belief about the parameters and the evidence in the data, what is the full posterior distribution over the parameters? It returns a distribution, not a point. Most modern AI systems use MLE-style point estimation because it is cheap and scales to billions of parameters; Bayesian methods appear when we need calibrated uncertainty, generative modelling, or probabilistic programming.
This section is a preview. We met Bayes' rule already in §4.3 as a way of updating beliefs about events. Here we promote it from a rule about events to a framework for inference about parameters. Chapter 5 develops both paradigms in full statistical detail; this section gives you the conceptual scaffolding and one fully worked example so the chapter-5 material lands on prepared ground.
Maximum likelihood (MLE)
Suppose we have a model with parameters $\theta$ that assigns a probability $p(x \mid \theta)$ to every possible observation $x$. If we have collected independent and identically distributed data $\mathcal{D} = \{x_1, x_2, \ldots, x_n\}$, the joint probability of the entire dataset under the model is the product $$p(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$ Read as a function of $\theta$ with $\mathcal{D}$ fixed, this is called the likelihood. The maximum-likelihood estimate is the parameter value that makes the likelihood as large as possible, $$\hat\theta_{\text{MLE}} = \arg\max_\theta p(\mathcal{D} \mid \theta) = \arg\max_\theta \sum_{i=1}^{n} \log p(x_i \mid \theta).$$ Because the logarithm is monotonic, maximising the log-likelihood is equivalent to maximising the likelihood, and the sum form is easier to differentiate than the product form. In practice we either solve the equation $\nabla_\theta \log p(\mathcal{D} \mid \theta) = 0$ analytically when the model is simple, or run gradient ascent (equivalently, gradient descent on the negative log-likelihood) when it is not. Training a neural network with cross-entropy loss is exactly maximum-likelihood estimation in disguise: the cross-entropy loss is the negative log-likelihood of a categorical model.
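The recipe is mechanical enough to show in a few lines. Below is a minimal sketch, assuming a Gaussian model and using scipy's general-purpose optimiser on the negative log-likelihood; the synthetic data, starting point, and log-scale parameterisation of the standard deviation are illustrative choices, not part of the text's development.

```python
# A minimal sketch of MLE as optimisation: fit a Gaussian's mean and standard
# deviation by minimising the negative log-likelihood of i.i.d. data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=500)   # pretend these are our observations

def negative_log_likelihood(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                      # parameterise sigma > 0
    # Gaussian log-density, summed over the data, then negated
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu)**2 / (2 * sigma**2))

result = minimize(negative_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)          # close to the closed-form Gaussian MLEs below
print(data.mean(), data.std())    # sample mean and (ddof=0) sample std
```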
Worked example: Bernoulli MLE. A coin lands heads with unknown probability $\theta$. We toss it $n$ times and observe $k$ heads. Each toss is independent, so the joint probability of the observed sequence is the product of $\theta$ for every head and $(1-\theta)$ for every tail, which gives the likelihood $\theta^k (1-\theta)^{n-k}$. The log-likelihood is therefore $$\ell(\theta) = k \log \theta + (n-k) \log(1-\theta).$$ Differentiating and setting to zero, $$\frac{d\ell}{d\theta} = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0,$$ which rearranges to $\hat\theta_{\text{MLE}} = k/n$. The MLE is the empirical fraction of heads, the most natural estimate you could imagine. This is reassuring: when the model is right and we have enough data, MLE recovers what common sense would suggest. It is also illustrative of a wider pattern. The Gaussian MLE for the mean is the sample mean $\bar{x}$; the Poisson MLE for the rate is the sample mean of the counts; the categorical MLE for each class probability is the empirical class frequency. Maximum likelihood, applied to simple models, recovers the empirical statistics that practitioners would write down by intuition. Where MLE earns its keep is in models too complex for intuition: neural networks, hidden Markov models, mixture models, where the same recipe still applies but the maximisation now needs a numerical optimiser.
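A quick numerical check of the derivation, with illustrative numbers (say $k=7$ heads in $n=10$ tosses): evaluating the log-likelihood on a grid and taking the argmax recovers $k/n$.

```python
# Numerical check of the Bernoulli MLE: the log-likelihood
# k*log(theta) + (n-k)*log(1-theta) peaks at theta = k/n.
import numpy as np

k, n = 7, 10
theta_grid = np.linspace(0.001, 0.999, 10_000)
log_lik = k * np.log(theta_grid) + (n - k) * np.log(1 - theta_grid)

print(theta_grid[np.argmax(log_lik)])  # ~0.7
print(k / n)                           # 0.7, the closed-form MLE
```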
Maximum a posteriori (MAP)
MLE has a well-known weakness on small samples: if you toss the coin three times and see three heads, the MLE is $\hat\theta = 1$, asserting that the coin will never land tails again. The Bayesian fix is to encode prior knowledge about $\theta$ in a prior distribution $p(\theta)$ and combine it with the likelihood through Bayes' rule: $$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) p(\theta)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \theta) p(\theta).$$ The maximum a posteriori estimate is the mode of the posterior, the parameter value that maximises the product of likelihood and prior, $$\hat\theta_{\text{MAP}} = \arg\max_\theta \log p(\mathcal{D} \mid \theta) + \log p(\theta).$$ Compared with MLE, the only change is the extra $\log p(\theta)$ term, which acts as a regulariser pulling the estimate towards values the prior considers plausible.
Worked example: Bernoulli with a Beta prior. Suppose we put a Beta(2, 2) prior on $\theta$. The Beta family is conjugate to the Bernoulli, so the posterior after $k$ heads in $n$ tosses is Beta($k+2$, $n-k+2$), whose mode is $$\hat\theta_{\text{MAP}} = \frac{k+1}{n+2}.$$ The prior has effectively added one pseudo-success and one pseudo-failure to the data. With three heads in three tosses we now estimate $\hat\theta_{\text{MAP}} = 4/5$ instead of the MLE's $1$. The estimate is smoothed: it never collapses to $0$ or $1$, no matter how small the sample. This is exactly the Laplace-smoothing trick that naive Bayes classifiers use when they encounter a word they have never seen.
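The conjugate update is simple enough to verify by hand; the following sketch just re-does the arithmetic from the worked example (three heads in three tosses).

```python
# With a Beta(2, 2) prior and k heads in n tosses, the posterior is
# Beta(k + 2, n - k + 2) and its mode is (k + 1) / (n + 2).
k, n = 3, 3
alpha_post, beta_post = k + 2, n - k + 2                        # Beta(5, 2) posterior

mle = k / n                                                     # 1.0: asserts tails are impossible
map_estimate = (alpha_post - 1) / (alpha_post + beta_post - 2)  # (k+1)/(n+2) = 0.8
print(mle, map_estimate)
```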
MAP and L2 regularisation. Place a Gaussian prior $\mathcal{N}(0, 1/\lambda)$ on the weights of a linear regression. Then $$\log p(\theta) = -\frac{\lambda}{2} \|\mathbf{w}\|^2 + \text{const},$$ and the MAP objective becomes the log-likelihood minus $\frac{\lambda}{2}\|\mathbf{w}\|^2$. We have just derived L2 regularisation (ridge regression) from a Bayesian prior. Likewise, an L1 penalty (lasso) corresponds to a Laplace prior, which is sharply peaked at zero and so encourages sparsity by leaving many weights at exactly zero. Many of the regularisers practitioners reach for instinctively are MAP estimates with a particular prior: the prior is the regulariser and the regulariser is the prior, viewed from two sides. This dual view is more than aesthetic. It tells you, when you choose a regularisation strength $\lambda$, that you are implicitly making a statement about how big you expect the weights to be a priori; it tells you, when a Bayesian colleague hands you a prior, how to translate it into the deterministic optimisation that your gradient-descent code already understands.
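A small sketch of the correspondence, assuming a linear-Gaussian model with unit noise variance; the synthetic data and the value of $\lambda$ are illustrative. Minimising $\frac{1}{2}\|\mathbf{y} - X\mathbf{w}\|^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$ has the closed-form solution $(X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$, which is exactly ridge regression.

```python
# Ridge regression as MAP: Gaussian likelihood plus N(0, 1/lambda) prior on the weights.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + rng.normal(scale=1.0, size=100)

lam = 2.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)  # ridge = MAP estimate
w_mle = np.linalg.solve(X.T @ X, X.T @ y)                    # ordinary least squares = MLE

print(np.linalg.norm(w_map), np.linalg.norm(w_mle))          # MAP weights are shrunk toward zero
```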
Full Bayesian inference
MAP gives back a single point estimate and is in this sense only half-Bayesian. Full Bayesian inference keeps the entire posterior $p(\theta \mid \mathcal{D})$ and propagates it into predictions. To predict a new observation $y_{\text{new}}$ at input $\mathbf{x}_{\text{new}}$, we average the likelihood over the posterior: $$p(y_{\text{new}} \mid \mathbf{x}_{\text{new}}, \mathcal{D}) = \int p(y_{\text{new}} \mid \mathbf{x}_{\text{new}}, \theta)\, p(\theta \mid \mathcal{D})\, d\theta.$$ This is the posterior predictive distribution. It says: rather than commit to the single best $\theta$, weight every possible $\theta$ by its posterior probability and combine the predictions. The result is automatically broader than any plug-in prediction would be, because it accounts for our uncertainty about $\theta$ itself. When the data are abundant the posterior concentrates around $\hat\theta$ and the predictive collapses to the plug-in answer; when the data are scarce the predictive is appropriately humble.
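For the coin example the posterior predictive has a closed form: averaging the Bernoulli likelihood over the Beta$(k+2,\, n-k+2)$ posterior gives $p(\text{heads} \mid \mathcal{D}) = (k+2)/(n+4)$, the posterior mean of $\theta$. The sketch below checks this against Monte Carlo averaging over posterior samples and contrasts it with the overconfident MLE plug-in; the sample count is an illustrative choice.

```python
# Posterior predictive for the Beta-Bernoulli coin: exact value vs Monte Carlo average.
import numpy as np
from scipy.stats import beta

k, n = 3, 3
alpha_post, beta_post = k + 2, n - k + 2          # Beta(5, 2) posterior

closed_form = alpha_post / (alpha_post + beta_post)   # 5/7 ≈ 0.714

theta_samples = beta.rvs(alpha_post, beta_post, size=100_000, random_state=0)
monte_carlo = theta_samples.mean()                    # ≈ 0.714, averaging predictions over the posterior

plug_in_mle = k / n                                   # 1.0: the overconfident plug-in answer
print(closed_form, monte_carlo, plug_in_mle)
```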
The integral above is rarely tractable. Two families of approximation dominate practice. Variational inference replaces the true posterior with a tractable family $q_\phi(\theta)$, typically a factorised Gaussian, and minimises the KL divergence $\mathrm{KL}(q_\phi \,\|\, p(\theta \mid \mathcal{D}))$, which is equivalent to maximising the evidence lower bound (ELBO). It is fast and amenable to gradient-based optimisation, and the same machinery powers the variational autoencoder you will meet later in the book. The price is bias: the chosen family rarely matches the true posterior exactly, and a factorised Gaussian in particular tends to underestimate posterior variance and miss multimodality. Markov chain Monte Carlo (MCMC) instead constructs a Markov chain whose stationary distribution is the true posterior and draws samples from it. Hamiltonian Monte Carlo and the No-U-Turn Sampler (NUTS) are the workhorses; they are unbiased in the limit but slow, and diagnosing convergence in high dimensions is itself a subtle craft. A third route, the Laplace approximation, fits a Gaussian to the posterior at its MAP mode using the Hessian of the log-posterior; it is fast, local, and a natural bridge between MAP and full Bayes that often suffices for moderately sized models.
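To make the MCMC idea concrete, here is a toy random-walk Metropolis sampler targeting the coin posterior from the earlier example. Real workflows would use HMC or NUTS through Stan, PyMC, or Pyro; the proposal width, iteration count, and burn-in below are illustrative.

```python
# Random-walk Metropolis on the Beta-Bernoulli posterior (three heads in three tosses).
import numpy as np

k, n = 3, 3

def log_posterior(theta):
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    # Bernoulli likelihood (k heads in n tosses) plus Beta(2, 2) prior,
    # both up to additive constants
    return (k * np.log(theta) + (n - k) * np.log(1 - theta)
            + np.log(theta) + np.log(1 - theta))

rng = np.random.default_rng(0)
theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.1)               # random-walk proposal
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal                                    # accept; otherwise keep current theta
    samples.append(theta)

print(np.mean(samples[5_000:]))   # ≈ 5/7, the exact Beta(5, 2) posterior mean
```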
The Bayesian payoff is calibrated uncertainty: a 90% posterior credible interval really does contain the true parameter with probability 90% if the prior and likelihood are correct. This is qualitatively different from a frequentist confidence interval, which makes a statement about the long-run frequency of a procedure rather than about the parameter itself. The cost is computational. Doing full Bayesian inference on a billion-parameter neural network is, today, beyond reach, which is why most large models are trained by MLE or MAP. Where full Bayesian inference shines is in problems with modest parameter counts and high stakes for getting the uncertainty right, clinical trial analysis, scientific measurement, A/B testing, small-sample social science, and the latent-variable models inside generative pipelines. Probabilistic programming languages such as Stan, PyMC and Pyro have made these workflows much more accessible than they were a decade ago, and the field is gradually pushing the practical frontier of "model size at which full Bayes is feasible" upward each year.
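For the coin example, a credible interval falls straight out of the Beta posterior; a sketch using scipy, with the 90% level chosen for illustration:

```python
# Equal-tailed 90% credible interval from the Beta(5, 2) posterior of the coin example.
from scipy.stats import beta

lower, upper = beta.ppf([0.05, 0.95], a=5, b=2)
print(lower, upper)   # theta lies in this interval with posterior probability 0.9
```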
MLE vs MAP vs full Bayesian
| Property | MLE | MAP | Full Bayesian |
|---|---|---|---|
| Output | Single $\hat\theta$ | Single $\hat\theta$ | Posterior $p(\theta \mid \mathcal{D})$ |
| Uses prior? | No | Yes | Yes |
| Computes uncertainty? | No (point estimate) | No | Yes |
| Cost | Cheap | Cheap | Expensive |
In practice, modern deep learning sits firmly in the MLE/MAP column. Standard cross-entropy training is MLE; weight decay turns it into MAP with a Gaussian prior. Bayesian neural networks (networks whose weights carry full posterior distributions, typically approximated by mean-field variational inference) are an active research area but are not yet standard in production. For practical uncertainty quantification, engineers reach instead for cheaper proxies: deep ensembles (train several networks from different initialisations and average their predictions), Monte Carlo dropout (interpret dropout at test time as approximate Bayesian sampling), and conformal prediction (a distribution-free wrapper that converts any point predictor into calibrated prediction sets). These give much of the benefit of full Bayesian inference without the prohibitive integration cost.
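Of the three proxies, conformal prediction is the easiest to show in a few lines. Below is a minimal sketch of split conformal for regression, with a trivial stand-in point predictor and an illustrative 90% coverage level; a real pipeline would plug in a trained model.

```python
# Split conformal prediction: calibrate residuals on held-out data, then form intervals.
import numpy as np

rng = np.random.default_rng(0)
y_calibration = rng.normal(loc=5.0, scale=2.0, size=500)   # held-out labels
predictions = np.full_like(y_calibration, 5.0)             # stand-in for a trained point predictor

# Conformity scores: absolute residuals on the calibration set
scores = np.abs(y_calibration - predictions)

# (1 - alpha) quantile of the scores, with the standard finite-sample correction
alpha = 0.1
level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
q = np.quantile(scores, level)

# Prediction interval for a new point: point prediction +/- q,
# which covers the true label with probability >= 1 - alpha
new_prediction = 5.0
print(new_prediction - q, new_prediction + q)
```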
A useful mental model: MLE is what you do when you have plenty of data and trust it; MAP is what you do when you have less data and want to inject prior knowledge gracefully; full Bayesian is what you do when you genuinely need to know how uncertain you are, for example when the cost of a confidently wrong prediction is high, as in medical imaging or autonomous driving. The three approaches are not rival philosophies so much as a sliding scale of how much you are willing to spend, in compute and in modelling effort, for a calibrated account of what you do not know. Moving from MLE to MAP to full Bayes trades speed for fidelity; moving the other way trades fidelity for speed. Most projects do not pick one and stick with it forever: they begin with MLE for prototyping, add a regulariser (silently doing MAP) when overfitting bites, and reach for proper posterior tools only on the components where calibrated uncertainty is genuinely required.
Why log-likelihood, not likelihood
Three reasons, in increasing order of subtlety. First, numerical stability. The likelihood of a dataset is a product of many small probabilities, and a product of a thousand numbers each around $0.01$ underflows to zero in floating-point arithmetic long before you finish multiplying. A 32-bit float can represent numbers down to about $10^{-38}$, so even a few hundred independent observations can crash the likelihood through the floor. Logarithms convert the product into a sum, and sums of moderately negative numbers stay representable for as long as you care to keep adding. Second, computational convenience. Differentiating a sum is term-by-term and easy; differentiating a product brings in the product rule, whose expansion has one term per factor, each itself a product of all the other factors. Gradient-based optimisers, the backbone of all modern training, work directly on log-likelihoods, and reverse-mode automatic differentiation is happiest with sums of log-densities. Third, statistical equivalence. The logarithm is monotonically increasing, so the $\theta$ that maximises $p(\mathcal{D} \mid \theta)$ is exactly the $\theta$ that maximises $\log p(\mathcal{D} \mid \theta)$. Nothing is lost by working on the log scale. A fourth bonus: many distributions belong to the exponential family, whose densities factor as $h(x)\exp(\eta(\theta)^\top T(x) - A(\theta))$. On the log scale, exponential-family densities become linear in the sufficient statistics, which is why so much of classical statistics has clean closed-form solutions and why the gradients of the log-likelihood for these distributions take such tidy forms.
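The underflow point is easy to demonstrate; the probability value and count below are illustrative.

```python
# The product of a thousand probabilities around 0.01 underflows to 0.0 in float32,
# while the sum of their logs is a perfectly ordinary number.
import numpy as np

probs = np.full(1000, 0.01, dtype=np.float32)

print(np.prod(probs))          # 0.0  (the true value, 1e-2000, underflows)
print(np.sum(np.log(probs)))   # ≈ -4605.17, i.e. 1000 * log(0.01)
```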
What you should take away
- Inference asks: given the data, what should I believe about the parameters? MLE answers with the value that makes the data most likely; Bayes answers with a full posterior distribution.
- The Bernoulli MLE for $k$ heads in $n$ tosses is $\hat\theta_{\text{MLE}} = k/n$. The MAP with a Beta(2, 2) prior is $(k+1)/(n+2)$, smoothed away from the boundaries.
- L2 weight decay is MAP with a Gaussian prior; L1 lasso is MAP with a Laplace prior. Most regularisers are priors in disguise.
- Full Bayesian inference predicts by integrating over the posterior. The integral is usually approximated by variational inference or MCMC; deep ensembles, MC dropout, and conformal prediction are practical substitutes.
- We optimise log-likelihoods, not likelihoods, for numerical stability, ease of differentiation, and statistical equivalence, and exponential-family distributions reward us with linearity in their sufficient statistics.