5.6 MAP and Bayesian Inference

The previous section showed how maximum likelihood estimation (MLE) chooses the parameter value that makes the observed data most plausible. That is a powerful idea, but it has a peculiar feature: it pretends we know nothing at all about the parameter before the data arrive. If you flip a coin ten times and it lands heads zero times, MLE confidently announces that the probability of heads is exactly zero, that the coin will never, in any future toss, land heads. Anyone who has ever held a coin knows this is silly. We had a sense, before tossing, that the parameter was probably somewhere near a half. MLE has no way to use that sense.

Bayesian inference fixes this by writing the prior belief down explicitly. Where MLE picks the parameter most consistent with data alone, the maximum a posteriori (MAP) estimator and full Bayesian inference combine prior knowledge with data and report what the combination implies. Two practical consequences follow. First, estimates become smoother on small samples, because the prior keeps them from collapsing onto extreme values. Second, the answer is no longer a single number but a distribution, so we can quote uncertainty alongside the estimate without a separate ritual.

§5.7 uses the posterior distributions built here to construct credible intervals; §5.12 leans on the same Bayesian machinery for hierarchical models.

Symbols Used Here
  • $\theta$ : the parameters
  • $p(\theta)$ : the prior
  • $p(\mathcal{D} \mid \theta)$ : the likelihood
  • $p(\theta \mid \mathcal{D})$ : the posterior
  • $\hat\theta_{\text{MAP}}$ : the MAP estimate

From MLE to MAP

Bayes' theorem, written for parameters and data, is

$$p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta).$$

The left-hand side is the posterior, what we believe about $\theta$ after seeing the data. The first term on the right is the likelihood, the same object MLE optimises. The second term is the prior, which encodes what we believed before any data arrived. The hidden normalising constant is the marginal likelihood $p(\mathcal{D})$, which does not depend on $\theta$ and so does not change where the posterior peaks.

MAP picks the value of $\theta$ at which the posterior is highest. Because the logarithm is monotonic and turns products into sums, this is

$$\hat\theta_{\text{MAP}} = \arg\max_\theta\; \log p(\mathcal{D} \mid \theta) + \log p(\theta).$$

Compare this to MLE, which keeps only the first term. The second term is a steering wheel. If the prior is uniform over a wide range, meaning every value of $\theta$ in that range is equally plausible before the data, then $\log p(\theta)$ is a constant, drops out of the optimisation, and MAP and MLE give exactly the same answer. The moment the prior favours some values over others, MAP starts to pull the estimate towards those favoured regions.
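
To see the two terms of this sum at work numerically, here is a minimal sketch (illustrative numbers, not from the text) that finds the MAP estimate by brute-force grid search for a coin-flip likelihood and the Beta(2, 2) prior introduced in the next subsection.

```python
import numpy as np
from scipy.stats import beta, binom

# Data: k successes in n Bernoulli trials; prior: Beta(2, 2). Illustrative values.
k, n = 3, 10
a, b = 2.0, 2.0

thetas = np.linspace(1e-6, 1 - 1e-6, 10_000)
log_likelihood = binom.logpmf(k, n, thetas)   # log p(D | theta)
log_prior = beta.logpdf(thetas, a, b)         # log p(theta)
log_posterior = log_likelihood + log_prior    # up to an additive constant

theta_mle = thetas[np.argmax(log_likelihood)]
theta_map = thetas[np.argmax(log_posterior)]
print(theta_mle, theta_map)   # ~0.300 (MLE) vs ~0.333 (MAP, pulled towards the prior's 0.5)
```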

That pull is what people mean by regularisation. A prior that prefers small parameter values pulls MAP estimates towards zero. A prior that prefers parameters near a previous study's answer pulls MAP estimates towards that study. The strength of the pull is proportional to how peaked the prior is and inversely proportional to how much data we have. With a tonne of data, the likelihood overwhelms the prior and MAP looks like MLE. With very little data, the prior dominates and MAP looks like the prior's mode. This automatic balancing is one of the most useful features of Bayesian thinking.

Worked: Bernoulli with Beta prior

Coin flips, button clicks, treatment successes, anything that is either yes or no, follow a Bernoulli distribution with a single parameter $\theta$, the probability of "yes". The natural prior for $\theta$, which lives between zero and one, is the Beta distribution. A Beta($\alpha, \beta$) prior has two shape parameters that act as pseudo-counts: $\alpha$ imaginary successes and $\beta$ imaginary failures observed before any real data. Beta(1, 1) is flat, the uniform distribution on $[0, 1]$. Beta(2, 2) is gently humped near a half. Beta(50, 30) is sharply concentrated around $0.625$ and represents strong prior belief.

The reason this prior is so popular is that it pairs cleanly with the Bernoulli likelihood. After observing $k$ successes in $n$ trials, the posterior is

$$p(\theta \mid \mathcal{D}) = \operatorname{Beta}(\alpha + k,\; \beta + n - k).$$

The pseudo-counts and the real counts simply add. This is the cleanest example of conjugacy, prior and posterior in the same family, and it lets us update belief by adding numbers rather than computing integrals.

The mode of a Beta($a, b$) distribution, when both $a$ and $b$ exceed one, is $(a - 1) / (a + b - 2)$. Substituting our updated parameters gives the MAP estimate

$$\hat\theta_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}.$$

With the gently informative Beta(2, 2) prior, this collapses to $(k + 1) / (n + 2)$, the formula known since Laplace as the rule of succession (Laplace reached it as the posterior mean under a flat Beta(1, 1) prior; here the same expression appears as a MAP estimate).

Now consider the unsettling case from the opening of this section: zero successes in ten trials. MLE gives $\hat\theta_{\text{MLE}} = 0/10 = 0$, the embarrassing claim that future tosses can never succeed. With a Beta(2, 2) prior, MAP gives

$$\hat\theta_{\text{MAP}} = \frac{0 + 1}{10 + 2} = \frac{1}{12} \approx 0.083.$$

The prior smooths the estimate away from the boundary. We have not pretended the data say something they do not, ten failures genuinely suggest a small probability, but we have refused to commit to literal impossibility on the basis of ten observations. Run another ten trials and observe one success, and the MAP estimate will move further; run a thousand trials and observe none, and MAP will be almost indistinguishable from zero, because the data have finally become numerous enough to overwhelm the prior. This is exactly the behaviour we want: gentle when evidence is thin, faithful when evidence is thick.
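
The closed-form mode makes both of these claims easy to check. A minimal sketch:

```python
def map_beta_bernoulli(k, n, alpha, beta):
    """Mode of the Beta(alpha + k, beta + n - k) posterior (valid when both parameters exceed 1)."""
    return (alpha + k - 1) / (alpha + beta + n - 2)

# Zero successes in ten trials with a Beta(2, 2) prior: 1/12, as above.
print(map_beta_bernoulli(0, 10, 2, 2))      # 0.0833...
# Zero successes in a thousand trials: the data dominate and the estimate is nearly zero.
print(map_beta_bernoulli(0, 1000, 2, 2))    # 0.000998...
```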

The same trick rescues n-gram language models. If a particular word never appeared after "the quick" in your training corpus, MLE would assign it zero probability, and a single such word in the test set would make the perplexity of the entire model infinite. Adding one phantom occurrence of every word, additive smoothing, is precisely the MAP estimate under a Dirichlet(2, 2, …, 2) prior (equivalently, the posterior mean under a uniform Dirichlet(1, 1, …, 1) prior). The fix that engineers reached for on practical grounds turns out to be Bayes in disguise.
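
A minimal sketch of that correspondence, with hypothetical counts: the mode of the Dirichlet posterior with all pseudo-counts set to 2 reproduces add-one smoothing exactly.

```python
import numpy as np

def dirichlet_map(counts, alpha):
    """Mode of the Dirichlet(alpha + counts) posterior for a symmetric prior,
    assuming every entry of alpha + counts exceeds 1."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha - 1) / (counts.sum() + counts.size * (alpha - 1))

counts = np.array([7, 2, 0, 1])            # hypothetical word counts after some context
print(dirichlet_map(counts, alpha=2.0))    # identical to add-one smoothing below
print((counts + 1) / (counts.sum() + counts.size))
```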

Conjugate priors

A prior is conjugate to a likelihood when the resulting posterior belongs to the same family as the prior. Beta-Bernoulli is the prototypical example. A few others appear repeatedly across machine learning:

  • Beta-Bernoulli, for binary outcomes, as above.
  • Dirichlet-Categorical, the multivariate generalisation, used wherever we need a probability distribution over $K$ discrete outcomes. The smoothing in n-gram language models and the topic-word distributions in Latent Dirichlet Allocation are both Dirichlet-Categorical updates.
  • Normal-Normal, a Gaussian prior on the mean of a Gaussian likelihood gives a Gaussian posterior. The posterior mean is a precision-weighted average of the prior mean and the data mean, which is the cleanest possible illustration of how Bayes balances prior and evidence (see the sketch after this list).
  • Gamma-Poisson, a Gamma prior on a Poisson rate gives a Gamma posterior. Used for count data: hospital admissions, web requests per second, click-through rates.
  • Normal-Inverse-Wishart, for Gaussian data with both unknown mean and unknown covariance. Underlies Bayesian Gaussian discriminant analysis and many Bayesian mixture models.
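
As a concrete instance of the precision-weighted average in the Normal-Normal case, here is a minimal sketch with made-up measurements; the data standard deviation is assumed known.

```python
import numpy as np

def normal_normal_posterior(x, sigma, mu0, tau0):
    """Posterior over a Gaussian mean with known data sd `sigma`,
    given a Normal(mu0, tau0^2) prior. Returns (posterior mean, posterior sd)."""
    x = np.asarray(x, dtype=float)
    prior_precision = 1.0 / tau0**2
    data_precision = x.size / sigma**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * mu0 + data_precision * x.mean())
    return post_mean, np.sqrt(post_var)

# Hypothetical data: a few noisy measurements, prior belief centred at 0.
x = np.array([1.2, 0.8, 1.5, 1.1])
print(normal_normal_posterior(x, sigma=1.0, mu0=0.0, tau0=1.0))
# The posterior mean (~0.92) sits between the prior mean (0) and the sample mean (1.15),
# weighted by the precisions of prior and data.
```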

For decades these conjugate pairs were not just convenient: they were essentially the only Bayesian inferences anyone could compute. The integrals required for non-conjugate models were prohibitive on pre-1990s hardware, and the catalogue of conjugate priors was effectively the catalogue of feasible Bayesian models. The widespread adoption of Markov chain Monte Carlo (MCMC) in the 1990s, and of variational inference shortly after, freed the field from this constraint, and modern Bayesian deep learning rarely uses conjugate priors at all. The conjugate cases remain valuable as building blocks, sanity checks and pedagogical tools, but the wider Bayesian world is no longer limited to them.

Full Bayesian inference

MAP keeps only the mode of the posterior, one number out of the entire distribution. Full Bayesian inference keeps the whole thing and uses it whenever a prediction is required. The key object is the posterior predictive distribution:

$$p(y_{\text{new}} \mid \mathbf{x}_{\text{new}}, \mathcal{D}) = \int p(y_{\text{new}} \mid \mathbf{x}_{\text{new}}, \theta)\, p(\theta \mid \mathcal{D})\, d\theta.$$

Read the right-hand side from inside out. For each value of $\theta$, the model under that $\theta$ assigns a probability to the new outcome. We then average those probabilities, weighting each by how plausible $\theta$ is given the data we have already seen. Predictions made this way automatically widen when the parameters are uncertain and tighten when the data have pinned the parameters down.
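
For the Beta-Bernoulli coin example (Beta(2, 2) prior, 3 successes in 10 trials, illustrative numbers) the integral has a closed form, which makes it a convenient check on the sampling view of this average. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Posterior after k successes in n trials with a Beta(alpha, beta) prior.
alpha, beta, k, n = 2.0, 2.0, 3, 10
theta_samples = rng.beta(alpha + k, beta + n - k, size=100_000)

# Posterior predictive P(next outcome = 1): average p(y = 1 | theta) over posterior draws.
print(theta_samples.mean())               # ~0.357 by Monte Carlo
print((alpha + k) / (alpha + beta + n))   # 5/14 = 0.357..., the exact answer
```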

The trouble is that the integral in this expression is rarely something we can compute by hand. For non-conjugate priors there is no closed form, and for high-dimensional models, neural networks especially, even numerical integration is hopeless. Three families of approximation dominate practice.

Variational inference replaces the true posterior with a member of a tractable family (often a product of Gaussians) chosen to be as close to the true posterior as possible in KL divergence. The intractable integration problem becomes a manageable optimisation problem; this approach is the engine behind variational autoencoders and Bayesian neural networks trained with stochastic variational inference.

Markov chain Monte Carlo sidesteps the integral entirely by drawing samples from the posterior; any expectation under the posterior is then approximated by averaging the corresponding quantity over the samples. Hamiltonian Monte Carlo and the No-U-Turn Sampler power Stan and PyMC; Gibbs sampling drove the first generation of LDA topic models. MCMC is slow but, given enough time, asymptotically exact.
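
To make the sampling idea concrete, here is a toy random-walk Metropolis sampler targeting the Beta(5, 9) posterior from the coin example; the step size and chain length are arbitrary illustrative choices, and real work would reach for Stan, PyMC or a similar library.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 5.0, 9.0   # target: Beta(5, 9), i.e. Beta(2, 2) prior plus 3 successes in 10 trials

def log_post(theta):
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)

theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.1)           # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                               # accept; otherwise keep the old value
    samples.append(theta)

samples = np.array(samples[2_000:])                    # discard burn-in
print(samples.mean(), samples.std())                   # ~0.357 and ~0.124, matching Beta(5, 9)
```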

Laplace approximation fits a Gaussian to the posterior centred at the MAP estimate, with covariance set by the inverse Hessian of the negative log-posterior. It is fast, often surprisingly accurate near the mode, and forms the basis of several recent neural-network uncertainty methods.
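
A minimal sketch for the same Beta(5, 9) posterior: take the mode, estimate the curvature of the negative log-posterior there by finite differences, and compare the resulting Gaussian's standard deviation with the exact posterior's.

```python
import numpy as np
from scipy.stats import beta

a, b = 5.0, 9.0   # Beta(5, 9) posterior: Beta(2, 2) prior plus 3 successes in 10 trials

def neg_log_post(theta):
    # Negative log-density of Beta(a, b), up to an additive constant.
    return -((a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta))

theta_map = (a - 1) / (a + b - 2)   # mode of the Beta posterior

# Curvature at the mode by central finite differences; Laplace sd = 1 / sqrt(Hessian).
h = 1e-5
hessian = (neg_log_post(theta_map + h) - 2 * neg_log_post(theta_map)
           + neg_log_post(theta_map - h)) / h**2
laplace_sd = 1.0 / np.sqrt(hessian)

print(theta_map, laplace_sd)        # Gaussian approximation N(0.333, 0.136^2)
print(beta(a, b).std())             # exact posterior sd ~0.124, reasonably close
```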

The choice between these is a practical engineering trade-off, but the conceptual picture is the same: keep the whole posterior, integrate over it when you predict, and accept that the integral will usually be approximate.

Regularisation as MAP

One of the cleanest results in modern machine learning is that nearly every regulariser practitioners reach for turns out to be a MAP estimate under some prior.

L2 regularisation, which adds $\lambda \|\theta\|^2$ to the loss, is exactly MAP under a zero-mean Gaussian prior $\theta \sim \mathcal{N}(0, 1/(2\lambda))$. The penalty for large weights is the negative log of a Gaussian, and the penalty strength is tied to the prior variance: the tighter the prior, the larger $\lambda$. Take the limit $\lambda \to 0$ and the prior becomes flat and MAP becomes MLE.
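
The equivalence is easy to verify numerically for linear regression, assuming Gaussian noise with unit variance (an assumption of this sketch, with made-up data): the ridge solution and the MAP estimate under the corresponding Gaussian prior are the same vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.5
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Penalised least squares: minimise (1/2) * ||y - X w||^2 + lam * ||w||^2.
w_ridge = np.linalg.solve(X.T @ X + 2 * lam * np.eye(d), X.T @ y)

# MAP for Bayesian linear regression: unit noise variance, prior w ~ N(0, 1/(2*lam) I).
prior_var = 1.0 / (2 * lam)
w_map = np.linalg.solve(X.T @ X + np.eye(d) / prior_var, X.T @ y)

print(np.allclose(w_ridge, w_map))   # True: the two derivations give identical weights
```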

L1 regularisation, the $\lambda \|\theta\|_1$ penalty that drives Lasso and many sparse models, is MAP under a Laplace (double-exponential) prior. The Laplace distribution has a sharper peak at zero than a Gaussian, which is why L1 produces estimates that are exactly zero rather than merely small.
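
In the simplest one-dimensional setting (a single observation $y$ with unit noise, purely illustrative) both MAP estimates can be written down directly, and the difference is visible: the Laplace prior snaps small estimates to exactly zero, while the Gaussian prior only shrinks them.

```python
import numpy as np

# L1 (Laplace prior) MAP: minimise (1/2)(y - w)^2 + lam * |w|  ->  soft thresholding.
# L2 (Gaussian prior) MAP: minimise (1/2)(y - w)^2 + lam * w^2 ->  plain shrinkage.
lam = 1.0
for y in [0.5, 1.5, 3.0]:
    w_l1 = np.sign(y) * max(abs(y) - lam, 0.0)   # exactly zero when |y| <= lam
    w_l2 = y / (1 + 2 * lam)                     # shrunk towards zero, never exactly zero
    print(y, w_l1, w_l2)
```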

Dropout, which randomly zeros activations during training, was reinterpreted by Gal and Ghahramani in 2016 as approximate Bayesian model averaging over a set of thinned subnetworks. Running dropout at test time and averaging the predictions becomes a Monte Carlo estimate of the posterior predictive.
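
A minimal Monte Carlo dropout sketch in PyTorch, using a hypothetical untrained toy network purely to show the mechanics: keep dropout active at test time and treat each stochastic forward pass as an approximate posterior sample.

```python
import torch
import torch.nn as nn

# Hypothetical toy regressor with a dropout layer (untrained; for illustration only).
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))

x = torch.randn(1, 4)
model.train()   # keeps dropout switched on, unlike model.eval()
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)])   # 100 stochastic passes

mean, std = samples.mean(dim=0), samples.std(dim=0)
print(mean.item(), std.item())   # predictive mean and a rough uncertainty estimate
```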

Early stopping, halting training before convergence, has no such clean Bayesian reading, but it behaves empirically like an implicit regulariser, keeping weights closer to their initialisation and so closer to the prior.

The unification matters because it gives every penalty a probabilistic interpretation, and every Bayesian model an optimisation interpretation. When practitioners argue about whether to add an L2 penalty, they are arguing about whether they believe parameters are a priori small. When they choose Lasso over Ridge, they are choosing a heavier-tailed prior. This duality is one of the conceptual victories of Bayesian thinking in modern machine learning.

Bayesian deep learning

A Bayesian neural network places a prior over the weights, typically an isotropic Gaussian, and approximates the posterior using one of the methods from the previous subsection. The output is no longer a point prediction but a distribution over predictions, and from that distribution we can read off useful properties.

Three applications stand out. Active learning asks which input, if labelled, would most reduce posterior uncertainty; this is straightforward to compute given a posterior and impossible without one. Out-of-distribution detection uses the posterior variance as a signal: a Bayesian model presented with an input unlike anything in training tends to give widely varying predictions across the posterior, an effective "I don't know" that point estimates cannot express. Continual learning uses the posterior after one task as the prior for the next, a principled way to remember without storing old data.

Computational cost remains the central challenge. Full posterior inference over millions of weights is hard, and several cheaper alternatives have emerged: deep ensembles (train many networks from different initialisations and treat them as posterior samples), Monte Carlo dropout (use dropout at test time), and conformal prediction (a non-Bayesian framework that nonetheless gives calibrated prediction intervals). None of these is fully Bayesian, but each captures part of what a posterior would have given us, and together they make uncertainty-aware deep learning practical today.

What you should take away

  1. Bayes' rule combines a prior with a likelihood to produce a posterior; MAP picks the posterior's mode; full Bayesian inference keeps the entire posterior and integrates over it when predicting.
  2. With a uniform prior, MAP equals MLE; with any informative prior, MAP regularises towards the prior's mode, gently when data are plentiful and strongly when they are sparse.
  3. The Beta-Bernoulli case rescues MLE from absurd estimates: zero successes in ten trials gives MLE = 0 but MAP with a Beta(2, 2) prior gives 1/12, a sensible smoothed value.
  4. Conjugate priors give closed-form posteriors and were historically essential; today MCMC, variational inference and Laplace approximation let us do Bayesian inference for almost any model.
  5. L2 and L1 regularisation, dropout and many other deep-learning tricks are MAP estimates or approximate Bayesian inference under specific priors; Bayes is hiding inside most of the machinery you already use.
