Maximum a posteriori (MAP) estimation is a point-estimation technique that chooses the parameter value maximising the posterior distribution given observed data. It sits between maximum likelihood estimation (MLE), which uses no prior, and full Bayesian inference, which keeps the entire posterior.
Definition
Given data $\mathcal{D}$, parameters $\theta$, likelihood $p(\mathcal{D} \mid \theta)$, and prior $p(\theta)$, Bayes' rule gives the posterior
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})}.$$
The MAP estimate is its mode:
$$\hat\theta_{\mathrm{MAP}} = \arg\max_\theta\, p(\theta \mid \mathcal{D}) = \arg\max_\theta\, p(\mathcal{D} \mid \theta) \, p(\theta).$$
The marginal likelihood $p(\mathcal{D})$ is constant in $\theta$ and so does not affect the optimisation. Equivalently, one minimises the negative log posterior
$$-\log p(\mathcal{D} \mid \theta) - \log p(\theta),$$
which decomposes into a data-fit term plus a regularisation term: the prior contributes a penalty on parameter values.
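As a concrete illustration (a minimal sketch, not from the source; the function names and the logistic-regression likelihood are illustrative assumptions), the following minimises the negative log posterior by gradient descent, with the data-fit and prior-penalty terms kept explicit:

```python
import numpy as np

def neg_log_posterior(theta, X, y, prior_var):
    """Negative log posterior = negative log likelihood + negative log prior (up to constants)."""
    logits = X @ theta
    nll = np.sum(np.log1p(np.exp(-(2 * y - 1) * logits)))  # Bernoulli/logistic data-fit term
    penalty = 0.5 / prior_var * np.sum(theta ** 2)          # isotropic Gaussian prior -> L2 penalty
    return nll + penalty

def map_estimate(X, y, prior_var=1.0, lr=0.05, n_steps=5000):
    """MAP estimate by gradient descent on the negative log posterior."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ theta)))      # predicted P(y = 1)
        grad = X.T @ (probs - y) + theta / prior_var    # gradient of data fit + gradient of penalty
        theta -= lr * grad
    return theta
```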
Priors as regularisers
Many regularised ML objectives are MAP estimates under specific priors:
- Gaussian prior $\theta \sim \mathcal{N}(0, \sigma^2 I)$: $-\log p(\theta) = \frac{1}{2\sigma^2}\|\theta\|_2^2 + \text{const}$, equivalent to L2 (ridge) regularisation.
- Laplace prior $\theta \sim \text{Laplace}(0, b)$: $-\log p(\theta) = \frac{1}{b}\|\theta\|_1 + \text{const}$, equivalent to L1 (lasso) regularisation, which encourages sparsity.
- Spike-and-slab priors induce explicit sparsity; Dirichlet priors smooth multinomial estimates.
Weight decay in deep learning corresponds to a Gaussian prior on the weights; lasso regression to a Laplace prior; and the add-one or add-$\alpha$ smoothing in naive Bayes to a Dirichlet prior on the multinomial probabilities.
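For the Gaussian case the correspondence can be written in closed form. A minimal sketch (assuming a linear model with Gaussian noise of variance `noise_var` and prior $\mathcal{N}(0, \text{prior\_var}\,I)$; the names are illustrative): the MAP estimate is the ridge solution with penalty $\lambda = \text{noise\_var}/\text{prior\_var}$, so a tighter prior means stronger shrinkage.

```python
import numpy as np

def ridge_as_map(X, y, noise_var=1.0, prior_var=10.0):
    """MAP for a linear-Gaussian model with an isotropic Gaussian prior.

    Identical to ridge regression with penalty lam = noise_var / prior_var.
    """
    lam = noise_var / prior_var
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```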
MAP versus MLE
MAP reduces to MLE under a uniform (improper) prior $p(\theta) \propto 1$. With informative priors, especially in the small-data regime, MAP regularises against overfitting and can incorporate domain knowledge (e.g. that effects are likely to be small, or that probabilities should be smoothed away from zero).
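A small numerical illustration of that small-data effect (a hypothetical Beta-Bernoulli coin example, not from the source): after three heads in three flips, the MLE declares the coin certain to land heads, while a Beta(2, 2) prior pulls the MAP estimate back towards 0.5.

```python
def coin_mle(heads, flips):
    """Maximum likelihood estimate of the heads probability."""
    return heads / flips

def coin_map(heads, flips, a=2.0, b=2.0):
    """MAP under a Beta(a, b) prior: mode of the Beta(heads + a, flips - heads + b) posterior."""
    return (heads + a - 1) / (flips + a + b - 2)

print(coin_mle(3, 3))  # 1.0
print(coin_map(3, 3))  # 0.8
```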
MAP versus full Bayesian inference
MAP returns a point estimate; full Bayesian inference produces a distribution over $\theta$ and propagates uncertainty into predictions via the posterior predictive $p(y^* \mid x^*, \mathcal{D}) = \int p(y^* \mid x^*, \theta) p(\theta \mid \mathcal{D}) \, d\theta$. MAP is computationally cheap (one optimisation, e.g. by gradient descent or coordinate descent); full Bayesian inference requires Markov chain Monte Carlo or variational methods.
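The difference shows up even in conjugate models where the posterior predictive is available in closed form. Continuing the hypothetical coin example above, the plug-in MAP prediction and the exact posterior predictive (the posterior mean, obtained by integrating over the Beta posterior) disagree:

```python
def plugin_predictive(heads, flips, a=2.0, b=2.0):
    """Plug-in prediction: P(next flip is heads) evaluated at the MAP point."""
    return (heads + a - 1) / (flips + a + b - 2)

def posterior_predictive(heads, flips, a=2.0, b=2.0):
    """Exact posterior predictive: the mean of the Beta(heads + a, flips - heads + b) posterior."""
    return (heads + a) / (flips + a + b)

print(plugin_predictive(3, 3))     # 0.8
print(posterior_predictive(3, 3))  # 0.714...
```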
Limitations
MAP has well-known shortcomings:
- No uncertainty quantification: a single point hides whether the posterior is sharply peaked or nearly flat.
- Not reparameterisation-invariant: under a nonlinear change of variables $\phi = g(\theta)$, the mode of the posterior over $\phi$ is not, in general, $g$ applied to the mode over $\theta$, because the transformed density picks up a Jacobian factor. Posterior means and full posteriors transform consistently; modes do not (see the sketch after this list).
- A poor representative of skewed or multimodal posteriors: the mode of a long-tailed distribution can sit in a region of negligible probability mass.
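The reparameterisation point can be checked numerically. A minimal sketch (assuming a hypothetical Beta(2, 5) posterior over a probability $\theta$, reparameterised as log-odds): the mode found in log-odds space maps back to a different probability than the mode found in probability space.

```python
import numpy as np
from scipy.optimize import minimize_scalar

a, b = 2.0, 5.0  # Beta(2, 5) posterior over a probability theta

# Mode in theta-space: (a - 1) / (a + b - 2).
mode_theta = (a - 1) / (a + b - 2)  # 0.2

# The density of phi = logit(theta) includes the Jacobian theta * (1 - theta),
# so its negative log density is -(a * log theta + b * log(1 - theta)).
def neg_log_density_phi(phi):
    theta = 1.0 / (1.0 + np.exp(-phi))
    return -(a * np.log(theta) + b * np.log(1.0 - theta))

mode_phi = minimize_scalar(neg_log_density_phi).x
print(mode_theta)                       # 0.2
print(1.0 / (1.0 + np.exp(-mode_phi)))  # ~0.286: the two modes do not correspond
```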
These limitations motivate moving up the ladder of Bayesian approximations: from MAP to Laplace approximation (a Gaussian centred at the MAP), to variational inference, to full MCMC.
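A minimal sketch of the first rung of that ladder, using the hypothetical Beta(5, 2) posterior from the coin example above (three heads in three flips under a Beta(2, 2) prior): the Laplace approximation places a Gaussian at the MAP with variance equal to the inverse curvature of the negative log posterior at the mode.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_post(theta, a=5.0, b=2.0):
    """Negative log density of the Beta(5, 2) posterior, up to a constant."""
    return -((a - 1) * np.log(theta) + (b - 1) * np.log(1.0 - theta))

# Step 1: the MAP is the mode of the posterior.
theta_map = minimize_scalar(neg_log_post, bounds=(1e-6, 1 - 1e-6), method="bounded").x

# Step 2: curvature of the negative log posterior at the mode (central finite difference).
h = 1e-5
curv = (neg_log_post(theta_map + h) - 2 * neg_log_post(theta_map) + neg_log_post(theta_map - h)) / h**2

# Laplace approximation: posterior ~ Normal(theta_map, 1 / curv).
print(theta_map, np.sqrt(1.0 / curv))  # mode ~0.8, approximate posterior s.d. ~0.18
```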
Related terms: Regularisation, Bayesian Inference
Discussed in:
- Chapter 6: ML Fundamentals, Bayesian Estimation