Bayesian inference treats parameters and hypotheses as random variables. Given a prior $p(\theta)$ and likelihood $p(\mathcal{D} | \theta)$, the posterior combines them via Bayes' theorem:
$$p(\theta | \mathcal{D}) = \frac{p(\mathcal{D} | \theta) \, p(\theta)}{p(\mathcal{D})}$$
The posterior is the basis for all subsequent inference: MAP estimation $\hat \theta = \arg\max_\theta p(\theta | \mathcal{D})$, the posterior mean $\bar \theta = \mathbb{E}[\theta | \mathcal{D}]$, credible intervals, and the posterior predictive distribution $p(x_\mathrm{new} | \mathcal{D}) = \int p(x_\mathrm{new} | \theta) \, p(\theta | \mathcal{D}) \, d\theta$.
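For intuition, here is a minimal numerical sketch (hypothetical coin-flip data and a uniform prior) that evaluates the posterior on a grid and reads off the MAP estimate, posterior mean, and posterior predictive:

```python
import numpy as np

# Hypothetical data: 7 heads out of 10 coin flips.
heads, n = 7, 10

# Discretize theta and apply Bayes' theorem numerically.
theta = np.linspace(1e-6, 1 - 1e-6, 1001)
prior = np.ones_like(theta)                           # uniform prior p(theta)
likelihood = theta**heads * (1 - theta)**(n - heads)  # p(D | theta)
unnorm = likelihood * prior
posterior = unnorm / np.trapz(unnorm, theta)          # normalize by the evidence p(D)

theta_map = theta[np.argmax(posterior)]               # MAP estimate
theta_mean = np.trapz(theta * posterior, theta)       # posterior mean
# Posterior predictive P(next flip = heads) = E[theta | D], i.e. the posterior mean.
p_next_heads = theta_mean

print(f"MAP={theta_map:.3f}, mean={theta_mean:.3f}, P(next=H)={p_next_heads:.3f}")
```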
Conjugate priors give closed-form posteriors: Beta-Bernoulli, Gaussian-Gaussian, Dirichlet-multinomial, Normal-Inverse-Wishart. Otherwise, approximate methods such as MCMC or variational inference are required.
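As an illustration of conjugacy, a Beta prior combined with Bernoulli observations yields a Beta posterior in closed form; the sketch below uses hypothetical counts and SciPy's `beta` distribution:

```python
from scipy import stats

# Beta-Bernoulli conjugacy: a Beta(a, b) prior and k successes in n trials
# give a Beta(a + k, b + n - k) posterior, with no numerical integration needed.
a, b = 2.0, 2.0          # hypothetical prior pseudo-counts
k, n = 7, 10             # observed successes / trials

posterior = stats.beta(a + k, b + n - k)
print("posterior mean:", posterior.mean())            # (a + k) / (a + b + n)
print("95% credible interval:", posterior.interval(0.95))
```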
Bayesian deep learning: place a prior on neural network weights and approximate the posterior over them. Methods include variational Bayesian neural networks, MC dropout, deep ensembles (an implicit Bayesian average), and Stein variational gradient descent. Practical Bayesian deep learning remains an open challenge: exact posteriors over network weights are intractable, and existing approximations have known weaknesses.
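One commonly used approximation is MC dropout; the sketch below (a hypothetical PyTorch regression model with illustrative layer sizes) keeps dropout active at prediction time and averages stochastic forward passes to estimate a predictive mean and uncertainty:

```python
import torch
import torch.nn as nn

# Illustrative model: a small MLP with dropout.
model = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=100):
    model.train()  # train mode keeps Dropout stochastic (eval mode would disable it)
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    # Mean over samples approximates the predictive mean; std is a rough uncertainty.
    return samples.mean(dim=0), samples.std(dim=0)

x = torch.randn(8, 4)                  # hypothetical batch of inputs
mean, std = mc_dropout_predict(model, x)
```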
Model selection: Bayesian model averaging weights models by their marginal likelihood $p(\mathcal{D} | \mathcal{M}) = \int p(\mathcal{D} | \theta, \mathcal{M}) \, p(\theta | \mathcal{M}) \, d\theta$. The Bayesian Information Criterion (BIC) is derived from a Laplace approximation to the log marginal likelihood: $-\mathrm{BIC}/2$ approximates $\log p(\mathcal{D} | \mathcal{M})$.
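For the Beta-Bernoulli model the marginal likelihood is available in closed form, so the exact log evidence can be compared with the BIC approximation (hypothetical counts below):

```python
import numpy as np
from scipy.special import betaln

# Exact log marginal likelihood for a Beta-Bernoulli model
# (k successes in n trials, Beta(a, b) prior):
#   p(D | M) = B(a + k, b + n - k) / B(a, b)
a, b, k, n = 1.0, 1.0, 7, 10
log_evidence = betaln(a + k, b + n - k) - betaln(a, b)

# BIC = d * ln(n) - 2 * ln p(D | theta_hat), with d = 1 free parameter here.
theta_hat = k / n
log_lik_hat = k * np.log(theta_hat) + (n - k) * np.log(1 - theta_hat)
bic = 1 * np.log(n) - 2 * log_lik_hat

print(f"log evidence = {log_evidence:.3f}, -BIC/2 = {-bic / 2:.3f}")
```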
Bayesian inference contrasts with frequentist statistics, which treats parameters as fixed unknowns. Both have their place: Bayesian methods excel when prior information is available and uncertainty quantification matters; frequentist methods excel at hypothesis testing with strict long-run error guarantees.
Related terms: Bayes' Theorem, MAP Estimation, MCMC, Variational Inference
Discussed in:
- Chapter 4: Probability