Bayes' theorem relates the posterior probability $P(H \mid E)$ of a hypothesis $H$ given evidence $E$ to the likelihood $P(E \mid H)$ and the prior $P(H)$:
$$P(H \mid E) = \frac{P(E \mid H) \, P(H)}{P(E)}$$
where the marginal $P(E) = \sum_{H'} P(E \mid H') P(H')$ normalises the posterior to sum to one. The theorem follows directly from the definition of conditional probability $P(A \mid B) = P(A, B) / P(B)$ applied symmetrically.
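Written out, the symmetric application takes two steps. The joint probability can be factored either way round,
$$P(H, E) = P(H \mid E) \, P(E) = P(E \mid H) \, P(H),$$
and dividing both sides by $P(E)$, assumed here to be non-zero, gives the theorem.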
Stated for parameters $\theta$ given data $\mathcal{D}$:
$$P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta) \, P(\theta)}{P(\mathcal{D})} \propto P(\mathcal{D} \mid \theta) \, P(\theta).$$
The proportionality is sufficient for many purposes: finding the $\theta$ that maximises the posterior (MAP estimation), drawing samples from it (MCMC), fitting an approximation to it (variational inference), or marginalising over it using such samples or approximations.
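As a rough illustration of why the normalising constant can often be skipped, the sketch below evaluates an unnormalised posterior on a grid. The coin-flip model, the uniform prior, and the name `grid_posterior` are assumptions chosen for the example; they do not come from the text above.

```python
def grid_posterior(heads, tails, grid_size=101):
    """Posterior over a coin's heads-probability theta on an evenly spaced grid.

    Assumes a uniform prior and a Bernoulli likelihood (illustrative choices).
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    prior = [1.0 / grid_size] * grid_size
    # Unnormalised posterior: likelihood times prior at each grid point.
    unnorm = [(t ** heads) * ((1 - t) ** tails) * p
              for t, p in zip(thetas, prior)]
    evidence = sum(unnorm)  # discrete stand-in for P(D)
    posterior = [u / evidence for u in unnorm]
    return thetas, unnorm, posterior


thetas, unnorm, posterior = grid_posterior(heads=7, tails=3)

# The arg max is found from the unnormalised values alone.
map_index = max(range(len(thetas)), key=lambda i: unnorm[i])
theta_map = thetas[map_index]
print(theta_map)
```

Dividing by the evidence rescales every grid value by the same positive constant, so the location of the maximum is unchanged. The constant matters only when actual probabilities are needed.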
The maximum likelihood estimator $\hat\theta_\mathrm{MLE} = \arg\max_\theta P(\mathcal{D} \mid \theta)$ ignores the prior; the maximum a posteriori estimator $\hat\theta_\mathrm{MAP} = \arg\max_\theta P(\mathcal{D} \mid \theta) P(\theta)$ includes it. Many regularisation schemes have a Bayesian reading: under MAP estimation, $L^2$ regularisation corresponds to a Gaussian prior and $L^1$ regularisation to a Laplace prior, while dropout can be interpreted as approximate Bayesian inference (Gal & Ghahramani 2016).
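The difference between the two estimators is easiest to see in a small conjugate example. The sketch below assumes a Bernoulli likelihood with a Beta($\alpha$, $\beta$) prior, a model picked for illustration rather than taken from the text. The function names are hypothetical, and the closed-form MAP expression assumes $\alpha, \beta \ge 1$ and a positive denominator.

```python
def bernoulli_mle(heads, n):
    """Maximum likelihood estimate of the heads-probability: the sample frequency."""
    return heads / n


def bernoulli_map(heads, n, alpha, beta):
    """MAP estimate under an assumed Beta(alpha, beta) prior.

    This is the mode of the Beta(heads + alpha, n - heads + beta) posterior,
    valid for alpha, beta >= 1 with a positive denominator.
    """
    return (heads + alpha - 1) / (n + alpha + beta - 2)


# Three flips, all heads: the MLE commits fully to theta = 1.
mle = bernoulli_mle(heads=3, n=3)                                # 1.0
# A Beta(2, 2) prior pulls the estimate back towards 0.5.
map_estimate = bernoulli_map(heads=3, n=3, alpha=2.0, beta=2.0)  # 0.8
# A flat Beta(1, 1) prior recovers the MLE.
map_flat = bernoulli_map(heads=3, n=3, alpha=1.0, beta=1.0)      # 1.0
print(mle, map_estimate, map_flat)
```

The prior behaves like extra pseudo-observations. This is the same shrinkage effect that a regularisation penalty has on a fitted parameter.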
Bayes' theorem is the foundation of Bayesian networks, probabilistic graphical models, Bayesian neural networks, Bayesian optimisation for hyperparameter search, and the Bayes-optimal classifier that minimises expected error.
In medical AI specifically, Bayes' theorem governs the relationship between disease prevalence, test sensitivity and specificity, and the posterior probability of disease given a test result. Physicians and laypeople alike routinely get this calculation wrong, most often by neglecting the prevalence (the base rate).
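A minimal sketch of the diagnostic calculation follows. The prevalence, sensitivity, and specificity values are made-up illustrative numbers, not figures for any real test, and `posterior_given_positive` is a hypothetical helper name.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    # P(positive and disease) = P(positive | disease) * P(disease)
    true_pos = sensitivity * prevalence
    # P(positive and no disease) = P(positive | no disease) * P(no disease)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    # The denominator is the marginal P(positive).
    return true_pos / (true_pos + false_pos)


# Illustrative numbers: 1% prevalence, 90% sensitivity, 95% specificity.
p = posterior_given_positive(prevalence=0.01, sensitivity=0.90, specificity=0.95)
print(round(p, 3))  # 0.154
```

Even with a fairly accurate test, a positive result here implies only about a 15% chance of disease, because false positives from the large healthy population outnumber true positives from the small affected one.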
Related terms: Maximum Likelihood Estimation, MAP Estimation, Bayesian Network, Judea Pearl
Discussed in:
- Chapter 4: Probability