4.3 Bayes' theorem in depth

Video (1:18): Bayesian update, prior to posterior. A prior meets the likelihood and the posterior emerges between them.
Bayes' theorem is the mathematical rule for changing your mind. You start with some belief about how the world is. Then you see something, a test result, a sensor reading, a word in an email. Bayes' theorem tells you, exactly and unambiguously, what your new belief should be once that fresh information has been folded in. It is the single most important formula in probabilistic AI. Medical diagnosis runs on it. Spam filters run on it. Self-driving cars use it dozens of times a second to fuse cameras, radar and lidar into a single picture of the road. Modern Bayesian deep learning, where neural networks output not just a guess but a calibrated measure of their own uncertainty, is built on it from the ground up.

Bayes' theorem is a short algebraic rearrangement of the conditional-probability formula, and yet it carries enormous practical weight because it lets us reverse the direction of conditioning. Often we know how likely the data is given a hypothesis, but what we actually want is how likely the hypothesis is given the data. Bayes turns one into the other.

Symbols Used Here
$P(A \mid B)$ : conditional probability of $A$ given $B$
$P(A, B) = P(A \cap B)$ : joint probability of $A$ and $B$
$H$ : hypothesis (e.g. "patient has the disease")
$D$ : data (e.g. "the test came back positive")
$\sum_H$ : sum over all possible hypotheses

The theorem

We begin with the definition of conditional probability from §4.2. For any two events $A$ and $B$ with non-zero probability, $$P(A \mid B) = \frac{P(A, B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A, B)}{P(A)}.$$ The left equation says: the chance of $A$ given that $B$ has happened equals the chance of both happening, divided by the chance of $B$ alone. The right equation says the same thing with the roles swapped. Notice that the joint probability $P(A, B)$, the probability that both events occur, appears on the right-hand side of both equations. That is the key. Multiplying out, we have $P(A, B) = P(A \mid B) P(B)$ and equally $P(A, B) = P(B \mid A) P(A)$. The two right-hand sides must be equal, because they are both equal to the same joint probability: $$P(A \mid B) P(B) = P(B \mid A) P(A).$$ Dividing through by $P(B)$ gives the result: $$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$ This is Bayes' theorem.

It was first written down by the Reverend Thomas Bayes, an English Presbyterian minister and amateur mathematician, in an essay that his friend Richard Price found among his papers and submitted to the Royal Society in 1763, two years after Bayes' death. The form we use today, and the recognition of the theorem as a general principle of inference rather than a curiosity about billiard tables, is largely due to Pierre-Simon Laplace, who rediscovered it independently and applied it to astronomy, demography and jurisprudence in his Théorie analytique des probabilités of 1812.
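The algebra can also be checked numerically. The sketch below uses a made-up joint distribution over two binary events, computes both conditionals directly, and confirms that Bayes' theorem recovers one from the other; the numbers are purely illustrative.

```python
# Toy check of the derivation above: start from an invented joint distribution
# over two binary events A and B, compute P(A | B) directly, and confirm that
# Bayes' theorem gives the same answer via P(B | A).

joint = {  # P(A, B) for every combination; values sum to one
    (True, True): 0.12,
    (True, False): 0.18,
    (False, True): 0.28,
    (False, False): 0.42,
}

p_a = joint[(True, True)] + joint[(True, False)]   # P(A) = 0.30
p_b = joint[(True, True)] + joint[(False, True)]   # P(B) = 0.40

p_a_given_b = joint[(True, True)] / p_b            # direct: P(A | B) = 0.3
p_b_given_a = joint[(True, True)] / p_a            # direct: P(B | A) = 0.4

bayes = p_b_given_a * p_a / p_b                    # Bayes: P(B | A) P(A) / P(B)
print(p_a_given_b, bayes)                          # both 0.3, up to float rounding
```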

The algebra takes one line. The implications fill libraries. The reason is that the formula does something strange: it lets us swap the direction of an arrow. Most of the time, science gives us a forward model: given a cause, what effects do we expect? But the questions we actually want to answer run backwards: given the effects we have observed, what cause is most likely? Bayes' theorem is the bridge between these two directions, and the rest of this section is about reading what the bridge says.

Posterior, likelihood, prior, evidence

In nearly every AI application, the two events have specific roles. One is a hypothesis $H$, a proposition about the world whose truth we want to assess. The other is some data $D$, an observation we have actually made. Setting $A = H$ and $B = D$ in the formula gives $$P(H \mid D) = \frac{P(D \mid H) P(H)}{P(D)}.$$ Each of the four quantities in this equation has a name, and learning the names is half the battle. Once you can read the formula in words rather than symbols, the rest follows naturally.

The quantity on the left, $P(H \mid D)$, is called the posterior. It is what you believe about the hypothesis after you have seen the data. This is almost always what you actually want. A radiologist wants to know how likely cancer is given the scan, not how likely the scan is given cancer. An email client wants to know how likely a message is spam given its words, not how likely those words are given that it is spam.

The first factor on the right, $P(D \mid H)$, is called the likelihood. It tells us how plausible the observed data is if the hypothesis were true. Crucially, the likelihood is a function of $H$ for fixed observed $D$, and it does not have to integrate to one when viewed that way: it is not a probability distribution over hypotheses. Forward models in physics, biology and engineering, the things scientists usually build, give us likelihoods.

The second factor, $P(H)$, is the prior. It is your belief about the hypothesis before you saw any data. Priors come from background knowledge, previous experiments, or in some cases simple frequencies in the population. Beginners often find priors uncomfortable, because they look subjective. The trick is to remember that not having a prior is impossible: refusing to choose one just hides the choice you have already made.

The denominator, $P(D)$, is called the evidence or the marginal likelihood. It is the probability of the data averaged across all possible hypotheses, $P(D) = \sum_H P(D \mid H) P(H)$. For binary problems this sum has just two terms, one for $H$ and one for "not $H$". The evidence acts as a normaliser: it is the constant that ensures the posterior probabilities sum to one. In many practical settings we never actually compute $P(D)$; we simply note that the posterior is proportional to likelihood times prior, $P(H \mid D) \propto P(D \mid H) P(H)$, and recover the constant at the end.

In words, then, Bayes' theorem reads: posterior is proportional to likelihood times prior. New belief equals old belief, multiplied by how well the data fits.
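That reading, multiply the prior by the likelihood and then normalise, translates directly into code. Below is a minimal sketch of a discrete Bayesian update; the function name, the dictionary representation and the example numbers are illustrative choices, not a standard API.

```python
def bayes_update(prior, likelihood):
    """Posterior over discrete hypotheses: likelihood times prior, then normalise.

    prior      : dict mapping each hypothesis to P(H)
    likelihood : dict mapping each hypothesis to P(D | H) for the observed data D
    """
    unnormalised = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(unnormalised.values())        # P(D) = sum over H of P(D | H) P(H)
    return {h: p / evidence for h, p in unnormalised.items()}

# Example: a three-hypothesis problem with made-up numbers.
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihood = {"h1": 0.10, "h2": 0.40, "h3": 0.05}
print(bayes_update(prior, likelihood))           # h2 now dominates, at about 0.67
```

The evidence never needs to be known in advance: it is simply whatever constant makes the unnormalised products sum to one.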

Worked example: medical testing

The single best way to see Bayes' theorem in action is the medical-screening example. The numbers are simple but the answer is famously counterintuitive, and every student of probabilistic AI should work through it at least once.

Suppose a disease has a prevalence of one in a thousand in the relevant population, so $P(\text{disease}) = 0.001$. A screening test for it is highly accurate. Its sensitivity, the probability that it correctly flags a sick patient, is $P(\text{+ test} \mid \text{disease}) = 0.99$. Its specificity, the probability that it correctly clears a healthy patient, is $P(\text{- test} \mid \text{no disease}) = 0.95$, so its false-positive rate is $0.05$.

You take the test. It comes back positive. What is the probability that you actually have the disease, that is, what is $P(\text{disease} \mid \text{+})$?

A natural first guess, based on the headline accuracy figures, is somewhere around 95 to 99 per cent. That guess is dramatically wrong. Bayes' theorem tells us so: $$P(\text{disease} \mid \text{+}) = \frac{P(\text{+} \mid \text{disease}) P(\text{disease})}{P(\text{+})}.$$ We know the numerator: $0.99 \times 0.001 = 0.00099$. We need the denominator, the marginal probability of a positive test result regardless of disease status. Splitting on disease and no-disease, $$P(\text{+}) = P(\text{+} \mid \text{disease}) P(\text{disease}) + P(\text{+} \mid \text{no disease}) P(\text{no disease})$$ $$= 0.99 \cdot 0.001 + 0.05 \cdot 0.999 = 0.00099 + 0.04995 = 0.05094.$$ Therefore $$P(\text{disease} \mid \text{+}) = \frac{0.99 \cdot 0.001}{0.05094} \approx 0.0194.$$ Less than two per cent. After a positive result from a test that is right ninety-nine per cent of the time on the sick and ninety-five per cent of the time on the healthy, the probability that you actually have the disease is still only about one in fifty.
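The same arithmetic in code, as a minimal sketch using exactly the numbers above:

```python
# The screening calculation from the text, spelled out step by step.
prevalence  = 0.001   # P(disease), the prior
sensitivity = 0.99    # P(+ | disease)
false_pos   = 0.05    # P(+ | no disease) = 1 - specificity

p_pos = sensitivity * prevalence + false_pos * (1 - prevalence)   # P(+) = 0.05094
p_disease_given_pos = sensitivity * prevalence / p_pos

print(round(p_disease_given_pos, 4))   # 0.0194 -- under two per cent
```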

The reason becomes obvious as soon as you visualise it. Imagine ten thousand people taking the test. About ten of them have the disease, and the test correctly flags roughly all ten. The other 9,990 are healthy, but five per cent of them, about 500 people, get a false positive. In the room of everyone who tested positive, you find ten true cases and five hundred false alarms, so the chance that any given positive is real is about $10 / 510 \approx 0.0196$. Bayes' theorem is doing the same arithmetic as the room-counting, just with symbols.

The lesson is the base-rate fallacy: when an event is genuinely rare, even a very accurate test produces more false positives than true positives, because the false-positive rate operates on a much larger group. Ignoring the prior (the base rate of one in a thousand) gives nonsense. Including it gives the right answer. Every AI practitioner working with screening, fraud detection, anomaly detection, or any rare-event classifier needs to keep this example in mind. Headline accuracy figures, treated in isolation, mislead.

Sequential application

Bayes' theorem can be applied repeatedly. After you have updated your belief once on the basis of $D_1$, your posterior $P(H \mid D_1)$ becomes your new prior, and another piece of data $D_2$ is folded in by another application of the rule: $$P(H \mid D_1, D_2) = \frac{P(D_2 \mid H, D_1) P(H \mid D_1)}{P(D_2 \mid D_1)}.$$ If $D_1$ and $D_2$ are conditionally independent given $H$, that is, if knowing the hypothesis explains away any relationship between the two observations, this collapses to the more digestible $$P(H \mid D_1, D_2) \propto P(D_2 \mid H) P(H \mid D_1).$$ Each new observation simply multiplies the running posterior by the corresponding likelihood. Continuing the medical example: suppose a second, independent test is run on a fresh sample, and it too comes back positive. The posterior after the first positive was about $0.0194$, corresponding to odds of roughly $1$ to $50$. A second positive multiplies the odds in favour of disease by the test's likelihood ratio, $0.99 / 0.05 = 19.8$, pushing the posterior up sharply. Two positives convert one-in-fifty into a little over one-in-four (about $0.28$); a third positive, on the same numbers, would take it to nearly nine-in-ten (about $0.89$).
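The sequential update is easiest to follow in odds form. The sketch below reuses the screening numbers from the worked example and folds in one positive test per loop iteration; it is an illustration of the arithmetic, not a clinical calculation.

```python
# Sequential Bayesian updating in odds form:
# posterior odds = prior odds x likelihood ratio, once per independent positive test.
prevalence  = 0.001
sensitivity = 0.99
false_pos   = 0.05

likelihood_ratio = sensitivity / false_pos      # 19.8 per positive result
odds = prevalence / (1 - prevalence)            # prior odds, about 1 to 999

for n_positives in range(1, 4):
    odds *= likelihood_ratio                    # fold in one more positive test
    posterior = odds / (1 + odds)               # convert odds back to probability
    print(n_positives, round(posterior, 3))     # 1: 0.019, 2: 0.282, 3: 0.886
```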

This sequential structure is the mathematical basis for cascade or staged testing in medicine, and for the layered classifiers used throughout AI. A cheap, broad screening test runs first; only patients (or emails, or transactions) flagged by the first stage are passed to a more expensive, more specific second test; only those flagged by both go to a third. Each stage multiplies the previous odds by its own likelihood ratio, so two moderately informative tests in sequence can be far more conclusive than either alone.

Bayes in machine learning

Bayes' theorem is not just for diagnostic puzzles. It pervades machine learning. The most direct use is the Naive Bayes classifier, which assumes that the features of an input are conditionally independent given the class. To classify a chest X-ray as pneumonia or not, you write $$P(\text{pneumonia} \mid \mathbf{x}) \propto P(\text{pneumonia}) \prod_i P(x_i \mid \text{pneumonia}),$$ and likewise for the "no pneumonia" class. The class with the higher posterior wins. The independence assumption is almost always false in detail (features in real data are correlated), but Naive Bayes is nevertheless surprisingly accurate on text classification problems and was for years the standard method for spam filtering. Its speed, simplicity and small data requirements still make it a useful baseline.
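A minimal sketch of the idea on a spam-style problem. The per-word probabilities below are invented for illustration; in practice they would be estimated from labelled training data, usually with add-one smoothing.

```python
import math

# Toy Bernoulli Naive Bayes: each feature is "does this word appear in the email?".
p_spam = 0.4                                       # prior P(spam); illustrative
p_word_given_spam = {"free": 0.60, "meeting": 0.05, "prize": 0.45}
p_word_given_ham  = {"free": 0.08, "meeting": 0.30, "prize": 0.01}

def log_score(words_present, prior, p_word):
    # Log of (prior x product of per-word likelihoods); the shared evidence
    # term P(x) is omitted because it cancels when comparing classes.
    score = math.log(prior)
    for word, p in p_word.items():
        score += math.log(p if word in words_present else 1.0 - p)
    return score

email = {"free", "prize"}
spam_score = log_score(email, p_spam, p_word_given_spam)
ham_score  = log_score(email, 1 - p_spam, p_word_given_ham)
print("spam" if spam_score > ham_score else "ham")   # -> spam
```

Working in log space avoids numerical underflow when the product runs over thousands of vocabulary words.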

A more principled use is Bayesian linear regression. Place a Gaussian prior on the weights, $\mathbf{w} \sim \mathcal{N}(0, \alpha^{-1} \mathbf{I})$, and assume Gaussian noise on the targets, $y \mid \mathbf{x}, \mathbf{w} \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}, \beta^{-1})$. Bayes' theorem then gives a posterior over $\mathbf{w}$ that is itself Gaussian, with parameters that can be written in closed form. The point estimate that maximises this posterior turns out to be exactly ridge regression, which is one reason L2 regularisation can be motivated as a Gaussian prior on the weights. The full posterior, though, gives more than a single weight vector: it gives a distribution over weight vectors, which translates into calibrated uncertainty on predictions.
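The closed form is short enough to write out. The sketch below uses synthetic data and illustrative values of $\alpha$ and $\beta$; the posterior covariance and mean follow the standard formulas $S_N = (\alpha I + \beta \Phi^\top \Phi)^{-1}$ and $m_N = \beta S_N \Phi^\top \mathbf{y}$ for a design matrix $\Phi$.

```python
import numpy as np

# Bayesian linear regression with the setup from the text:
# prior w ~ N(0, alpha^-1 I), likelihood y ~ N(Phi w, beta^-1 I).
# Data is synthetic; alpha and beta are illustrative.

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
y = 2.0 * X[:, 0] + 0.5 + rng.normal(0, 0.2, size=20)   # true slope 2.0, bias 0.5

Phi = np.hstack([np.ones((20, 1)), X])    # design matrix with a bias column
alpha, beta = 1.0, 25.0                   # prior precision, noise precision

S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)   # posterior covariance
m_N = beta * S_N @ Phi.T @ y                                  # posterior mean

print("posterior mean weights:", m_N)     # close to [0.5, 2.0]
print("posterior covariance:\n", S_N)     # uncertainty that shrinks as data grows
```

The posterior mean $m_N$ is exactly the ridge-regression solution with regularisation strength $\alpha / \beta$, which is the sense in which L2 regularisation is a Gaussian prior in disguise.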

Bayesian neural networks apply the same idea to deep models. A prior is placed on the millions of weights, the likelihood comes from the network's output layer, and the posterior, over the entire weight space, is the object of interest. Unlike linear regression, this posterior is mathematically intractable, so it has to be approximated. The two main families of approximation are variational inference, which fits a tractable distribution by minimising a divergence to the true posterior, and Markov chain Monte Carlo, which samples from it. These approximate-inference techniques power active learning (where the model picks its own training examples), uncertainty quantification in safety-critical systems, and out-of-distribution detection. They are also the engine behind variational autoencoders, which we will meet in Chapter 14.
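A full Bayesian neural network is beyond a short snippet, but the sampling idea can be shown on a one-parameter toy problem. The sketch below is a random-walk Metropolis sampler for the bias of a coin, with a flat prior and illustrative data; it is the smallest honest example of MCMC drawing samples from a posterior, not a recipe for deep models.

```python
import numpy as np

# Toy MCMC: sample from the posterior over a coin's bias theta given
# 7 heads in 10 flips, under a flat prior on [0, 1].
rng = np.random.default_rng(1)
heads, flips = 7, 10

def log_posterior(theta):
    if not 0.0 < theta < 1.0:
        return -np.inf                       # zero prior probability outside [0, 1]
    return heads * np.log(theta) + (flips - heads) * np.log(1.0 - theta)

samples, theta = [], 0.5
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.1)    # random-walk proposal
    # Metropolis acceptance: accept with probability min(1, p(proposal)/p(current))
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(theta):
        theta = proposal
    samples.append(theta)

print(np.mean(samples[2_000:]))              # close to the exact posterior mean 8/12
```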

These ideas reach far beyond the classroom. Modern self-driving cars maintain what is essentially a giant Bayesian filter over the state of the world: the position and velocity of every other vehicle, pedestrian, and obstacle is a hypothesis, the readings from cameras, lidar, radar and the inertial-measurement unit are data, and at every cycle the on-board computer recomputes the posterior. The Kalman filter, the extended Kalman filter, and the particle filter, which we will meet in the chapter on robotics, are all special cases of repeated Bayesian updating. In natural-language processing, every probabilistic language model, including the large transformer models that have transformed the field, can be read as computing a posterior over the next token given the preceding context. In genomics, Bayesian methods reconstruct phylogenies and call variants. In cosmology, they fit the cosmic microwave background. The same one-line formula recurs in all of these settings because they share the same structure: prior knowledge, an observation model, and a need to update the first using the second.

Common pitfalls

Bayes' theorem is simple to state and easy to misapply. A handful of recurring mistakes account for most of the errors made by practitioners and journalists alike.

The most frequent is ignoring the prior. This is the same error as the base-rate fallacy illustrated by the screening example above. When the underlying event is rare, treating headline test accuracy as the posterior gives wildly inflated estimates and floods the system with false positives. The cure is to insist on knowing the prevalence before drawing any conclusion from a positive result.

A close cousin is conflating likelihood with posterior, sometimes called the prosecutor's fallacy. The probability of the data given the hypothesis, $P(D \mid H)$, is not the same as the probability of the hypothesis given the data, $P(H \mid D)$. A famous courtroom version: a forensic match has probability one in a million under innocence, therefore the suspect is guilty with probability $1 - 10^{-6}$. Wrong: that calculation ignores both the size of the population that could have generated a match and the prior probability of guilt. Bayes' theorem mechanically prevents this confusion if you actually use it.
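A rough calculation shows how far apart the two quantities can be. The numbers below are hypothetical: a one-in-a-million match probability under innocence and an assumed pool of a million people who could, in principle, have left the trace.

```python
# Prosecutor's fallacy, made concrete with hypothetical numbers.
p_match_given_innocent = 1e-6       # forensic match probability if innocent
prior_guilt = 1 / 1_000_000         # assumed: one culprit in a pool of a million
p_match_given_guilty = 1.0          # assume the true culprit always matches

p_match = p_match_given_guilty * prior_guilt \
          + p_match_given_innocent * (1 - prior_guilt)
p_guilt_given_match = p_match_given_guilty * prior_guilt / p_match

print(round(p_guilt_given_match, 3))   # about 0.5 -- nowhere near 1 - 1e-6
```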

A third pitfall is using a uniform prior when prior knowledge exists. Saying "I have no opinion, so I will use a flat prior" is itself a strong opinion: it asserts that every hypothesis is equally plausible before any data. Where genuine background information is available, using it is not bias; refusing to use it is.

A fourth is treating improper priors as posteriors. Priors are degrees of belief; they need not always be normalisable, and in some technical settings deliberately are not. Posteriors, by contrast, must be proper probability distributions. Reporting an unnormalised prior as if it were a result of inference is a category error.

What you should take away

  1. Bayes' theorem rearranges the definition of conditional probability to swap the direction of conditioning: $P(H \mid D) = P(D \mid H) P(H) / P(D)$.
  2. The four parts have names worth memorising: posterior on the left, likelihood and prior on the top right, evidence on the bottom right.
  3. Always include the prior. Ignoring it produces the base-rate fallacy, which makes accurate-sounding tests give the wrong answer for rare events.
  4. Bayes is iterative: posteriors become priors when new data arrives, and independent observations multiply their likelihoods together.
  5. From Naive Bayes through Bayesian linear regression to Bayesian deep learning, the same one-line formula underwrites a vast portion of modern probabilistic AI.
