4.1 Why probability for AI
Modern AI systems are saturated with uncertainty. A spam classifier rarely returns a hard "yes" or "no"; it returns a probability, say 0.87, that captures both the weight of the evidence and the ambiguity that remains. A self-driving car receives sensor readings blurred by rain on the lens and shadows on the road, and must still decide whether the smudge ahead is a child or a bin bag. A large language model, asked to finish the sentence "the cat sat on the…", does not pick a single word; it produces a probability distribution over the entire vocabulary, with "mat" near the top, "floor" and "roof" close behind, and "parliament" vanishingly small but not impossible.
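To make the last example concrete, here is a minimal sketch of how raw model scores become a probability distribution over a toy five-word vocabulary. Both the words and the numbers are invented for illustration; the softmax step simply turns arbitrary scores into non-negative values that sum to one.

```python
import numpy as np

# Hypothetical next-token scores (logits) for a toy five-word vocabulary.
# Both the words and the numbers are invented for illustration.
vocab = ["mat", "floor", "roof", "sofa", "parliament"]
logits = np.array([4.0, 3.2, 2.9, 1.5, -4.0])

# Softmax: exponentiate, then normalise so the values sum to one.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"{word:>10}: {p:.4f}")
# "mat" comes out on top; "parliament" is tiny but strictly positive.
```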
In every one of these examples the system has to commit to numbers that express how confident it is. Probability is the mathematical language we use to write those numbers down, to combine them sensibly, and to act on them. Without probability, an AI system can guess, but it cannot say how much it is guessing, nor can it tell us when its guess deserves trust.
This chapter sits at a crucial junction. Chapter 2 gave us linear algebra, the language of vectors, matrices and the geometry of high-dimensional spaces. Chapter 3 gave us calculus, the language of change, gradients and optimisation. Together those two tools let us describe models and fit them to data. Chapter 4 adds probability, the language of uncertainty, which lets us quantify what we don't know. Chapter 5 then builds statistical inference on top of probability: how to learn from data that are noisy, finite and never quite enough.
Probability has few rules, and the intuitions arrive early. By the end of this section you will have seen what you need to start reading the rest of the chapter: three axioms, a handful of definitions, and a clear sense of why these objects appear in AI at all.
Why probability matters in AI
There are three closely related reasons why probability is unavoidable in modern AI, and it is worth taking each one slowly.
1. Uncertain inputs. The world rarely arrives in tidy form. A microphone picks up not just the speaker's voice but the hum of the air conditioning, the rustle of papers and the cough at the back of the room. A camera sensor reads photon counts that fluctuate even when the scene is perfectly still. A health record is missing the patient's blood pressure on the third visit because the cuff was broken. A piece of text is ambiguous because the writer was tired. In every one of these situations, the input the AI sees is a noisy, incomplete shadow of the underlying truth. Probability lets us say, "given what I have observed, here is a distribution over what the truth might actually be", and that distribution, not a single guess, is the faithful summary of the input.
2. Uncertain models. Even if the inputs were perfectly clean, we would still be uncertain about how the world works. A model trained on a million chest X-rays has not seen every possible patient, every possible scanner setting, every possible disease presentation. The parameters of the model are themselves estimates, fitted from a finite sample, and a different sample would have produced slightly different parameters. Probability lets us reason about this epistemic uncertainty (uncertainty about the model itself) and distinguish it from the aleatoric uncertainty that arises from genuine randomness in the world. Bayesian inference, which we shall preview in §4.12 and develop more fully in Chapter 5, is precisely the discipline of updating our beliefs about a model in light of new data.
3. Uncertain outputs. Many of the most important applications of AI ask not just for a guess but for a calibrated confidence. Medical diagnosis is the textbook case: a model that says "pneumonia, definitely" and is wrong one time in ten is far more dangerous than a model that says "pneumonia, with probability 0.9" and is wrong one time in ten, because the second model's reported confidence matches its actual error rate, and the doctor can act accordingly. The same is true of fraud detection, where an action threshold has to be tuned against the cost of a false alarm; of autonomous driving, where braking decisions weigh certainty of obstacle against the risk of being rear-ended; and of search engines, where relevance scores rank rather than classify.
A worked example pins this down. Suppose a clinical decision-support tool is shown a chest X-ray and outputs $P(\text{pneumonia} \mid \text{X-ray}) = 0.30$. A blunt yes/no system would have to choose a threshold and report either "no pneumonia" (and be wrong 30 per cent of the time on this kind of case) or "pneumonia" (and be wrong 70 per cent of the time). The probabilistic output, by contrast, lets the doctor combine the model's evidence with the cost of missing pneumonia (a few days of avoidable illness, occasionally death) and the cost of treating it unnecessarily (a course of antibiotics, a small risk of side effects). If treatment is cheap and missing the disease is catastrophic, 0.30 is more than enough to act on. If treatment is expensive and the patient is otherwise well, 0.30 might justify watchful waiting. The number itself drives the decision; collapsing it to a binary throws that information away.
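The sketch below makes that cost calculation explicit. The two cost figures are assumptions chosen purely to illustrate the arithmetic, not clinical estimates.

```python
# A minimal sketch of the expected-cost reasoning above.
# The cost figures are illustrative assumptions, not clinical values.
p_pneumonia = 0.30

cost_missed_case = 100.0       # assumed cost of leaving a real pneumonia untreated
cost_unneeded_treatment = 5.0  # assumed cost of treating a patient who is well

# Expected cost of each action, averaged over the model's uncertainty.
expected_cost_wait = p_pneumonia * cost_missed_case                # 30.0
expected_cost_treat = (1 - p_pneumonia) * cost_unneeded_treatment  # 3.5

decision = "treat" if expected_cost_treat < expected_cost_wait else "wait"
print(f"wait:  expected cost {expected_cost_wait:.1f}")
print(f"treat: expected cost {expected_cost_treat:.1f}")
print(f"decision: {decision}")
```

With these invented costs, treating is clearly the cheaper action at probability 0.30; raise the cost of unnecessary treatment or lower the cost of a missed case, and the same 0.30 can point the other way. A hard yes/no output would have discarded exactly the number this calculation needs.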
The same logic applies far outside medicine. In a fraud-detection pipeline, the model's output probability is multiplied by the value of the transaction to estimate expected loss, and the threshold for blocking a card is tuned to balance customer inconvenience against bank exposure. In a self-driving stack, the perception module emits a probability over object classes and a covariance over object positions, and the planner integrates those distributions over possible futures to decide whether to brake, steer or coast. In a recommender system, ranked probabilities decide which item heads the list. Whenever an AI system is plugged into a real decision, what flows down the wire is rarely a single answer: it is a distribution.
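The fraud case can be sketched the same way. The probabilities, transaction values and the £20 blocking threshold below are all invented, not figures from any real pipeline.

```python
# Sketch of the fraud example: expected loss = P(fraud) x transaction value.
# Probabilities, values and the blocking threshold are all invented.
transactions = [
    (0.02, 15.00),     # (model's P(fraud), transaction value in pounds)
    (0.40, 30.00),
    (0.05, 2500.00),
]
block_threshold = 20.00  # assumed: block when expected loss exceeds £20

for p_fraud, value in transactions:
    expected_loss = p_fraud * value
    action = "block" if expected_loss > block_threshold else "allow"
    print(f"P(fraud)={p_fraud:.2f}  value=£{value:8.2f}  "
          f"expected loss=£{expected_loss:7.2f}  -> {action}")
```

Note that the third transaction is blocked despite its low fraud probability: a small probability attached to a large amount can carry more expected loss than a high probability attached to a small one.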
Two interpretations of probability
There is a long-running, genteel argument among statisticians about what a probability actually is. Two camps dominate.
The frequentist view. A frequentist treats $P(A)$ as the long-run fraction of trials in which event $A$ occurs. The probability of a coin coming up heads is 0.5 because, if we tossed it forever, half the tosses would be heads. The probability of a particular treatment working is the proportion of patients in the population for whom it works. On this view, probabilities are objective features of the world, measurable in principle by repetition, and a probability simply does not exist for events that cannot in principle be repeated.
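A quick simulation makes the frequentist reading tangible: the fraction of heads in a growing run of simulated fair-coin tosses settles towards 0.5. This is only a sketch; the law of large numbers behind the convergence is the subject of §4.9.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a long run of fair-coin tosses and track the running fraction
# of heads; the frequentist probability is the limit of this fraction.
tosses = rng.random(100_000) < 0.5
running_fraction = np.cumsum(tosses) / np.arange(1, tosses.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7,} tosses: fraction of heads = {running_fraction[n - 1]:.4f}")
```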
The Bayesian view. A Bayesian treats $P(A)$ as a degree of belief in $A$, given everything one currently knows. On this view, probability is a property of the reasoner, not of the world; two people with different background knowledge can rationally hold different probabilities for the same event, and a probability can be assigned to one-off events such as "the climate sensitivity of doubled CO$_2$ is greater than 3°C" or "this particular patient has pneumonia." Beliefs are updated by Bayes' theorem as new evidence arrives.
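Here is a small sketch of that updating for a single yes/no question, using invented numbers; Bayes' theorem itself is developed properly in §4.3.

```python
# A sketch of belief updating for a single yes/no question.
# All numbers are illustrative assumptions, not clinical values.
prior = 0.05                  # degree of belief in pneumonia before the X-ray
p_abnormal_if_disease = 0.90  # assumed P(abnormal X-ray | pneumonia)
p_abnormal_if_healthy = 0.10  # assumed P(abnormal X-ray | no pneumonia)

# Bayes' theorem: posterior = likelihood * prior / evidence.
evidence = (p_abnormal_if_disease * prior
            + p_abnormal_if_healthy * (1 - prior))
posterior = p_abnormal_if_disease * prior / evidence

print(f"belief before the X-ray:        {prior:.3f}")
print(f"belief after an abnormal X-ray: {posterior:.3f}")  # ~0.321
```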
The good news is that the two camps agree about the axioms of probability, the bookkeeping rules for combining and manipulating probabilities, so the calculations are identical. They disagree about what the numbers mean and about which problems probability is allowed to address. Modern AI is happily ecumenical and uses both. Frequentist ideas dominate empirical risk minimisation, hypothesis testing and cross-validation: most loss functions in deep learning are, in disguise, frequentist objects. Bayesian ideas underlie variational inference, Gaussian processes, Markov chain Monte Carlo, posterior predictive checks, and any system that needs to express calibrated uncertainty about its own parameters. A practising machine-learning engineer needs both vocabularies.
Beginners sometimes worry that they have to declare allegiance to one camp or the other before they can start computing. They do not. For most of this chapter the distinction will not matter: the same formulae apply whichever interpretation you prefer. Where the choice does affect the answer, for example, when constructing a confidence interval versus a credible interval, or when reasoning about a one-off event such as the outcome of a particular election, we shall flag it explicitly.
The three axioms (Kolmogorov 1933)
In 1933 the Russian mathematician Andrey Kolmogorov set out three short axioms from which the entire edifice of probability follows. Let $\Omega$ denote the sample space, the set of all possible outcomes of an experiment, and let events be subsets of $\Omega$.
- Non-negativity. For any event $A$, $P(A) \ge 0$. Probabilities cannot be negative.
- Normalisation. $P(\Omega) = 1$. Something in the sample space must happen.
- Additivity. If $A$ and $B$ are disjoint (they cannot both occur), then $P(A \cup B) = P(A) + P(B)$. (More generally, for a countable collection of pairwise-disjoint events, the probability of their union is the sum of their probabilities.)
That is the lot. Everything else in elementary probability is a consequence.
The complement rule $P(A^c) = 1 - P(A)$ follows because $A$ and its complement $A^c$ are disjoint and together make up $\Omega$, so by axioms 2 and 3, $P(A) + P(A^c) = P(\Omega) = 1$. The general union rule $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ comes from decomposing $A \cup B$ into three disjoint pieces (the part of $A$ outside $B$, the part of $B$ outside $A$, and the overlap $A \cap B$) and applying axiom 3 carefully so that the overlap is not counted twice. Bounds such as $P(A) \le 1$ follow because $A \subseteq \Omega$.
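These identities are easy to check numerically on a small sample space. The sketch below takes a fair six-sided die, where every outcome has probability 1/6, and verifies the complement and union rules exactly using rational arithmetic.

```python
from fractions import Fraction

# A fair six-sided die: every outcome has probability 1/6, so the
# probability of an event is just |event| / |omega|.
omega = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event), len(omega))

A = {2, 4, 6}   # "the roll is even"
B = {4, 5, 6}   # "the roll is at least four"

# Complement rule: P(A^c) = 1 - P(A)
assert P(omega - A) == 1 - P(A)

# General union rule: P(A or B) = P(A) + P(B) - P(A and B)
assert P(A | B) == P(A) + P(B) - P(A & B)

print("P(A) =", P(A), "  P(B) =", P(B), "  P(A or B) =", P(A | B))
```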
Three axioms, a few lines of derivation, and we already have most of what is needed to reason consistently about uncertainty. No other rules are required: any system of plausible reasoning that obeys a few intuitive postulates is provably equivalent to probability theory. Reason inconsistently, or reason with probability: there is no third option.
Three short statements — a probability is non-negative, the total probability is one, probabilities of disjoint events add — are enough to underwrite every calculation in this chapter, every loss function in chapters 6 to 11, and every uncertainty quantification later in the book.
What this chapter covers
The remainder of the chapter develops the machinery in stages.
- §4.2 introduces the basic vocabulary of probability: sample spaces, events, and Kolmogorov's axioms.
- §4.3 explores Bayes' theorem in depth, the engine of belief updating.
- §4.4 defines random variables and their probability mass and density functions.
- §4.5 introduces the most important named distributions: Bernoulli, binomial, categorical, Poisson, Gaussian, exponential, beta and Dirichlet.
- §4.6 develops joint, marginal and conditional distributions, the way several random variables interact.
- §4.7 introduces expectation, variance and covariance, the summary statistics that compress distributions to a few interpretable numbers.
- §4.8 covers concentration inequalities (Markov, Chebyshev, Hoeffding) that bound how far a random variable can stray.
- §4.9 turns to limit theorems, the law of large numbers and the central limit theorem, that explain why averaging works.
- §4.10 develops the multivariate Gaussian, the workhorse of high-dimensional modelling.
- §4.11 introduces information theory: entropy, cross-entropy, KL divergence and mutual information, which underlie most of the loss functions used in deep learning.
- §4.12 previews maximum-likelihood and Bayesian inference, the bridge to chapter 5.
- §4.13 considers sampling: how to draw from distributions when closed-form integrals fail.
- §4.14 puts it all together in Python, and §4.15 walks through a worked mini-project on calibrating a spam filter.
What you should take away
- AI is the science of acting under uncertainty, and probability is the only consistent language for handling that uncertainty.
- Three sources of uncertainty (noisy inputs, imperfect models, and the need for calibrated outputs) are why probability cannot be avoided in any serious AI system.
- Frequentist and Bayesian interpretations give probabilities different meanings but identical rules; modern machine learning uses both fluently.
- Kolmogorov's three axioms (non-negativity, normalisation and additivity) are the foundation from which the complement rule, the union rule and every other elementary identity follow.
- This chapter is the bridge from the deterministic mathematics of chapters 2 and 3 to the statistical inference of chapter 5: master the vocabulary here and the rest of the book unfolds naturally.