- State the axioms of probability and use them to compute the probability of compound and conditional events
- Apply Bayes' theorem to update beliefs in light of new evidence
- Recognise the key discrete and continuous distributions used in AI (Bernoulli, binomial, normal, categorical)
- Compute the expectation and variance of random variables and explain their role in model training
- Measure information and uncertainty using entropy, cross-entropy, and KL divergence
A doctor reads a positive test result and tells the patient they probably have the disease. But the maths says otherwise — with a rare disease and an imperfect test, most positive results are false alarms. Getting this wrong can ruin lives. Probability theory is the tool that gets it right.
AI systems face the same challenge every day. A spam filter must decide if an email is junk. A self-driving car must judge whether a shape ahead is a pedestrian. A language model must pick the most likely next word. None of these systems have perfect information. Probability gives them a principled way to handle that uncertainty.
This chapter builds probability from the ground up. You will start with the basic rules, then learn Bayes' theorem — the single most important formula in applied AI. From there, you will meet the probability distributions that appear throughout machine learning, learn how expectation and variance summarise them, and finish with information theory — the framework behind the loss functions that train neural networks. For deeper coverage, see Bishop (2006), Murphy (2022), and MacKay (2003).
4.1 Probability Basics
Probability assigns a number to how likely an event is. The modern framework comes from Andrey Kolmogorov (Kolmogorov, 1933) and rests on three axioms:
- The probability of any event A is non-negative: P(A) ≥ 0.
- The probability of the entire sample space Ω is one: P(Ω) = 1.
- For mutually exclusive events A1, A2, …, the probability of their union equals the sum of their individual probabilities: P(A1 ∪ A2 ∪ …) = Σi P(Ai).
From these three rules, you can derive everything else. The complement rule says P(A^c^) = 1 − P(A). The inclusion–exclusion principle handles overlapping events. All of elementary probability flows from these axioms.
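You can verify these rules by brute-force enumeration over a small sample space. A minimal sketch, assuming two fair dice (the events A and B are illustrative choices):

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of rolling two fair dice.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) under a uniform distribution on the sample space."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] + w[1] == 7      # the dice sum to 7
B = lambda w: w[0] == 6             # the first die shows 6

# Axiom 2: the whole sample space has probability one.
assert prob(lambda w: True) == 1

# Complement rule: P(A^c) = 1 - P(A).
assert prob(lambda w: not A(w)) == 1 - prob(A)

# Inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B).
p_union = prob(lambda w: A(w) or B(w))
assert p_union == prob(A) + prob(B) - prob(lambda w: A(w) and B(w))
```

Using exact fractions rather than floats keeps the arithmetic honest: every identity holds exactly, not just to rounding error.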
Two Ways to Think About Probability
There are two competing interpretations of what these numbers mean.
The frequentist view ties probability to long-run frequency. Saying a fair coin has P(heads) = 0.5 means that over millions of tosses, roughly half will be heads. This works well for repeatable experiments. But it struggles with one-off events — what is the "long-run frequency" of a specific meteorite hitting Earth?
The Bayesian view treats probability as a degree of belief. P(A) = 0.5 means you are equally uncertain about A happening or not. This interpretation handles unique events naturally. It also provides a clear framework for updating beliefs as new evidence arrives. Most modern AI systems use Bayesian reasoning in some form.
Random Variables
A random variable bridges the gap between abstract events and the numbers that algorithms work with.
A discrete random variable X takes values from a countable set. Each value has a probability mass function (PMF): p(x) = P(X = x). Think of rolling a die — each face has probability 1/6.
A continuous random variable takes values from a range of real numbers. It is described by a probability density function (PDF) f(x). The probability of falling in an interval is P(a ≤ X ≤ b) = ∫a^b^ f(x) dx. Think of measuring someone's height — it can be any value within a range.
Both types share a cumulative distribution function (CDF): F(x) = P(X ≤ x). Random variables are everywhere in machine learning. Features, labels, model parameters, and loss values are all modelled this way.
Independence
Two events A and B are independent if P(A ∩ B) = P(A) × P(B). In plain terms: knowing that B happened tells you nothing new about A.
For random variables, independence means the joint distribution equals the product of the individual distributions. Many machine learning algorithms assume that training examples are independent and identically distributed (i.i.d.) — each drawn separately from the same unknown distribution. This assumption is rarely perfect, but it makes the maths tractable.
When independence fails — as in time series or social networks — you need more sophisticated tools. Markov chains and graphical models handle these cases.
Conditional Probability
Conditional probability appears in almost every AI algorithm. It answers the question: given that B happened, how likely is A?
The definition is: P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.
This simple ratio captures the core operation of learning from data. You start with a belief about A. You observe evidence B. You update to get a new belief P(A | B). Every supervised learning algorithm estimates some conditional distribution — the probability of a label given features, or the probability of the next word given context.
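The ratio definition is easy to check on a toy contingency table. The email counts below are made-up numbers purely for illustration:

```python
from fractions import Fraction

# Hypothetical counts from 100 emails: (label, does it contain "free"?).
counts = {("spam", "free"): 12, ("spam", "no_free"): 8,
          ("ham", "free"): 5, ("ham", "no_free"): 75}
total = sum(counts.values())

def P(pred):
    """Probability of the set of outcomes satisfying pred."""
    return Fraction(sum(c for k, c in counts.items() if pred(k)), total)

p_free = P(lambda k: k[1] == "free")                  # P(B)
p_spam_and_free = P(lambda k: k == ("spam", "free"))  # P(A and B)
p_spam_given_free = p_spam_and_free / p_free          # P(A | B)
```

Here observing the word "free" shifts the probability of spam from 20/100 marginally to 12/17 conditionally — exactly the belief update the definition describes.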
4.2 Bayes' Theorem
Bayes' theorem follows directly from the definition of conditional probability. For events A and B with P(B) > 0:
P(A | B) = P(B | A) × P(A) / P(B)
Each term has a name:
- Prior P(A): your belief before seeing data.
- Likelihood P(B | A): how probable the data is under hypothesis A.
- Evidence P(B): the total probability of the data across all hypotheses.
- Posterior P(A | B): your updated belief after seeing the data.
The theorem gives you a precise recipe for updating beliefs. It is optimal in a decision-theoretic sense.
The Medical Screening Example
Suppose a disease affects 1 in 1,000 people. A test has 99% sensitivity (catches 99% of true cases) and 95% specificity (correctly clears 95% of healthy people). You test positive. What is the probability you actually have the disease?
Bayes' theorem gives roughly 0.019 — less than 2%. This surprises most people. The reason is the low base rate. Out of 1,000 people tested, about 1 truly has the disease and 50 healthy people get false positives. The true case is buried among the false alarms.
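The calculation takes four lines. This sketch uses the numbers from the example above:

```python
prior = 0.001          # 1 in 1,000 people have the disease
sensitivity = 0.99     # P(positive | disease)
specificity = 0.95     # P(negative | healthy)

# Evidence: total probability of a positive test across both hypotheses.
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Posterior via Bayes' theorem.
posterior = sensitivity * prior / p_positive
print(round(posterior, 3))  # 0.019
```

Despite a 99% sensitive test, the posterior is under 2% because the false positives from the large healthy population dominate the evidence term.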
This is the base-rate fallacy. Ignoring it leads to badly calibrated predictions. Spam filters, fraud detectors, and medical AI tools must all account for it.
Bayesian Parameter Estimation
In machine learning, you apply Bayes' theorem to model parameters, not just discrete hypotheses. Given parameters θ and observed data D:
P(θ | D) = P(D | θ) × P(θ) / P(D)
The prior P(θ) encodes your assumptions — for example, that parameters should be small. The likelihood P(D | θ) measures how well the parameters explain the data. The posterior P(θ | D) combines both.
Full Bayesian inference computes the entire posterior distribution. This gives you uncertainty estimates, not just a single best guess. A Bayesian model can report credible intervals and flag inputs where it is unsure. This is especially valuable in safety-critical domains like autonomous driving and clinical decision support.
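For some prior–likelihood pairs the posterior has a closed form. A minimal sketch of Bayesian parameter estimation for a coin's bias, assuming a Beta prior (the prior strength and the data here are illustrative):

```python
# Conjugate Beta-Bernoulli update: prior Beta(a, b) plus k successes
# in n trials gives posterior Beta(a + k, b + n - k) exactly.
a, b = 2.0, 2.0            # weak prior centred on p = 0.5 (an assumption)
k, n = 7, 10               # hypothetical data: 7 heads in 10 flips

a_post, b_post = a + k, b + (n - k)

posterior_mean = a_post / (a_post + b_post)   # 9/14, about 0.643
mle = k / n                                   # 0.7: what the data alone says
prior_mean = a / (a + b)                      # 0.5: what the prior alone says

# The posterior mean sits between the prior mean and the MLE.
assert prior_mean < posterior_mean < mle
```

The full posterior Beta(9, 5) carries uncertainty as well as a point estimate; as n grows, the data term swamps the prior and the posterior mean approaches the MLE.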
Approximate Inference
Computing the posterior exactly requires solving P(D) = ∫ P(D | θ) P(θ) dθ. This integral often has no closed-form solution. Two families of methods handle this:
- MCMC (Markov Chain Monte Carlo): draws samples from the posterior. Methods like Metropolis–Hastings and Hamiltonian Monte Carlo are exact in the limit but computationally expensive.
- Variational inference: turns the problem into an optimisation task. You find the closest simple distribution (e.g., a factorised Gaussian) to the true posterior, measured by KL divergence. Variational autoencoders (Kingma, 2013) use exactly this approach.
Bishop (2006) and Murphy (2022) cover both methods in depth.
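A bare-bones Metropolis–Hastings sampler fits in a dozen lines. This sketch targets a known Gaussian so the answer can be checked; the target density, step size, and burn-in length are illustrative assumptions, not a recipe:

```python
import math
import random

random.seed(0)

def log_unnorm_posterior(theta):
    # Unnormalised log-density: a Gaussian centred at 2, standing in
    # for a posterior whose normalising constant P(D) is unknown.
    return -0.5 * (theta - 2.0) ** 2

def metropolis_hastings(n_samples, step=1.0):
    theta = 0.0
    samples = []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0.0, step)   # symmetric proposal
        # Accept with probability min(1, p(proposal) / p(theta)).
        # Only the RATIO is needed, so P(D) cancels out.
        if math.log(random.random()) < (log_unnorm_posterior(proposal)
                                        - log_unnorm_posterior(theta)):
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis_hastings(20000)
mean = sum(samples[5000:]) / len(samples[5000:])   # discard burn-in
```

The key point is that the acceptance ratio never requires the intractable evidence integral — that is why MCMC sidesteps it.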
Priors, Bias, and Variance
The choice of prior connects directly to the bias–variance trade-off (Chapter 5). A strong prior concentrates the posterior tightly. This reduces variance but risks bias if the prior is wrong. A weak prior lets the data dominate. This reduces bias but increases variance, especially with little data.
Here is a revealing connection: L2 regularisation (weight decay) is equivalent to the maximum a posteriori (MAP) estimate under a Gaussian prior on the parameters. Even "frequentist" regularisation has a Bayesian interpretation. This shows the unity of probabilistic thinking across different traditions.
4.3 Probability Distributions
A probability distribution describes how probability is spread across the possible values of a random variable. Choosing the right distribution family is one of the most important modelling decisions. It encodes your assumptions about the data.
Discrete Distributions
The Bernoulli distribution is the simplest. It models a single yes/no outcome with success probability p: P(X = 1) = p and P(X = 0) = 1 − p. Think of a single coin flip.
The Binomial distribution counts successes in n independent Bernoulli trials: P(X = k) = C(n, k) p^k^ (1 − p)^n − k^. Think of flipping a coin 100 times and counting heads.
The Categorical distribution extends the Bernoulli to K outcomes, each with its own probability. Classification networks use it (via softmax) to output a probability for each class.
The Multinomial distribution generalises the Categorical to multiple draws. It appears in topic models and bag-of-words representations.
The Poisson distribution models the number of events in a fixed time interval, parameterised by rate λ. It appears in anomaly detection and queueing models.
The Gaussian (Normal) Distribution
The Gaussian is arguably the most important distribution in all of science. A univariate Gaussian with mean μ and variance σ^2^ has density:
f(x) = (1 / √(2πσ^2^)) exp(−(x − μ)^2^ / (2σ^2^))
Why is it so common? The Central Limit Theorem (CLT) says that the average of many independent random variables tends toward a Gaussian, regardless of their original distributions. Many real-world measurements — heights, errors, short-term financial returns — arise from the sum of many small effects. So the Gaussian fits them well.
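You can watch the CLT's bookkeeping exactly, without simulation, by convolving a die's PMF with itself. A sketch using exact fractions:

```python
from fractions import Fraction

# Exact PMF of the sum of n fair dice, built by repeated convolution.
die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(p, q):
    """PMF of the sum of two independent discrete variables."""
    out = {}
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * qy
    return out

n = 10
total = die
for _ in range(n - 1):
    total = convolve(total, die)

mean = sum(x * p for x, p in total.items())
var = sum((x - mean) ** 2 * p for x, p in total.items())

# Mean and variance scale exactly as the CLT setup predicts:
assert mean == Fraction(7, 2) * n    # n times the mean of one die
assert var == Fraction(35, 12) * n   # n times the variance of one die
```

Plotting `total` for growing n shows the familiar bell shape emerging from a flat distribution — the CLT in action.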
In machine learning, Gaussian assumptions appear in linear regression, Gaussian processes, variational autoencoders, and diffusion-based generative models.
The Multivariate Gaussian
The multivariate Gaussian extends this to vectors. It is parameterised by a mean vector μ and a covariance matrix Σ. Its density involves the inverse and determinant of Σ, connecting directly to the linear algebra of Chapter 2.
The geometry is elegant. The contours of a multivariate Gaussian form an ellipsoid. Its axes are the eigenvectors of Σ. Its radii are proportional to the square roots of the eigenvalues. This is a concrete example of linear algebra and probability working together.
Gaussian mixture models (GMMs) combine several Gaussians to approximate complex shapes. They are the basis of classical clustering algorithms and generative classifiers.
Other Continuous Distributions
Several more distributions appear frequently in AI:
- Uniform: equal density across an interval. Often used as a non-informative prior.
- Exponential: models waiting times between events. It is memoryless — the probability of waiting another t seconds does not depend on how long you have already waited.
- Beta: defined on [0, 1]. It is the conjugate prior for the Bernoulli. Widely used in Bayesian A/B testing.
- Dirichlet: generalises the Beta to probability simplices. It is the prior in latent Dirichlet allocation (LDA) for topic modelling.
- Gamma and Inverse-Gamma: conjugate priors for Poisson rates and Gaussian variances.
Knowing these families helps you choose the right assumptions for your model — and recognise when those assumptions might be wrong.
The Exponential Family
Many of these distributions belong to a single unifying class: the exponential family. Its density has the form:
p(x | η) = h(x) exp(η^T^ T(x) − A(η))
Here η is the natural parameter, T(x) is the sufficient statistic, and A(η) is the log-partition function. The Bernoulli, Gaussian, Poisson, Exponential, Beta, Gamma, and Dirichlet are all members.
Exponential-family distributions have useful properties:
- They have conjugate priors, making Bayesian updates easy.
- Their maximum-likelihood estimates depend only on sufficient statistics.
- Generalised linear models (GLMs) pair an exponential-family response with a linear predictor via a link function.
This framework gives you a compact way to understand a large chunk of probabilistic modelling.
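As a concrete check, the Bernoulli can be rewritten in the exponential-family form above. A sketch with h(x) = 1, T(x) = x, natural parameter η = log(p / (1 − p)), and A(η) = log(1 + e^η^):

```python
import math

# Bernoulli as an exponential family: p(x | eta) = exp(eta * x - A(eta)).
p = 0.3
eta = math.log(p / (1 - p))          # natural parameter (the log-odds)
A = math.log(1 + math.exp(eta))      # log-partition function

# The exponential-family form reproduces the standard PMF exactly.
for x in (0, 1):
    standard = p if x == 1 else 1 - p
    exp_family = math.exp(eta * x - A)
    assert abs(standard - exp_family) < 1e-12

# A general exponential-family fact: the derivative of A recovers the
# mean of T(x). Here A'(eta) is the sigmoid, which maps eta back to p.
sigmoid = 1 / (1 + math.exp(-eta))
assert abs(sigmoid - p) < 1e-12
```

The log-odds parameterisation and the sigmoid link are exactly what logistic regression uses — a GLM with a Bernoulli response.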
4.4 Expectation & Variance
A full probability distribution contains complete information. But you often need a quick summary. The two most fundamental summaries are the expectation (mean) and the variance.
Expectation
The expectation of a discrete random variable is E[X] = Σx x · p(x). For a continuous variable, it is E[X] = ∫ x f(x) dx.
Think of the expectation as the "centre of mass" of the distribution. If you repeated an experiment forever and averaged the results, you would converge to E[X]. This is the law of large numbers.
In machine learning, loss functions are defined as expectations. The expected risk is E[L(Y, f(X))]. Training aims to minimise this quantity.
Linearity of Expectation
Expectation is a linear operator. For any random variables X and Y and constants a, b:
E[aX + bY] = a E[X] + b E[Y]
This holds whether or not X and Y are independent. You can break complex expressions into simple pieces. For example, the expected value of the sum of n variables, each with mean μ, is simply nμ. Linearity also explains why the sample mean is an unbiased estimator of the population mean — a cornerstone of statistics (Chapter 5).
Variance
Variance measures how spread out a distribution is around its mean:
Var(X) = E[(X − E[X])^2^] = E[X^2^] − (E[X])^2^
The second form is often easier to compute. Unlike expectation, variance is not linear: Var(aX + b) = a^2^ Var(X). For independent variables, variances add: Var(X + Y) = Var(X) + Var(Y).
The standard deviation σ = √Var(X) shares the units of X and is easier to interpret. In AI, high variance across training sets means the model is unstable and likely overfitting. Chapter 5 formalises this through the bias–variance decomposition.
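Both forms of the variance, and the scaling rule, can be verified exactly on a fair die:

```python
from fractions import Fraction

# PMF of a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

E = sum(x * p for x, p in pmf.items())           # expectation: 7/2
E_sq = sum(x ** 2 * p for x, p in pmf.items())   # E[X^2]: 91/6
var = E_sq - E ** 2                              # shortcut form: 35/12

# The definitional form E[(X - E[X])^2] agrees with the shortcut.
assert var == sum((x - E) ** 2 * p for x, p in pmf.items())

# Var(aX + b) = a^2 Var(X): the shift b drops out entirely.
a, b = 3, 5
pmf_scaled = {a * x + b: p for x, p in pmf.items()}
E2 = sum(x * p for x, p in pmf_scaled.items())
var_scaled = sum((x - E2) ** 2 * p for x, p in pmf_scaled.items())
assert var_scaled == a ** 2 * var
```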
Higher-Order Moments
Beyond mean and variance, other summaries capture finer details:
- Skewness (third standardised moment): measures asymmetry. Positive skew means a long right tail.
- Kurtosis (fourth standardised moment): measures tail heaviness relative to the Gaussian.
- Moment-generating function (MGF): M(t) = E[e^tX^]. When it exists, it encodes all moments and uniquely identifies the distribution.
These matter in practice. Heavy-tailed gradient noise in stochastic gradient descent can slow training. Gradient clipping and adaptive optimisers address this problem.
Covariance and Correlation
The covariance between two variables measures their linear association: Cov(X, Y) = E[(X − E[X])(Y − E[Y])]. The normalised version is the Pearson correlation ρ = Cov(X, Y) / (σX σY), which lies in [−1, 1] and is scale-invariant.
For a random vector X = (X1, …, Xd)^T^, the covariance matrix Σ has entries Σij = Cov(Xi, Xj). It is symmetric and positive semi-definite. Principal component analysis (PCA) works by eigendecomposing this matrix to find directions of maximal variance — a direct application of the spectral theory from Chapter 2.
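For two dimensions the eigendecomposition behind PCA has a closed form, so the whole pipeline fits in a short sketch (the data points are hypothetical):

```python
import math

# Toy 2D dataset with strongly correlated coordinates.
data = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.1), (1.5, 1.4), (-3.0, -2.9)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Entries of the 2x2 covariance matrix.
sxx = sum((x - mx) ** 2 for x, _ in data) / n
syy = sum((y - my) ** 2 for _, y in data) / n
sxy = sum((x - mx) * (y - my) for x, y in data) / n

# Closed-form eigenvalues of a symmetric 2x2 matrix via trace/determinant.
tr = sxx + syy
det = sxx * syy - sxy ** 2
disc = math.sqrt((tr / 2) ** 2 - det)
lam1, lam2 = tr / 2 + disc, tr / 2 - disc   # lam1 >= lam2

# Positive semi-definiteness, and eigenvalues summing to the trace
# (i.e. the total variance is preserved under rotation).
assert lam2 >= -1e-12
assert abs(lam1 + lam2 - tr) < 1e-9
```

Here lam1 vastly exceeds lam2, so almost all the variance lies along the first principal component — exactly what PCA exploits for dimensionality reduction.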
Laws of Total Expectation and Variance
Two powerful tools let you compute moments by conditioning:
- Law of total expectation: E[X] = E[E[X | Y]]
- Law of total variance: Var(X) = E[Var(X | Y)] + Var(E[X | Y])
The variance law splits total variance into two parts. The first is the average "within-group" spread. The second is the "between-group" variation.
These laws are essential for hierarchical and mixture models, where data is generated in two stages — first draw a latent variable, then draw the observation conditionally. The variational autoencoder's evidence lower bound (ELBO) is derived using exactly these identities.
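The law of total variance can be checked exactly on a small two-stage model. This sketch draws a group Y first, then X conditionally (the distributions are illustrative):

```python
from fractions import Fraction

# Two-stage model: pick a group Y, then draw X from that group's PMF.
p_y = {0: Fraction(1, 2), 1: Fraction(1, 2)}
cond_pmf = {0: {0: Fraction(1, 2), 2: Fraction(1, 2)},   # group 0: mean 1
            1: {4: Fraction(1, 2), 6: Fraction(1, 2)}}   # group 1: mean 5

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

# Marginal distribution of X: sum out Y.
marginal = {}
for y, py in p_y.items():
    for x, px in cond_pmf[y].items():
        marginal[x] = marginal.get(x, Fraction(0)) + py * px

within = sum(p_y[y] * var(cond_pmf[y]) for y in p_y)    # E[Var(X | Y)]
cond_means = {y: mean(cond_pmf[y]) for y in p_y}
between = var({cond_means[y]: p_y[y] for y in p_y})     # Var(E[X | Y])

assert var(marginal) == within + between
```

The within-group spread contributes 1 and the between-group separation contributes 4, so the marginal variance is 5 — most of the uncertainty comes from not knowing which group you are in.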
4.5 Joint & Conditional Probability
Most AI problems involve multiple random variables — features, labels, latent states, and parameters. Understanding how they relate requires joint and conditional distributions.
Joint Distributions
The joint distribution of two discrete variables X and Y is specified by p(x, y) = P(X = x, Y = y) for all pairs. For continuous variables, the joint PDF f(x, y) gives P((X, Y) ∈ A) = ∫∫A f(x, y) dx dy.
The joint distribution contains everything about the relationship between the variables: their individual behaviours, their dependencies, and how one behaves given the other.
Marginalisation
You recover the distribution of a single variable by summing (or integrating) out the others:
- Discrete: p(x) = Σy p(x, y)
- Continuous: f(x) = ∫ f(x, y) dy
This is called marginalisation. It is one of the two fundamental operations of probabilistic inference (the other is conditioning).
In graphical models, marginalisation means eliminating variables. Algorithms like belief propagation and the junction-tree algorithm exploit graph structure to do this efficiently. In Bayesian model selection, the marginal likelihood P(D) = ∫ P(D | θ) P(θ) dθ penalises overly complex models that spread their probability mass too thinly.
The Chain Rule
The conditional distribution of Y given X = x is: p(y | x) = p(x, y) / p(x). Rearranging gives the chain rule of probability:
p(x, y) = p(y | x) p(x)
This extends to any number of variables:
p(x1, …, xn) = p(x1) ∏i=2^n^ p(xi | x1, …, xi−1)
Modern large language models use this directly. They factorise the joint distribution of a token sequence into a product of conditionals. Each conditional is modelled by a neural network. The chain rule provides the theoretical foundation for the transformer's left-to-right text generation.
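A toy bigram model makes the factorisation concrete. This is a Markov special case of the chain rule, where each conditional depends only on the previous token; the probability table is entirely made up:

```python
import math

# Hypothetical bigram language model: p(next | current).
start = {"the": 0.6, "a": 0.4}
bigram = {"the": {"cat": 0.5, "dog": 0.5},
          "a":   {"cat": 0.3, "dog": 0.7},
          "cat": {"sat": 1.0},
          "dog": {"sat": 1.0}}

def sequence_log_prob(tokens):
    """Chain rule: log p(x1..xn) = log p(x1) + sum_i log p(xi | x_{i-1})."""
    logp = math.log(start[tokens[0]])
    for prev, cur in zip(tokens, tokens[1:]):
        logp += math.log(bigram[prev][cur])
    return logp

logp = sequence_log_prob(["the", "cat", "sat"])
# p = 0.6 * 0.5 * 1.0 = 0.3
assert abs(math.exp(logp) - 0.3) < 1e-12
```

A transformer replaces the lookup table with a neural network conditioned on the entire prefix, but the log-probability of a sequence is computed by exactly this sum of conditional log-probabilities.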
Conditional Independence
Variables X and Y are conditionally independent given Z, written X ⊥ Y | Z, if:
p(x, y | z) = p(x | z) p(y | z) for all x, y, z
Conditional independence does not imply marginal independence, nor vice versa.
A naïve Bayes classifier assumes all features are conditionally independent given the class label. This cuts the number of parameters from exponential to linear in the number of features. Despite this strong simplification, naïve Bayes works remarkably well for text classification and spam filtering. Useful predictions do not always require perfectly accurate models.
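A minimal naïve Bayes sketch shows the assumption at work. The priors and per-word likelihoods below are hypothetical numbers, not trained values:

```python
import math

# Tiny naive Bayes spam filter with hypothetical word likelihoods.
prior = {"spam": 0.4, "ham": 0.6}
word_prob = {"spam": {"free": 0.8, "meeting": 0.1},
             "ham":  {"free": 0.2, "meeting": 0.7}}

def posterior(words):
    # Joint score per class: prior times a PRODUCT of word likelihoods.
    # The product is exactly the conditional-independence assumption.
    score = {c: prior[c] * math.prod(word_prob[c][w] for w in words)
             for c in prior}
    z = sum(score.values())              # evidence: normalising constant
    return {c: s / z for c, s in score.items()}

post = posterior(["free"])
# 0.4 * 0.8 = 0.32 vs 0.6 * 0.2 = 0.12, so P(spam | "free") = 0.32 / 0.44
assert abs(post["spam"] - 0.32 / 0.44) < 1e-12
```

With K features each taking V values, the full joint would need on the order of V^K^ parameters per class; the factorised form needs only K × V.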
Graphical Models
Graphical models encode conditional independence assumptions visually.
In a Bayesian network (directed graph), each node is a random variable. Edges encode direct dependencies. Missing edges imply conditional independence (formalised by d-separation). The joint factorises as:
p(x1, …, xn) = ∏i p(xi | parents(xi))
Hidden Markov models, Kalman filters, and variational autoencoders are all examples.
In an undirected model (Markov random field), the joint is specified by potential functions over cliques, up to a normalising constant. Conditional independences follow from graph separation. Conditional random fields, widely used in sequence labelling for NLP, are a prominent example.
Three Operations on Joints
Many AI algorithms perform one of three basic operations on a joint distribution:
- Conditioning: incorporating observed evidence
- Marginalisation: summing out nuisance variables
- Maximisation: finding the most probable configuration
The Expectation–Maximisation (EM) algorithm uses all three. The E-step conditions on latent variables given current parameters. The M-step maximises the expected complete-data log-likelihood. Fluency with joint and conditional distributions is essential for designing and understanding learning algorithms.
4.6 Information Theory
Claude Shannon founded information theory in 1948 (Shannon, 1948). It provides a mathematical framework for quantifying information, uncertainty, and communication cost. MacKay (2003) gives an especially clear treatment of its connections to machine learning.
The core concepts — entropy, cross-entropy, and divergence — have become essential in AI. They underpin loss functions, model comparison, and representation learning. The deep link to probability is natural: both fields ask how to reason about uncertain quantities.
Entropy
The entropy of a discrete random variable X with PMF p is:
H(X) = −Σx p(x) log p(x)
The logarithm is usually base 2 (bits) or base e (nats). Entropy measures the average "surprise" in a distribution.
- A uniform distribution over K outcomes has entropy log K — maximum uncertainty.
- A distribution concentrated on one outcome has entropy zero — no uncertainty.
For continuous variables, differential entropy h(X) = −∫ f(x) log f(x) dx plays an analogous role. It can be negative and lacks some properties of discrete entropy. Among all continuous distributions with a given mean and variance, the Gaussian has the maximum differential entropy. This gives another justification for the Gaussian assumption in maximum-entropy modelling.
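Both boundary cases of discrete entropy are easy to verify directly:

```python
import math

def entropy(pmf, base=2.0):
    """Shannon entropy in bits; 0 * log 0 is treated as 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

K = 8
uniform = [1.0 / K] * K
assert abs(entropy(uniform) - math.log2(K)) < 1e-12   # log K bits: maximal

certain = [1.0, 0.0, 0.0]
assert entropy(certain) == 0.0                        # no uncertainty

biased_coin = [0.9, 0.1]
assert 0.0 < entropy(biased_coin) < 1.0               # between the extremes
```

A fair coin carries exactly one bit per flip; the 90/10 coin carries about 0.47 bits, because its outcomes are mostly predictable.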
Cross-Entropy
The cross-entropy between a true distribution p and a model distribution q is:
H(p, q) = −Σx p(x) log q(x)
It measures the average number of bits needed to encode data from p using a code optimised for q. In machine learning, you set p to the empirical label distribution and q to the model's predictions. Minimising cross-entropy is the same as maximising log-likelihood.
The softmax cross-entropy loss used to train neural network classifiers is exactly this quantity, averaged over the training set.
KL Divergence
The Kullback–Leibler (KL) divergence measures the extra cost of using distribution q when the true distribution is p:
DKL(p || q) = Σx p(x) log(p(x) / q(x))
Key properties:
- Always non-negative (Gibbs' inequality).
- Equals zero if and only if p = q.
- Not a true distance metric — it is asymmetric and does not satisfy the triangle inequality.
The relationship H(p, q) = H(p) + DKL(p || q) shows that minimising cross-entropy is the same as minimising KL divergence, since H(p) is constant with respect to the model.
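This decomposition, along with Gibbs' inequality and the asymmetry of KL, can be checked numerically on any pair of distributions (the ones below are arbitrary):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model distribution

# Gibbs' inequality: KL is non-negative.
assert kl(p, q) >= 0
# The decomposition H(p, q) = H(p) + KL(p || q).
assert abs(cross_entropy(p, q) - (entropy(p) + kl(p, q))) < 1e-12
# Asymmetry: KL is not a distance metric.
assert abs(kl(p, q) - kl(q, p)) > 1e-6
```

Since H(p) does not depend on the model, a gradient step that lowers cross-entropy lowers KL by exactly the same amount — which is why the two objectives are interchangeable in training.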
KL divergence is the foundational measure of distributional discrepancy. It appears in variational inference, generative adversarial training, and policy optimisation in reinforcement learning.
Mutual Information
Mutual information measures how much knowing Y reduces your uncertainty about X:
I(X; Y) = DKL(p(x, y) || p(x) p(y)) = H(X) − H(X | Y)
Key properties:
- Captures all forms of statistical dependence, not just linear relationships (unlike correlation).
- Symmetric: I(X; Y) = I(Y; X).
- Non-negative. Zero if and only if X and Y are independent.
Mutual information appears in feature selection (picking features most informative about the target), the information bottleneck method for representation learning, and analysis of deep learning representations. Estimating it in high dimensions remains an open challenge.
The data-processing inequality states that post-processing cannot increase mutual information. This places a fundamental limit on what any learning algorithm can extract from data.
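For a small joint table, both formulas for mutual information can be computed and compared directly. The joint PMF here is a hypothetical pair of dependent binary variables:

```python
import math

# Joint PMF of two binary variables that tend to agree.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# Marginals by summing out the other variable.
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

# Definition 1: I(X; Y) = KL(joint || product of marginals), in bits.
mi = sum(p * math.log2(p / (px[x] * py[y]))
         for (x, y), p in joint.items() if p > 0)

# Definition 2: I(X; Y) = H(X) - H(X | Y).
hx = -sum(p * math.log2(p) for p in px.values())
hx_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items())

assert abs(mi - (hx - hx_given_y)) < 1e-12   # the two definitions agree
assert mi > 0                                # dependent, so I > 0
```

Knowing Y cuts the uncertainty about X from 1 bit to about 0.72 bits, so the two variables share roughly 0.28 bits of information.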
Information Geometry
Information theory also connects to the geometry of probability distributions. The Fisher information matrix — whose entries are the negative expected second derivatives of the log-likelihood — defines a Riemannian metric on the space of distributions.
Natural gradient descent pre-multiplies the ordinary gradient by the inverse Fisher information. This follows the steepest-descent direction in information-geometric terms. It has been shown to improve convergence in training neural networks and in policy-gradient methods for reinforcement learning.
The connections between entropy, divergence, and geometry show that information theory is not just a collection of useful formulas. It is a coherent lens for understanding the entire enterprise of learning from data.