4.6 Joint, marginal and conditional distributions

So far, we have considered single random variables: one coin, one die, one measurement at a time. Real AI problems are almost never that tidy. A self-driving car looks at hundreds of pixels, a velocity, and a steering angle all at once. A spam filter weighs many words. A medical model considers age, blood pressure, and a dozen lab values together. Whenever we have more than one quantity to reason about, we need a way to talk about how they behave as a group and also how they behave individually. That is what this section is for.

The plan is straightforward. We will start with joint distributions, which describe a whole collection of variables at once. We will then learn two operations that let us simplify a joint into something smaller. Marginalising throws information away on purpose: we forget some variables and keep the distribution of the rest. Conditioning does the opposite of forgetting: we fix the value of some variables and ask what the others look like in that restricted world. Independence and the chain rule fall out of these two ideas. We finish with a small spam classifier that puts everything together.

This section keeps the PMF/PDF machinery from §4.4 and lets it act on several variables at once. The notation looks busier, but nothing genuinely new is being asked of you.

Symbols used here

  • $\mathbf{X} = (X_1, \ldots, X_d)$: random vector (a list of variables)
  • $P(X_1 = x_1, X_2 = x_2)$: joint PMF in the discrete case
  • $p(\mathbf{x})$: joint PDF in the continuous case
  • $p(x_1)$, or $\int p(\mathbf{x}) \, dx_2 \cdots dx_d$: marginal of $X_1$
  • $p(x_1 \mid x_2)$: conditional of $X_1$ given $X_2$

Joint distributions

A joint distribution describes the chances of several variables taking particular values together. In the discrete case, we list a probability $P(X = x, Y = y)$ for every pair $(x, y)$ that the variables can take. The list is non-negative everywhere and sums to one over all pairs. That is the only requirement. You can think of the joint PMF as a table: rows for the values of $X$, columns for the values of $Y$, and a number in each cell.

For continuous variables the picture is the same but with an integral sign instead of a sum. Now we have a joint density $p(x, y)$, which is non-negative and integrates to one over the whole plane. Probabilities of regions are recovered by integrating: the probability that $(X, Y)$ falls in a region $A$ is $$ P((X, Y) \in A) = \iint_A p(x, y) \, dx \, dy. $$ The density at a single point is not itself a probability (it is a probability per unit area), but it tells us where the mass piles up.

A small worked example fixes the discrete case. Toss two fair coins. Each coin can land $H$ or $T$, so the joint has four cells: $$ P(H, H) = P(H, T) = P(T, H) = P(T, T) = 0.25. $$ The four numbers add to $1$, as required. Notice we have not assumed any structure beyond fairness; we just listed every outcome.
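If it helps to see the four-cell table as data, here is a minimal sketch in Python; representing the joint as a dictionary keyed by outcome pairs is just one convenient choice:

```python
# The joint PMF of two fair coins, keyed by (first coin, second coin).
joint = {
    ("H", "H"): 0.25,
    ("H", "T"): 0.25,
    ("T", "H"): 0.25,
    ("T", "T"): 0.25,
}

# The only two requirements on a joint PMF: non-negative everywhere...
assert all(p >= 0 for p in joint.values())
# ...and summing to one over all pairs.
assert abs(sum(joint.values()) - 1.0) < 1e-12
```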

A small worked example fixes the continuous case. Take a standard bivariate Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$. Its joint density is $$ p(x, y) = \frac{1}{2\pi} \, e^{-(x^2 + y^2)/2}. $$ The density is highest at the origin and falls off radially, so its contours of equal density are circles. To find, say, the probability that both $X$ and $Y$ lie inside the unit square, we would integrate $p(x, y)$ over $[0, 1] \times [0, 1]$; there is no shortcut. The joint is the model.
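To make that integral concrete, here is a rough numerical sketch using numpy; the grid resolution is an arbitrary choice, and a finer grid would give a more accurate answer:

```python
import numpy as np

# Approximate P((X, Y) in [0, 1] x [0, 1]) for the standard bivariate
# Gaussian by summing the density over a fine grid (a Riemann sum).
n = 1000                                         # grid points per axis (arbitrary)
xs = np.linspace(0.0, 1.0, n)
x, y = np.meshgrid(xs, xs)
pdf = np.exp(-(x**2 + y**2) / 2) / (2 * np.pi)   # the joint density p(x, y)
cell_area = (xs[1] - xs[0]) ** 2                 # area of one grid cell

print(pdf.sum() * cell_area)                     # ≈ 0.1165
```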

Two things are worth saying at the start. First, a joint distribution contains everything there is to know about the variables. Marginals, conditionals, correlations, dependencies, all of it can be recovered from the joint. The joint is the most informative object and also the most expensive: a joint over $d$ binary variables has $2^d$ entries. Second, when we speak of "the distribution" of several variables in AI, we almost always mean the joint, even if we end up working with a marginal or a conditional in practice.

Marginalising

Marginalising answers the question: "I have a joint over many variables, but I only care about one of them. What does that one look like on its own?" The recipe is to add up (or integrate) over all the values of the variables we do not care about.

For two discrete variables, $$ P(X = x) = \sum_y P(X = x, Y = y). $$ We are walking along the row of the joint table that corresponds to $X = x$ and totalling the entries. The result is a valid PMF for $X$, sometimes called the marginal PMF.

For continuous variables, $$ p(x) = \int p(x, y) \, dy. $$ The integral plays the role of the sum. The result is the marginal density of $X$. The same idea generalises in the obvious way: to obtain the marginal of $X_1$ from a joint over $d$ variables, integrate (or sum) over $X_2, \ldots, X_d$.

A worked example using our two coins: from the joint $P(H, H) = P(H, T) = 0.25$, etc., the marginal of the first coin is $$ P(X = H) = P(X = H, Y = H) + P(X = H, Y = T) = 0.25 + 0.25 = 0.5. $$ By symmetry $P(X = T) = 0.5$. The marginal of a single fair coin is $\{0.5, 0.5\}$, which is what we knew already, but now we have derived it from the joint by collapsing across the second variable.
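The same collapse written as code, as a minimal sketch:

```python
# Marginalising the two-coin joint: sum over the values of the second
# coin to obtain the distribution of the first on its own.
joint = {("H", "H"): 0.25, ("H", "T"): 0.25,
         ("T", "H"): 0.25, ("T", "T"): 0.25}

marginal_first = {}
for (first, second), p in joint.items():
    marginal_first[first] = marginal_first.get(first, 0.0) + p

print(marginal_first)   # {'H': 0.5, 'T': 0.5}
```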

Why does the word "marginal" turn up here? It comes from old probability tables, where you would write the joint in a grid and put the row totals down the right-hand margin and the column totals along the bottom. Those margin numbers are the marginal distributions. The word stuck.

Marginalising does useful, often hard, work throughout AI. In a Bayesian model with hidden variables $\mathbf{Z}$, the prediction we actually want, say $P(\text{label} \mid \text{data})$, is a marginal in which the hidden $\mathbf{Z}$ has been integrated out. In graphical models, algorithms such as belief propagation are essentially clever ways of marginalising over many variables without enumerating every combination. In Bayesian model selection, the marginal likelihood measures how well a model fits the data, with parameters marginalised away. Whenever a quantity is "summed out" or "integrated out", marginalisation is at work.

Conditioning

Conditioning answers the opposite question: "I have just learned the value of one variable. How does that change my beliefs about the others?" If $Y$ takes the value $y$, the conditional distribution of $X$ given $Y = y$ is $$ P(X = x \mid Y = y) = \frac{P(X = x, Y = y)}{P(Y = y)}, \qquad p(x \mid y) = \frac{p(x, y)}{p(y)}. $$ The numerator is the joint at $(x, y)$. The denominator is the marginal of $Y$ at $y$, which we get by marginalising as in the previous subsection. Dividing by the marginal renormalises the joint so that the resulting numbers add up to one across the values of $X$. For each fixed $y$, the conditional is a perfectly ordinary distribution over $X$.

A worked example using the two coins. We have $P(X = H, Y = H) = 0.25$ and we computed $P(Y = H) = 0.5$. Therefore $$ P(X = H \mid Y = H) = \frac{0.25}{0.5} = 0.5. $$ Knowing that the second coin came up heads tells us nothing new about the first coin, which still has a fifty-fifty chance of being heads. That is just what independence ought to look like, and we will return to it shortly.

Two intuitions help. First, conditioning slices the joint table: when we condition on $Y = y$, we throw away every column of the joint where $Y \ne y$ and keep only the column $Y = y$. Second, the numbers in that column are no longer probabilities, because they do not add to one, so we divide through by their total to make them a distribution again. The shape of the conditional comes from the slice; the rescaling just normalises it.
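The slice-then-renormalise recipe is short enough to write out in full; a minimal sketch:

```python
# Conditioning the two-coin joint on the second coin landing heads:
# keep only the entries consistent with the evidence, then renormalise.
joint = {("H", "H"): 0.25, ("H", "T"): 0.25,
         ("T", "H"): 0.25, ("T", "T"): 0.25}

evidence = "H"
unnormalised = {x: p for (x, y), p in joint.items() if y == evidence}
total = sum(unnormalised.values())          # this is the marginal P(Y = H) = 0.5
conditional = {x: p / total for x, p in unnormalised.items()}

print(conditional)   # {'H': 0.5, 'T': 0.5}, matching the calculation above
```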

Conditioning is how we use evidence in AI. Almost every inference task is "given the input $\mathbf{x}$, what is the distribution over the output $y$?", that is, we want $p(y \mid \mathbf{x})$. Bayes' theorem is conditioning written backwards; classifiers compute conditionals; language models predict the next token by conditioning on what came before. Whenever a model "takes the evidence into account", it is conditioning.

Independence revisited

Two variables are independent if the joint splits cleanly into the product of the marginals, $$ P(X = x, Y = y) = P(X = x) \, P(Y = y) \quad \text{for all } x, y, $$ or equivalently, if $p(x \mid y) = p(x)$ for every $y$ where the conditional is defined. Knowing $Y$ teaches us nothing about $X$. The two coins above are independent: each cell of the joint really is $0.5 \times 0.5 = 0.25$, and $P(X = H \mid Y = H) = 0.5 = P(X = H)$.

A small counterexample shows what dependence looks like. A bag holds $5$ red and $5$ blue marbles. Draw two without replacement. The marginal probability that the first draw is red is $5/10 = 0.5$. The marginal probability that the second draw is red, by symmetry, is also $0.5$. But conditioning on a red first draw, $$ P(\text{2nd red} \mid \text{1st red}) = \frac{4}{9} \approx 0.444 \ne 0.5. $$ The two draws are not independent: knowing the first changes our beliefs about the second. Replacing the marble after the first draw would make them independent again.
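A short simulation makes the dependence tangible. This sketch estimates the conditional probability by counting; the number of trials is an arbitrary choice:

```python
import random

# Estimate P(2nd red | 1st red) for two draws without replacement
# from a bag of 5 red and 5 blue marbles.
random.seed(0)
first_red = 0
both_red = 0
for _ in range(200_000):
    bag = ["R"] * 5 + ["B"] * 5
    random.shuffle(bag)           # shuffling, then reading off, simulates drawing
    if bag[0] == "R":
        first_red += 1
        if bag[1] == "R":
            both_red += 1

print(both_red / first_red)       # ≈ 4/9 ≈ 0.444, not 0.5
```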

In AI, full independence is a strong assumption that almost never holds for raw features. Conditional independence, in which variables are independent once the value of a third variable is given, is the much more useful structural assumption behind many tractable models. A naive Bayes classifier assumes the features are independent given the class, which is enough to make the model fit on a laptop even when no one believes the assumption literally.

Chain rule for joints

Conditioning gives us a way to break a joint apart. Rearranging $p(x \mid y) = p(x, y)/p(y)$ gives the two-variable chain rule $$ p(x, y) = p(y) \, p(x \mid y), $$ which extends to any number of variables by repeated conditioning: $$ p(x_1, x_2, x_3) = p(x_1) \, p(x_2 \mid x_1) \, p(x_3 \mid x_1, x_2), $$ and in general $$ p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid x_1, \ldots, x_{i-1}). $$ A joint over many variables is always a product of conditionals. This factorisation is the engineering backbone of large parts of modern AI:

  • Bayesian networks use the chain rule plus an independence graph: each variable depends only on its parents in the graph, and the joint becomes $\prod_i p(x_i \mid \text{parents}(x_i))$.
  • Autoregressive language models apply the chain rule to a sentence: $p(w_1, w_2, \ldots, w_T) = \prod_{t=1}^T p(w_t \mid w_{<t})$. A transformer is, at heart, a parametrised approximation to each conditional $p(w_t \mid w_{<t})$.
  • Sequential generative models for images, audio, and video (PixelRNN, WaveNet, and many diffusion-adjacent models) use the same idea, conditioning each step on what was generated before.

Whenever you see a model that produces one thing at a time and feeds previous outputs back in as input, the chain rule is the reason the maths works.
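The factorisation itself is easy to check numerically. The sketch below builds an arbitrary, made-up joint over three binary variables, extracts the three factors from it, and confirms that their product reconstructs the joint exactly:

```python
import numpy as np

# An arbitrary joint over three binary variables, stored as a 2x2x2
# array of probabilities (the values are random, then normalised).
rng = np.random.default_rng(0)
joint3 = rng.random((2, 2, 2))
joint3 /= joint3.sum()                                    # now a valid joint

p1 = joint3.sum(axis=(1, 2))                              # p(x1): marginalise x2, x3
p2_given_1 = joint3.sum(axis=2) / p1[:, None]             # p(x2 | x1)
p3_given_12 = joint3 / joint3.sum(axis=2, keepdims=True)  # p(x3 | x1, x2)

# Chain rule: p(x1, x2, x3) = p(x1) p(x2 | x1) p(x3 | x1, x2).
reconstructed = p1[:, None, None] * p2_given_1[:, :, None] * p3_given_12
assert np.allclose(reconstructed, joint3)
```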

Worked example: spam classification with two features

Suppose we want a tiny spam filter using two binary features: $X_1 = 1$ if the email contains the word free and $X_1 = 0$ otherwise; $X_2 = 1$ if it contains click, $X_2 = 0$ otherwise. The label is $Y \in \{\text{spam}, \neg\text{spam}\}$. The full joint $P(X_1, X_2, Y)$ has $2 \times 2 \times 2 = 8$ entries.

We will use the following numbers, taken from a hypothetical training set: $P(\text{spam}) = 0.3$, so $P(\neg\text{spam}) = 0.7$. $P(X_1 = 1 \mid \text{spam}) = 0.7$, $P(X_1 = 1 \mid \neg\text{spam}) = 0.1$. $P(X_2 = 1 \mid \text{spam}) = 0.7$, $P(X_2 = 1 \mid \neg\text{spam}) = 0.1$. A naive Bayes classifier assumes $X_1$ and $X_2$ are conditionally independent given $Y$, which lets us write the joint as $$ P(X_1, X_2, Y) = P(Y) \, P(X_1 \mid Y) \, P(X_2 \mid Y). $$

A new email arrives with $X_1 = 1$ and $X_2 = 1$. We want $P(\text{spam} \mid X_1 = 1, X_2 = 1)$. Bayes' theorem says this is proportional to the joint at $(1, 1, \text{spam})$: $$ P(\text{spam} \mid X_1=1, X_2=1) \propto P(\text{spam}) \, P(X_1=1 \mid \text{spam}) \, P(X_2=1 \mid \text{spam}) = 0.3 \cdot 0.7 \cdot 0.7 = 0.147. $$ And for the not-spam hypothesis, $$ P(\neg\text{spam} \mid X_1=1, X_2=1) \propto 0.7 \cdot 0.1 \cdot 0.1 = 0.007. $$ Normalising, $$ P(\text{spam} \mid X_1=1, X_2=1) = \frac{0.147}{0.147 + 0.007} = \frac{0.147}{0.154} \approx 0.954. $$ The classifier is confident the email is spam, even though spam is the rarer class overall. Every step here used a tool from this section: the chain rule (to factorise the joint), conditional independence (to simplify the factorisation), and conditioning combined with renormalising (to compute the posterior).
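Here is the whole calculation as a minimal sketch, using exactly the numbers above; "ham" is shorthand for not-spam:

```python
# Naive Bayes spam posterior for an email with X1 = 1 and X2 = 1.
p_spam = 0.3
p_x1_given = {"spam": 0.7, "ham": 0.1}   # P(X1 = 1 | Y)
p_x2_given = {"spam": 0.7, "ham": 0.1}   # P(X2 = 1 | Y)

# Unnormalised posteriors via the naive Bayes factorisation
# P(Y) * P(X1 | Y) * P(X2 | Y).
score_spam = p_spam * p_x1_given["spam"] * p_x2_given["spam"]      # 0.147
score_ham = (1 - p_spam) * p_x1_given["ham"] * p_x2_given["ham"]   # 0.007

# Conditioning: renormalise so the two posteriors sum to one.
posterior_spam = score_spam / (score_spam + score_ham)
print(posterior_spam)   # ≈ 0.954
```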

What you should take away

  1. A joint distribution describes several variables at once, by a table of probabilities in the discrete case and by a density in the continuous case, and contains everything else as a consequence.
  2. Marginalising means summing or integrating over the variables you do not care about; it produces the distribution of the variables you do care about.
  3. Conditioning means slicing the joint at a known value and renormalising; it is how a model uses evidence.
  4. Independence is the special case where the joint factorises into a product of marginals; conditional independence is the much weaker, much more useful assumption that powers naive Bayes and Bayesian networks.
  5. The chain rule decomposes any joint into a product of conditionals, and is the structural backbone of autoregressive language models, sequential generative models, and graphical models.
