Joint Distribution, Glossary, Textbook of AI

A Joint Distribution specifies the probability of every possible combination of values for two or more random variables simultaneously. For discrete variables $X$ and $Y$ it is given by the joint probability mass function $p(x, y) = P(X = x, Y = y)$; for continuous variables, by the joint probability density function $f(x, y)$, with probabilities computed as integrals $P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy$. The joint distribution encodes everything there is to know probabilistically about the relationship between the variables: their individual behaviours, their dependencies, and the conditional structure that links them.

From a joint distribution one can recover the marginal distribution of each variable by summing or integrating out the others:

$$p(x) = \sum_y p(x, y), \qquad f(x) = \int f(x, y)\,dy.$$

This operation, called marginalisation, is one of the two fundamental operations of probabilistic inference; the other, conditioning, recovers the conditional distribution $p(y \mid x) = p(x, y) / p(x)$. Together with Bayes's theorem $p(\theta \mid D) = p(D \mid \theta) p(\theta) / p(D)$, where the evidence $p(D) = \int p(D \mid \theta) p(\theta)\,d\theta$ is itself a marginal, these operations underwrite all of Bayesian statistics, belief propagation in graphical models, and variational inference.
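As a concrete illustration of marginalisation and conditioning, the following minimal sketch stores a small discrete joint distribution as a NumPy array and recovers the marginals, a conditional, and a Bayes-style reversal from it. The probability values and variable names (`joint`, `p_x`, `p_y_given_x`, and so on) are made up for the example and do not come from the text.

```python
import numpy as np

# Hypothetical joint pmf p(x, y): rows index x in {0, 1}, columns index y in {0, 1, 2}.
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.30, 0.10, 0.20],
])
assert np.isclose(joint.sum(), 1.0)  # a valid joint pmf sums to one

# Marginalisation: sum out the other variable.
p_x = joint.sum(axis=1)  # p(x) = sum_y p(x, y)
p_y = joint.sum(axis=0)  # p(y) = sum_x p(x, y)

# Conditioning: p(y | x) = p(x, y) / p(x); row i holds the distribution of y given x = i.
p_y_given_x = joint / p_x[:, None]

# Bayes's theorem: p(x | y) = p(y | x) p(x) / p(y); column j holds the distribution of x given y = j.
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]

print("p(x)     =", p_x)
print("p(y)     =", p_y)
print("p(y | x) =\n", p_y_given_x)
print("p(x | y) =\n", p_x_given_y)
```

Each row of `p_y_given_x` and each column of `p_x_given_y` sums to one, and `p_x_given_y` agrees with dividing the joint directly by `p_y`, which is one way to sanity-check the two operations against each other.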

Joint distributions over high-dimensional random vectors quickly become intractable if treated explicitly: $K$ binary variables have $2^K$ joint states, so the full joint over even a modest 50 variables already exceeds $10^{15}$ entries. Probabilistic graphical models make joint distributions tractable by encoding conditional independence assumptions through graph structure. A Bayesian network factorises the joint as

$$p(x_1, \ldots, x_n) = \prod_{i=1}^n p(x_i \mid \mathrm{parents}(x_i)),$$

dramatically reducing the number of parameters required and rendering inference algorithms such as variable elimination and belief propagation practical. Markov random fields factorise the joint instead over cliques of an undirected graph, and conditional random fields are their discriminative cousins. Modern autoregressive models, including large language models, use the chain rule of probability $p(x_1, \ldots, x_n) = \prod_i p(x_i \mid x_{<i})$, with each conditional parametrised by a neural network. Diffusion models and normalising flows parametrise joint densities through invertible mappings to simpler base distributions; energy-based models specify an unnormalised joint and rely on Monte Carlo to handle the partition function. In every case the central object, the structure being modelled, is a joint distribution over the high-dimensional variables of interest.
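To make the saving from factorisation concrete, here is a minimal sketch that builds the joint of a three-variable chain $X_1 \to X_2 \to X_3$ from its conditional probability tables and compares parameter counts with an explicit joint table. The chain structure, the table values, and the helper names `full_table_params` and `chain_params` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical conditional probability tables for a binary chain X1 -> X2 -> X3.
p_x1 = np.array([0.6, 0.4])                # p(x1)
p_x2_given_x1 = np.array([[0.7, 0.3],      # p(x2 | x1 = 0)
                          [0.2, 0.8]])     # p(x2 | x1 = 1)
p_x3_given_x2 = np.array([[0.9, 0.1],      # p(x3 | x2 = 0)
                          [0.5, 0.5]])     # p(x3 | x2 = 1)

# Bayesian-network factorisation:
# joint[a, b, c] = p(x1 = a) * p(x2 = b | x1 = a) * p(x3 = c | x2 = b)
joint = np.einsum("a,ab,bc->abc", p_x1, p_x2_given_x1, p_x3_given_x2)

# Marginal of X3, obtained by summing out X1 and X2.
p_x3 = joint.sum(axis=(0, 1))

# The graph encodes that X3 is independent of X1 given X2:
# p(x3 | x1, x2) computed from the joint should not depend on x1.
p_x3_given_x1_x2 = joint / joint.sum(axis=2, keepdims=True)


def full_table_params(k: int) -> int:
    """Free parameters of an explicit joint table over k binary variables."""
    return 2**k - 1


def chain_params(k: int) -> int:
    """Free parameters of a binary chain: 1 for the root, 2 per later node."""
    return 1 + 2 * (k - 1)


print("p(x3) =", p_x3)
print("explicit table, 50 variables:", full_table_params(50))
print("chain network, 50 variables: ", chain_params(50))
```

A chain is the sparsest connected case, so the gap shown here is close to the best one can hope for; denser graphs need more parameters per node, since each table grows exponentially in the number of parents.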

Interactive

Joint, marginal, and conditional distributions. A joint distribution lives over two axes. Marginalise to one axis, condition on a slice.


Discussed in:

Chapter 6: ML Fundamentals, Mathematical Foundations
