Glossary

Joint Distribution

A Joint Distribution specifies the probability of every possible combination of values for two or more random variables simultaneously. For discrete variables $X$ and $Y$, it is given by the joint PMF $p(x, y) = P(X = x, Y = y)$; for continuous variables, by the joint PDF $f(x, y)$. The joint distribution encodes everything there is to know about the relationship between the variables: their individual behaviours, dependencies, and conditional structure.
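As a minimal sketch, a discrete joint PMF can be represented as a table mapping value pairs to probabilities. The numbers below are illustrative, not taken from any dataset:

```python
# A joint PMF p(x, y) for two binary variables X and Y,
# stored as a dictionary from (x, y) pairs to probabilities.
# These probabilities are made up for illustration.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

# A valid joint PMF is non-negative and sums to 1 over all joint states.
assert all(p >= 0 for p in joint.values())
assert abs(sum(joint.values()) - 1.0) < 1e-12

# P(X = 1, Y = 1) is read off directly:
print(joint[(1, 1)])  # → 0.4
```

Note that the table has one entry per combination of values, which is exactly why explicit joint tables grow exponentially with the number of variables.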

From a joint distribution one can recover the marginal distribution of each variable by summing or integrating out the others: $p(x) = \sum_y p(x, y)$ or $f(x) = \int f(x, y)\,dy$. This operation, called marginalisation, is one of the two fundamental operations of probabilistic inference (the other being conditioning). It is central to Bayesian model selection, belief propagation in graphical models, and variational inference.
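The discrete case, $p(x) = \sum_y p(x, y)$, can be sketched directly: sum the joint table over the variable being marginalised out. The example joint values here are illustrative:

```python
# Illustrative joint PMF over two binary variables X and Y.
joint = {
    (0, 0): 0.30, (0, 1): 0.10,
    (1, 0): 0.20, (1, 1): 0.40,
}

def marginal_x(joint):
    """Marginalise out Y: p(x) = sum over y of p(x, y)."""
    px = {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    return px

px = marginal_x(joint)
# p(X=0) = 0.30 + 0.10 = 0.40, p(X=1) = 0.20 + 0.40 = 0.60
```

The same pattern with the roles of $x$ and $y$ swapped recovers $p(y)$; for continuous variables the sum becomes an integral.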

Joint distributions over high-dimensional random vectors quickly become intractable if treated explicitly—$K$ binary variables have $2^K$ joint states. Graphical models make joint distributions tractable by encoding conditional independence assumptions through graph structure. A Bayesian network factorises the joint as $p(x_1, \ldots, x_n) = \prod_i p(x_i \mid \text{parents}(x_i))$, dramatically reducing the number of parameters. Modern autoregressive models, including large language models, use the chain rule of probability to factorise joint distributions over token sequences.
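The parameter savings from factorisation can be made concrete with a rough count. Assuming, for illustration, a Bayesian network whose graph is a simple chain (each variable has exactly one parent):

```python
# Parameter counts for a joint distribution over K binary variables.
K = 10

# Explicit joint table: one probability per joint state,
# minus 1 because the probabilities must sum to 1.
full_params = 2**K - 1

# Chain-structured Bayesian network (illustrative assumption):
# p(x_1) needs 1 parameter; each conditional p(x_i | x_{i-1})
# needs 2 (one per value of the parent).
chain_params = 1 + 2 * (K - 1)

print(full_params, chain_params)  # → 1023 19
```

The explicit table needs 1023 parameters while the chain needs 19, and the gap widens exponentially as $K$ grows; this is the sense in which graph structure makes joint distributions tractable.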

Related terms: Conditional Probability

Discussed in:

Also defined in: Textbook of AI