Glossary

Conditional Probability

The Conditional Probability of $A$ given $B$ is defined as $P(A \mid B) = P(A \cap B) / P(B)$, provided $P(B) > 0$. This deceptively simple ratio encodes the fundamental operation of learning from evidence: we begin with a prior belief about $A$, observe $B$, and update our belief to reflect the new information. Conditional probability pervades every supervised learning algorithm, which can be interpreted as estimating or approximating some conditional distribution $p(y \mid x)$—the probability of a label given features.
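The ratio definition can be checked by direct counting on a finite sample space. The sketch below uses a toy example (two fair dice, invented for illustration) and exact fractions to avoid floating-point noise:

```python
from fractions import Fraction

# Toy sample space: all outcomes of rolling two fair six-sided dice.
omega = [(i, j) for i in range(1, 7) for j in range(1, 7)]

A = {(i, j) for (i, j) in omega if i + j == 7}  # event A: the sum is 7
B = {(i, j) for (i, j) in omega if i == 3}      # event B: first die shows 3

p_B = Fraction(len(B), len(omega))
p_AB = Fraction(len(A & B), len(omega))

# P(A | B) = P(A ∩ B) / P(B), defined because P(B) > 0
p_A_given_B = p_AB / p_B
print(p_A_given_B)  # 1/6
```

Observing $B$ shrinks the sample space to the six outcomes where the first die shows 3; exactly one of them sums to 7, matching the ratio.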

The chain rule of probability generalises conditional probability to sequences: $p(x_1, \ldots, x_n) = p(x_1) \prod_{i=2}^n p(x_i \mid x_1, \ldots, x_{i-1})$. This factorisation is the mathematical heart of autoregressive language models, which factorise the joint distribution over tokens into a product of conditionals, each parameterised by a neural network that predicts the next token given everything before it.
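The chain-rule factorisation can be sketched with a toy next-token model. The bigram conditional probabilities below are invented for illustration; a real language model would condition on the full prefix, not just the previous token:

```python
import math

# Hypothetical bigram model: p(next token | previous token).
# "<s>" is a start-of-sequence marker; probabilities are made up.
cond = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.2,
    ("cat", "sat"): 0.3,
}

def joint_log_prob(tokens):
    """Chain rule: log p(x_1, ..., x_n) = sum_i log p(x_i | x_{i-1})."""
    logp = 0.0
    prev = "<s>"
    for tok in tokens:
        logp += math.log(cond[(prev, tok)])
        prev = tok
    return logp

lp = joint_log_prob(["the", "cat", "sat"])
print(math.exp(lp))  # 0.5 * 0.2 * 0.3 = 0.03
```

Working in log space, as above, is standard practice: the product of many small conditionals underflows quickly, while the sum of their logarithms does not.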

Conditional independence—the statement that $X$ and $Y$ are independent given $Z$—is a weaker but more useful assumption than full independence. It is the organising principle of graphical models: a Bayesian network encodes a set of conditional independences through its graph structure, and efficient inference algorithms exploit this structure. The naive Bayes classifier assumes all features are conditionally independent given the class label, an assumption that is almost never exactly true but produces surprisingly effective classifiers in practice.

Related terms: Bayes' Theorem, Joint Distribution

Also defined in: Textbook of AI, Textbook of Medical AI, Textbook of Medical Statistics