4.2 Probability basics
Before we can talk about Bayes' theorem, before we can speak the language of distributions, before we can train a model that knows when to be unsure, we need a small but careful vocabulary. That vocabulary has only a handful of words in it: sample space, event, conditional probability, independence, joint probability. Each of these is a simple idea dressed in slightly forbidding notation. Once the notation feels ordinary, the rest of the chapter, and a great deal of the rest of the book, becomes a matter of recognising old friends in new clothes. This section introduces those words slowly, with worked examples drawn from the kinds of problems you will meet in artificial intelligence: dice that stand in for noisy sensors, coins that stand in for diagnostic tests, and email words that stand in for the features of a spam classifier.
Sample space and events
The sample space, written $\Omega$ (the Greek capital omega), is the set of all possible outcomes of whatever experiment we have in mind. It is a complete catalogue of what could happen, drawn up before the experiment is run. An event is any subset of $\Omega$, that is, any collection of outcomes we might want to ask a yes-or-no question about. The probability of an event is a number between zero and one that says how likely we think the event is.
Take the simplest example imaginable. We roll a fair six-sided die. The sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$, because those are the only outcomes we could see. A particular outcome, written $\omega$, might be $\omega = 4$. The event "roll an even number" is the subset $A = \{2, 4, 6\}$. Because the die is fair, every outcome has the same probability $1/6$, so $P(A) = 3/6 = 0.5$. Notice the structure: the sample space is the universe, an event is a region of that universe, and a probability is the size of the region as a fraction of the whole.
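This example is small enough to check by brute force. Here is a minimal Python sketch (using the standard `fractions` module to keep the arithmetic exact) that enumerates the sample space and measures the event as a fraction of the whole:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}            # sample space: all outcomes of one roll
A = {w for w in omega if w % 2 == 0}  # event: "roll an even number"

# With equally likely outcomes, P(A) is just |A| / |Omega|.
p_A = Fraction(len(A), len(omega))
print(p_A)  # 1/2
```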
A second example will help. Imagine sending a single bit across a noisy wireless channel. The bit can flip or arrive intact, so $\Omega = \{\text{flipped}, \text{intact}\}$. If the channel flips bits with probability $0.01$, then $P(\{\text{flipped}\}) = 0.01$ and $P(\{\text{intact}\}) = 0.99$. This tiny example is, in miniature, the kind of model AI engineers reach for when they want to reason about sensor noise, mistyped queries, or corrupted training labels.
Sample spaces can be much larger. The set of possible English sentences of length one hundred words is finite but astronomical. The set of possible images at a given resolution is finite but larger still. For continuous outcomes, such as a temperature, a body weight, or a distance, the sample space is uncountable, often $\Omega = \mathbb{R}$ or some interval. We then have to be slightly careful about which subsets count as events: not every subset of the real line can be assigned a probability without contradiction, so we restrict ourselves to a well-behaved family called the Borel sets. In practice this restriction never bites in machine learning, and you can carry the intuition from the die forward to the continuous case unchanged.
A useful habit, when you meet a new probability problem, is to pause and write down the sample space explicitly before reaching for any formula. What are the outcomes? Are they equally likely, or do some have more weight than others? Is the space discrete or continuous? Many beginner errors, and a fair few errors made by experienced practitioners, come from skipping this step and computing on a sample space that has not been pinned down. Once $\Omega$ is on paper, the probabilities of events are usually easy to write down as well.
Joint events and the union/intersection rules
Once we have events, we can combine them. The combinations are the standard set operations, written with the standard set-theoretic symbols.
- The union $A \cup B$ is the event "$A$ happens, or $B$ happens, or both". It is the set of outcomes that lie in $A$ or in $B$ or in both at once.
- The intersection $A \cap B$ is the event "both $A$ and $B$ happen". It is the set of outcomes that lie in $A$ and in $B$.
- The complement $A^c$ is the event "$A$ does not happen". It is everything in $\Omega$ except $A$.
These three operations are enough to phrase almost any everyday question about chance. "Will it rain or snow tomorrow?" is a union. "Will it rain and be windy?" is an intersection. "Will it not rain?" is a complement.
The probability of a complement is easy: $P(A^c) = 1 - P(A)$, because the event $A$ and its complement together cover the whole sample space exactly once. The probability of an intersection depends on how the events overlap, and in general we cannot deduce it from $P(A)$ and $P(B)$ alone; we need extra structure (independence, or a model) to compute it.
The probability of a union, however, has an exact formula known as the inclusion-exclusion principle. For two events, $$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$ We add the two probabilities, then subtract the overlap because we double-counted it. If $A$ and $B$ cannot both happen (they are disjoint), then $P(A \cap B) = 0$ and the formula collapses to a simple sum.
A worked example pins this down. Take an ordinary deck of fifty-two playing cards and draw one card uniformly at random. Let $R$ be the event "the card is red" and $F$ be the event "the card is a face card" (jack, queen or king). There are $26$ red cards, so $P(R) = 26/52 = 0.5$. There are $12$ face cards, so $P(F) = 12/52 \approx 0.231$. The two events overlap on the six red face cards (the red jacks, queens and kings), giving $P(R \cap F) = 6/52 \approx 0.115$. Inclusion-exclusion then says $$P(R \cup F) = \frac{26}{52} + \frac{12}{52} - \frac{6}{52} = \frac{32}{52} \approx 0.615.$$ About sixty-two per cent of cards are red, a face card, or both.
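These counts are easy to verify by enumeration. The sketch below builds the deck as rank-suit pairs (the rank and suit names are our own labels, not anything standard) and checks the inclusion-exclusion identity exactly:

```python
from fractions import Fraction
from itertools import product

ranks = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']  # hearts and diamonds are red
deck = list(product(ranks, suits))                 # 52 equally likely cards

R = {c for c in deck if c[1] in ('hearts', 'diamonds')}  # red card
F = {c for c in deck if c[0] in ('J', 'Q', 'K')}         # face card

def P(event):
    return Fraction(len(event), len(deck))

# Inclusion-exclusion: P(R or F) = P(R) + P(F) - P(R and F).
assert P(R | F) == P(R) + P(F) - P(R & F)
print(P(R | F))  # 8/13, about 0.615
```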
For three or more events the formula keeps growing, alternately subtracting pairwise overlaps and adding back triple overlaps; for our purposes the two-event version is the workhorse. There is also a cheap and useful upper bound called the union bound, which simply says $P(A_1 \cup A_2 \cup \cdots) \le P(A_1) + P(A_2) + \cdots$. It can be loose, but it holds for any events, independent or not, and it is the everyday tool used to bound the probability that anything bad happens in a long list of risks, a common pattern in the analysis of learning algorithms.
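A quick numerical illustration of how the bound behaves, with made-up numbers: ten failure modes, each of probability $0.01$. Independence is assumed here only so that the exact answer can be computed for comparison; the bound itself needs no such assumption.

```python
# Ten failure events, each with probability 0.01. Independence is assumed
# purely so we can compute the exact union probability to compare against.
n, p = 10, 0.01

exact = 1 - (1 - p) ** n   # P(at least one failure), using independence
union_bound = n * p        # the union bound holds regardless

print(round(exact, 4))  # 0.0956
print(union_bound)      # 0.1
```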
Conditional probability
The most important operation in probability for AI is conditioning. Conditioning is what we do whenever we update our beliefs in light of new information. Symbolically, the conditional probability of $A$ given $B$ is $$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0.$$ The vertical bar reads as "given". The formula has a plain-English meaning: among the outcomes in which $B$ has occurred, what fraction also have $A$? We have shrunk our universe from $\Omega$ down to $B$, then asked how much of the new, smaller universe is occupied by $A$.
Back to the deck of cards. The probability that a randomly drawn card is an ace, given that it is red, is $$P(\text{ace} \mid \text{red}) = \frac{P(\text{red ace})}{P(\text{red})} = \frac{2/52}{26/52} = \frac{2}{26} \approx 0.077.$$ There are two red aces among twenty-six red cards. Note that this conditional probability happens to equal the unconditional probability $P(\text{ace}) = 4/52 \approx 0.077$. Knowing the card is red has not changed how surprised we are to see an ace, because aces are spread evenly across colours. We will give this lack-of-effect a name in a moment.
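In code, conditioning really is just shrinking the universe. A sketch, reusing the deck construction from above, filters down to the red cards and measures the aces inside that smaller set:

```python
from fractions import Fraction
from itertools import product

ranks = ['A'] + [str(n) for n in range(2, 11)] + ['J', 'Q', 'K']
suits = ['hearts', 'diamonds', 'clubs', 'spades']
deck = list(product(ranks, suits))

red  = {c for c in deck if c[1] in ('hearts', 'diamonds')}
aces = {c for c in deck if c[0] == 'A'}

# Conditioning shrinks the universe from the whole deck down to the red
# cards, then asks what fraction of that smaller universe is aces.
print(Fraction(len(aces & red), len(red)))  # 1/13, about 0.077
print(Fraction(len(aces), len(deck)))       # also 1/13: red told us nothing
```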
A clinical example is closer to the spirit of the chapter. Suppose a rapid test for some condition has sensitivity $P(\text{positive} \mid \text{disease}) = 0.95$ and specificity $P(\text{negative} \mid \text{healthy}) = 0.98$. These are conditionals, not absolutes. The probability of being diseased given a positive test, $P(\text{disease} \mid \text{positive})$, is a different conditional and depends on how common the disease is in the first place. Beginners in clinical AI very often confuse the two directions, with serious practical consequences. Bayes' theorem in section 4.3 is exactly the formula that converts between them. For now, simply remember: $P(A \mid B)$ and $P(B \mid A)$ are different numbers, even when they look similar on the page.
Two consequences of the definition appear so often that they have their own names. The first is the multiplication rule, which is just the conditional-probability definition rearranged: $P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$. It tells us how to compute the probability of two events both happening, given a conditional and a marginal. The second is the law of total probability, which says that if events $B_1, B_2, \ldots, B_k$ partition the sample space, that is, exactly one of them happens, then for any event $A$, $$P(A) = \sum_{i=1}^{k} P(A \mid B_i)\, P(B_i).$$ We compute the probability of $A$ by averaging its conditional probability across every way the world might be, weighted by how likely each of those ways is. This is the integration step that converts likelihoods into evidence in Bayes' theorem, and we will lean on it heavily in the next section.
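Here is the law of total probability applied to the test from the previous paragraph. The sensitivity and specificity come from the text; the one per cent prevalence is an assumption invented for this illustration:

```python
# Sensitivity and specificity as in the text; the 1% prevalence is a
# made-up number for illustration only.
p_disease           = 0.01
p_pos_given_disease = 0.95      # sensitivity
p_pos_given_healthy = 1 - 0.98  # 1 - specificity

# Law of total probability over the partition {disease, healthy}.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))
print(round(p_pos, 4))  # 0.0293: positives are themselves fairly rare
```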
Independence
Two events $A$ and $B$ are independent, written $A \perp B$, when knowing that one of them happened gives no information about the other. Formally, $$A \perp B \iff P(A \cap B) = P(A)\, P(B).$$ An equivalent way of saying the same thing, when $P(B) > 0$, is that $P(A \mid B) = P(A)$: the conditional probability equals the unconditional one. Conditioning on $B$ has done nothing.
A classic example. Roll two fair dice. Let $A$ be "the first die shows a six" and $B$ be "the second die shows a six". Each event has probability $1/6$. The intersection $A \cap B$ is "both dice show a six", which is one outcome out of thirty-six, so $P(A \cap B) = 1/36$. And indeed $P(A)\, P(B) = (1/6)(1/6) = 1/36$. The two events are independent, exactly as physical intuition suggests: nothing about the first die's outcome reaches across the table to influence the second.
Counterexamples are just as instructive. Roll two fair dice and let $A$ be "the first die shows a six" and $C$ be "the sum of the two dice is twelve". Only when the first die is a six can the sum reach twelve. Here $P(A) = 1/6$ and $P(C) = 1/36$, but $P(A \cap C) = 1/36$, while $P(A)\,P(C) = 1/216$. The two events are not independent: knowing the first die was a six makes a sum of twelve much more likely than it would otherwise be. This is the everyday situation in machine learning: features, labels, time steps and sensor readings are usually entangled, and assuming independence when it does not hold is a reliable way to ship a broken model.
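Both the dice example and its counterexample can be checked by enumerating all thirty-six outcomes, as in this sketch:

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), range(1, 7)))  # 36 equally likely pairs

A = {w for w in omega if w[0] == 6}          # first die shows a six
B = {w for w in omega if w[1] == 6}          # second die shows a six
C = {w for w in omega if w[0] + w[1] == 12}  # the sum is twelve

def P(event):
    return Fraction(len(event), len(omega))

print(P(A & B) == P(A) * P(B))  # True:  1/36 == 1/36, independent
print(P(A & C) == P(A) * P(C))  # False: 1/36 != 1/216, dependent
```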
For three or more events there is a subtle trap. Pairwise independence, every pair independent of each other, does not imply that the whole collection is mutually independent. The classic counterexample uses two fair coin flips $X$ and $Y$ together with their exclusive-or $Z = X \oplus Y$: any pair of $\{X, Y, Z\}$ is independent, yet once you know two of them the third is fixed. Genuine mutual independence requires the joint probability to factor into the product of marginals for every subset, not just every pair.
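The coin-flip counterexample is itself a small enumeration. Every pair passes the product test, yet the triple fails it:

```python
from fractions import Fraction
from itertools import product

omega = list(product([0, 1], [0, 1]))  # two fair coin flips (x, y)

def P(event):
    return Fraction(len(event), len(omega))

X = {w for w in omega if w[0] == 1}
Y = {w for w in omega if w[1] == 1}
Z = {w for w in omega if w[0] ^ w[1] == 1}  # Z = X xor Y

pairs_ok = all(P(a & b) == P(a) * P(b) for a, b in [(X, Y), (X, Z), (Y, Z)])
print(pairs_ok)                            # True: pairwise independent
print(P(X & Y & Z) == P(X) * P(Y) * P(Z))  # False: 0 versus 1/8
```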
Conditional independence
A related and, in practice, more useful notion is conditional independence. Two events $A$ and $B$ are conditionally independent given a third event $C$, written $A \perp B \mid C$, when $$P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C).$$ The point is that $A$ and $B$ may be dependent in general, while becoming independent once we know $C$: information about $C$ accounts for the apparent link between them.
Spam classification is the canonical example. The words "free" and "click" both appear more often in spam than in ordinary email, so over the whole corpus the events "the email contains free" and "the email contains click" are positively correlated. But once we condition on the class, once we know whether the email is spam or not, the dependence largely vanishes: within the spam pile each word appears more or less according to its own rate, and within the non-spam pile likewise. Treating word occurrences as conditionally independent given the class is the central simplifying assumption of the Naive Bayes classifier. The assumption is wrong in fine detail, but it is approximately right and turns an intractable joint distribution over thousands of binary word features into a product of one-dimensional ones. Many tractable AI models are built on a single conditional-independence assumption of this shape.
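To make the payoff concrete, here is a minimal Naive Bayes scorer. Every number below is invented for illustration; a real classifier would estimate these rates from a labelled corpus. The conditional-independence assumption is what licenses the product inside the loop:

```python
# Invented per-word occurrence rates, conditioned on the class. A real
# classifier would estimate these from labelled training data.
p_word_given_spam = {'free': 0.30, 'click': 0.25, 'meeting': 0.02}
p_word_given_ham  = {'free': 0.02, 'click': 0.03, 'meeting': 0.20}
p_spam = 0.4  # assumed prior probability that an email is spam

def score(words, p_word_given_class, p_class):
    """P(class) times the product of P(word | class): the Naive Bayes
    factorisation, valid only under conditional independence."""
    s = p_class
    for w in words:
        s *= p_word_given_class[w]
    return s

email = ['free', 'click']
spam = score(email, p_word_given_spam, p_spam)
ham  = score(email, p_word_given_ham, 1 - p_spam)

# Normalising the two scores gives P(spam | words) under the assumption.
print(round(spam / (spam + ham), 3))  # 0.988
```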
Conditional independence is the language of graphical models, hidden Markov models, Kalman filters and Bayesian networks. Whenever you see a model drawn as a graph, the graph is encoding which variables become independent of which others once their neighbours are known. A directed arrow from one node to another roughly says "this variable depends on that one"; the absence of an arrow says "these are conditionally independent given the rest". The art of building a tractable probabilistic model is largely the art of choosing the right conditional-independence assumptions: strong enough to make the maths cheap, weak enough to leave the structure that actually matters in your data.
The chain rule for joint probabilities
Finally, every joint probability, no matter how many events are involved, can be written as a product of conditionals using the chain rule: $$P(A_1, A_2, \ldots, A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1}).$$ The right-hand side reads from left to right like a story. We start with the marginal probability of the first event. We then multiply by the probability of the second given the first. Then the probability of the third given the first two. And so on, until the last event has been conditioned on everything that came before.
The chain rule is the mathematical backbone of every modern autoregressive language model. A sentence is a sequence of word tokens $w_1, w_2, \ldots, w_n$. Its joint probability factors exactly as $$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1}).$$ Each conditional $P(w_t \mid w_1, \ldots, w_{t-1})$ is exactly what a transformer language model is trained to predict from the preceding context. Generating text is then a matter of sampling each next word in turn from this conditional. The same factorisation underlies image generation models that predict pixels in raster order, audio models that predict samples one at a time, and code models that predict the next token of a program. Three lines of definition do an enormous amount of work.
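As a toy illustration of the factorisation, here is a word-level bigram "model". Its conditionals look only one token back, a crude stand-in for the full context a transformer would use, but the generation loop is the chain rule read left to right:

```python
import random
from collections import Counter, defaultdict

# A deliberately tiny "corpus"; the bigram conditionals below look only
# one token back, a crude stand-in for P(w_t | w_1, ..., w_{t-1}).
tokens = "the cat sat on the mat the cat ran".split()

counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def conditional(prev):
    """Estimate P(next | prev) from bigram counts."""
    total = sum(counts[prev].values())
    return {w: c / total for w, c in counts[prev].items()}

# Generation is the chain rule read left to right: sample each next
# token from its conditional given the context so far.
context = ['the']
for _ in range(5):
    dist = conditional(context[-1])
    if not dist:  # dead end: this token never had a successor
        break
    words, probs = zip(*dist.items())
    context.append(random.choices(words, weights=probs)[0])
print(' '.join(context))
```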
What you should take away
- The whole language of probability is built from a sample space $\Omega$ of possible outcomes, events that are subsets of $\Omega$, and a function $P$ that assigns each event a number between zero and one.
- Events combine through union, intersection and complement; the inclusion-exclusion rule $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ is the workhorse for "or" questions.
- Conditional probability $P(A \mid B) = P(A \cap B)/P(B)$ is how we update beliefs in the light of new information, and $P(A \mid B) \neq P(B \mid A)$ in general.
- Independence ($P(A \cap B) = P(A) P(B)$) is the rare and convenient case where conditioning changes nothing; conditional independence is the more useful real-world notion that powers Naive Bayes and graphical models.
- The chain rule factors any joint probability into a product of conditionals and is the mathematical backbone of every autoregressive language model.