4.4 Random variables, PMFs and PDFs

In §4.2 we worked with events, collections of outcomes such as "the coin lands heads" or "the patient tests positive". Events have probabilities, and probabilities combine according to the axioms. That is enough to reason qualitatively, but it is not enough to reason quantitatively. As soon as we want to talk about averages, errors, losses, gradients or expected rewards, we need to attach a number to each outcome. That is what a random variable is: a way of saying, for every possible outcome the world could produce, "and the number we care about in that case is this".

Read the phrase "random variable" charitably. It is not a variable in the algebraic sense and it is not, strictly speaking, random. A random variable is a function, a deterministic rule, that takes an outcome and returns a number. The randomness lives upstream, in which outcome the world actually produces. Once an outcome is fixed, the value of the random variable is fixed too. The image is closer to a measuring instrument than to a die: the die rolls (that is the random part), the instrument reads off a number (that is the random variable).

Once we have a random variable $X$, two further objects describe it. The probability mass function (PMF) is for cases where $X$ can only take a discrete list of values, the face of a die, the number of patients who arrive in an hour, the predicted token at a given position. The PMF tells you the probability of each value. The probability density function (PDF) is for cases where $X$ can take any real value in some range, a temperature, a blood pressure, a logit, a learning rate. The PDF does not give probabilities directly; it gives a density that you integrate over a range to obtain a probability. Both are summarised together by the cumulative distribution function (CDF) $F(x) = P(X \le x)$, which works in either setting.

This section makes those three objects precise, shows how to compute with them, and ends with functions of random variables and random vectors. Specific named distributions are catalogued in §4.5.

Symbols Used Here
  $X$: random variable
  $P(X=x)$: PMF, probability that $X$ equals $x$ (discrete)
  $p(x)$: PDF, density at $x$ (continuous)
  $F(x) = P(X \le x)$: cumulative distribution function
  $\Omega$: sample space
  $\mathbb{R}$: real line

Random variables, formally

Recall that a probability space is a triple $(\Omega, \mathcal{F}, P)$: a sample space of possible outcomes, a collection of events to which we are willing to assign probabilities, and a probability measure that does the assigning. A random variable $X$ is a function $X : \Omega \to \mathbb{R}$, or, more generally, $X : \Omega \to S$ for some other state space $S$ such as $\mathbb{R}^d$, the integers, or even a finite set of categories. The phrase "measurable function" appears in textbooks because we also require that for any reasonable subset $B$ of the target space, the preimage $\{\omega : X(\omega) \in B\}$ is an event we are allowed to assign a probability to. For everything we will do in this book that condition is satisfied automatically; you can read past the word "measurable" without anxiety.

Two small worked examples will fix the idea.

First, roll a fair six-sided die. The sample space is $\Omega = \{1, 2, 3, 4, 5, 6\}$. Define the random variable $X$ by $X(\omega) = \omega$, the value of the face. Then $X$ takes the values $1$ through $6$, each with probability $1/6$. This is the simplest possible random variable: it just hands back the outcome, treated as a number.

Second, roll the same die but define a different random variable $Y$ by $Y(\omega) = 1$ if $\omega$ is even and $Y(\omega) = 0$ otherwise. Now $Y$ only takes two values. The probability that $Y = 1$ is the probability of rolling an even number, which is $3/6 = 1/2$. The point is that $X$ and $Y$ live on the same underlying experiment, the same die roll, but extract different information. A random variable is a summary of the outcome.
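
A minimal sketch in Python makes the "function of an outcome" picture concrete. The names `X`, `Y` and `omega` are our own illustrative choices, mirroring the notation above; the only random step is the draw of the outcome.

```python
import random

def X(omega):
    """The face value itself."""
    return omega

def Y(omega):
    """Indicator of an even roll."""
    return 1 if omega % 2 == 0 else 0

omega = random.choice([1, 2, 3, 4, 5, 6])  # the randomness lives here
print(omega, X(omega), Y(omega))           # once omega is fixed, X and Y are too
```

Run it a few times: `omega` changes, but on any single run both variables are determined the instant the outcome is.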

The distribution of $X$ is the rule that tells us $P(X \in A)$ for any reasonable set $A \subseteq \mathbb{R}$. In the discrete case it is enough to know $P(X = x)$ for each value $x$ that $X$ can take, because every event is a union of such atoms. In the continuous case single-point probabilities are all zero and we describe the distribution by a density that we integrate over the set $A$. Both descriptions specify the distribution completely; the CDF unifies them.

A useful habit from the start: distinguish the random variable $X$ (the rule) from a particular value $x$ (a number it might take). Lowercase $x$ is a placeholder; uppercase $X$ is the variable. Mathematics texts are strict about this and so will we be.

Probability mass functions (discrete)

A discrete random variable is one that takes values from a countable set $\mathcal{X} = \{x_1, x_2, \ldots\}$. The set may be finite (the six faces of a die) or countably infinite (the non-negative integers, as for a count of arrivals). The probability mass function is

$$ p_X(x_i) = P(X = x_i), \qquad i = 1, 2, \ldots $$

Two properties characterise it:

  1. Non-negativity: $p_X(x_i) \ge 0$ for every $i$.
  2. Total mass one: $\sum_i p_X(x_i) = 1$.

Worked example. For a fair die, $p_X(i) = 1/6$ for $i = 1, \ldots, 6$, and the sum is $6 \times 1/6 = 1$. Computing $P(X \in \{2, 4, 6\})$, the probability of an even roll, is just $p_X(2) + p_X(4) + p_X(6) = 3/6 = 1/2$, as we found above with the random variable $Y$.

A second worked example. Suppose a token classifier has produced logits over a vocabulary of size four and a softmax has converted them to probabilities $0.5, 0.3, 0.15, 0.05$ for tokens A, B, C, D. The predicted token is a discrete random variable whose PMF is exactly that vector. The two properties are satisfied: every entry is non-negative, and they sum to one. This is a microcosm of what every language model emits at every position.
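
A quick numerical sketch of this example, with the probability vector from the text. The token names and the use of NumPy's generic sampler are our own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical softmax output over a four-token vocabulary, as in the text.
tokens = ["A", "B", "C", "D"]
pmf = np.array([0.5, 0.3, 0.15, 0.05])

assert np.all(pmf >= 0)             # non-negativity
assert np.isclose(pmf.sum(), 1.0)   # total mass one

# Probability of a composite event, e.g. "token is A or B": sum the masses.
print(pmf[0] + pmf[1])              # 0.8

# Sampling the predicted token is sampling from this PMF.
print(rng.choice(tokens, p=pmf))
```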

PMFs compose cleanly. If $X$ has PMF $p_X$ and we form an event by collecting some of its values, the probability of that event is just the sum of the corresponding masses. To get the probability that $X$ is at most some value $x$, you sum $p_X$ from below: $F_X(x) = \sum_{x_i \le x} p_X(x_i)$. We will see this CDF reappear in the continuous case.

A common pitfall is to imagine that the individual values of a PMF must be small. They need not be. The PMF of a deterministic variable that is always five has $p_X(5) = 1$ and $p_X(x) = 0$ everywhere else; that is a perfectly valid distribution, just an unexciting one. What is required is that the probabilities are non-negative and add to one, not that they are individually small.

Probability density functions (continuous)

Now suppose $X$ can take any real value in some range, the temperature in a ward, the time until a server responds, a hidden activation. There are uncountably many possible values and we cannot list a probability for each one. Instead we describe the distribution with a probability density function $p$, also written $f_X$ when we need to make explicit which random variable it describes. The defining property is

$$ P(X \in [a, b]) = \int_a^b p(x)\, dx, $$

with

$$ p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\, dx = 1. $$

Two warnings are essential, and beginners run into both.

First, $p(x)$ is not a probability. It is a density, in the same sense that mass per unit length is not mass. You only get a probability after you integrate $p$ over an interval. As a consequence, $p(x)$ can exceed one. A Gaussian with standard deviation $\sigma = 0.1$ has peak density $1/(0.1 \sqrt{2\pi}) \approx 3.99$, almost four, and that is fine because the density falls away quickly enough on either side that the total area remains one. The number you should sanity-check is the integral, not the height.
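
A two-line numerical check of this claim, using the $\sigma = 0.1$ Gaussian from the text and a simple Riemann sum (the grid resolution is an arbitrary choice):

```python
import numpy as np

sigma = 0.1
x = np.linspace(-1.0, 1.0, 200001)
p = np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

dx = x[1] - x[0]
print(p.max())        # about 3.989: a density, not a probability
print(p.sum() * dx)   # about 1.0: the integral is what must equal one
```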

Second, in the continuous case the probability that $X$ equals any particular value is zero. The integral of $p$ over a single point is zero because a single point has zero width. Probabilities only attach to intervals, or, more generally, to sets of positive measure. This is a classic source of beginner confusion: surely $X$ takes some value? It does, but every individual value is "infinitely unlikely" relative to the continuum. The right question is "what is the probability that $X$ falls in this region?", not "what is the probability that $X$ equals this number?".

Worked example one: the uniform distribution on $[0, 1]$. By definition $p(x) = 1$ for $x \in [0, 1]$ and $p(x) = 0$ otherwise. The integral over $[0, 1]$ is $1 \cdot 1 = 1$. The probability that $X \in [0.2, 0.5]$ is $\int_{0.2}^{0.5} 1\, dx = 0.3$. Notice how natural this feels: a third of the unit interval has a third of the probability.

Worked example two: an exponential service time with rate $\lambda = 2$ per second. Here $p(x) = 2 e^{-2x}$ for $x \ge 0$ and zero for $x < 0$. The integral is $\int_0^\infty 2 e^{-2x}\, dx = 1$. The probability that the service finishes within the first second is $\int_0^1 2 e^{-2x}\, dx = 1 - e^{-2} \approx 0.865$. The probability that it takes between one and two seconds is $\int_1^2 2 e^{-2x}\, dx = e^{-2} - e^{-4} \approx 0.117$. This sort of integral is the daily bread of survival analysis and queueing.
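
The closed-form answers can be cross-checked by simulation. A minimal sketch, using NumPy's exponential sampler (note it is parameterised by the scale $1/\lambda$, not the rate):

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0  # rate, as in the text

# Closed-form probabilities from the CDF F(y) = 1 - exp(-lam * y).
print(1 - np.exp(-lam * 1))                 # P(X <= 1)      ~ 0.865
print(np.exp(-lam * 1) - np.exp(-lam * 2))  # P(1 <= X <= 2) ~ 0.117

# Monte Carlo check against simulated service times.
samples = rng.exponential(scale=1 / lam, size=1_000_000)
print(np.mean(samples <= 1))
print(np.mean((samples >= 1) & (samples <= 2)))
```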

Worked example three: a Gaussian density on the real line with mean $\mu = 0$ and variance $\sigma^2 = 1$ has $p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. Its peak height is $\approx 0.399$ at $x = 0$, and the area under the whole curve is one (a fact that takes a little work to prove but is worth taking on trust for now). About 68 per cent of the mass is within one standard deviation of the mean, 95 per cent within two, 99.7 per cent within three; we use those numbers constantly to reason about training noise and model uncertainty.

There are also mixed distributions that have both a continuous part and atoms. A familiar example in deep learning is the output of a rectified linear unit, $X = \max(0, Z)$ where $Z \sim \mathcal{N}(0, 1)$. With probability $1/2$ the value is exactly zero, a literal point mass, and otherwise it is the positive half of a Gaussian. Such mixtures appear in censored regression, in survival analysis with right-censoring, and in zero-inflated count models. CDFs handle them gracefully; PDFs and PMFs, on their own, do not.
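
You can see the point mass directly by simulation; the exact-equality test below is the tell-tale of an atom, which no purely continuous variable would pass with positive frequency:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)
x = np.maximum(0.0, z)      # X = max(0, Z), a ReLU applied to Gaussian noise

print(np.mean(x == 0.0))    # about 0.5: a genuine point mass at zero
print(np.mean(x > 0.0))     # the continuous half-Gaussian part
```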

Cumulative distribution function

The cumulative distribution function $F_X(x) = P(X \le x)$ is defined for any real-valued random variable, discrete or continuous, and ties everything together. It has four properties that any CDF must satisfy:

  1. Non-decreasing: as you move $x$ to the right you accumulate more probability.
  2. $\lim_{x \to -\infty} F_X(x) = 0$.
  3. $\lim_{x \to +\infty} F_X(x) = 1$.
  4. Right-continuous: $\lim_{h \downarrow 0} F_X(x + h) = F_X(x)$.

For a discrete variable, $F_X$ is a staircase: flat between possible values, with a vertical jump of size $p_X(x_i)$ at each $x_i$. For a continuous variable, $F_X$ is a smooth increasing curve and $p(x) = F_X'(x)$ wherever the derivative exists.

Worked example: the standard normal $X \sim \mathcal{N}(0, 1)$. By symmetry $F(0) = 0.5$. The standard tabulated values give $F(1) \approx 0.841$ and $F(-1) \approx 0.159$, so the probability of falling within one standard deviation of zero is $0.841 - 0.159 = 0.682$, the familiar 68 per cent. Likewise $F(2) \approx 0.977$ gives $P(|X| \le 2) \approx 0.954$, the 95 per cent figure. These four numbers, $0.5$, $0.841$, $0.977$, $0.999$, are worth memorising; they let you do back-of-envelope reasoning about Gaussian noise without reaching for a table.
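
You do not need a table in practice: the standard normal CDF can be written with the error function from Python's standard library, $\Phi(x) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$. A minimal sketch reproducing the numbers above:

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

print(Phi(0))              # 0.5
print(Phi(1), Phi(-1))     # ~0.841, ~0.159
print(Phi(1) - Phi(-1))    # ~0.682
print(Phi(2) - Phi(-2))    # ~0.954
print(Phi(3))              # ~0.999
```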

CDFs are the natural object for sampling. If $U \sim \text{Uniform}(0, 1)$ and $F$ is a CDF with inverse $F^{-1}$, then $X = F^{-1}(U)$ has CDF $F$. This is the inverse-CDF method, which we revisit in §4.13. It works for discrete and continuous variables alike: wherever you can compute (or invert) a CDF, you can sample.
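
In the discrete case the CDF is a staircase, so "inverting" it means finding where the uniform draw lands among the cumulative sums. A minimal sketch, reusing the hypothetical token PMF from earlier:

```python
import numpy as np

rng = np.random.default_rng(0)

pmf = np.array([0.5, 0.3, 0.15, 0.05])
cdf = np.cumsum(pmf)

u = rng.uniform(size=1_000_000)
samples = np.searchsorted(cdf, u)   # index of the first cdf value >= u

print(np.bincount(samples) / len(samples))   # ~ [0.5, 0.3, 0.15, 0.05]
```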

Transformations of random variables

In a neural network, every activation is a function of earlier activations, which are functions of inputs and weights. In other words, every internal random variable is a transformation of upstream random variables. Knowing how distributions change under transformations is therefore not a mathematical curiosity; it is the central calculation of generative modelling.

If $Y = g(X)$, the distribution of $Y$ is determined by $g$ and the distribution of $X$.

In the discrete case, summation suffices:

$$ P(Y = y) = \sum_{x : g(x) = y} P(X = x). $$

You collect together every $x$ that $g$ sends to the same $y$ and add up their probabilities.
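
The summation formula is a five-line loop. Here $X$ is a fair die and $g(x) = x \bmod 3$, a hypothetical choice made purely for illustration:

```python
from collections import defaultdict

# Push the PMF of X through g: collect every x that g sends to the same y.
pmf_X = {x: 1 / 6 for x in range(1, 7)}
g = lambda x: x % 3

pmf_Y = defaultdict(float)
for x, p in pmf_X.items():
    pmf_Y[g(x)] += p

print(dict(pmf_Y))   # each of 0, 1, 2 gets mass 1/3
```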

In the continuous case with $g$ monotonic and differentiable, the change-of-variables formula is

$$ p_Y(y) = p_X\!\bigl(g^{-1}(y)\bigr) \left| \frac{d g^{-1}(y)}{dy} \right|. $$

The Jacobian factor $|d g^{-1} / dy|$ accounts for the fact that $g$ may stretch some regions and compress others; if it doubles distances, the density there must halve to keep total probability at one.

Worked example: the inverse-CDF construction of an exponential. Let $X \sim \text{Uniform}(0, 1)$ and $Y = -\log(1 - X) / \lambda$. Then $Y$ is exponential with rate $\lambda$. You can check this by computing $P(Y \le y) = P(X \le 1 - e^{-\lambda y}) = 1 - e^{-\lambda y}$, which is the exponential CDF. The same Jacobian machinery gives a density check.
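
The construction is easy to verify empirically: build samples from uniforms and compare the empirical CDF with $1 - e^{-\lambda y}$ at a few points (the test points below are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0

u = rng.uniform(size=1_000_000)
y = -np.log(1 - u) / lam            # the worked example's transformation

for t in (0.5, 1.0, 2.0):
    print(np.mean(y <= t), 1 - np.exp(-lam * t))   # empirical vs exact CDF
```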

Worked example: log-normal. If $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y = e^X$, then $Y > 0$ always and the density on $(0, \infty)$ is $p_Y(y) = \frac{1}{y \sigma \sqrt{2\pi}} \exp\!\bigl(-(\log y - \mu)^2 / (2\sigma^2)\bigr)$. Multiplicative phenomena (file sizes, incomes, learning rates in Bayesian hyperparameter search) are often well modelled as log-normal, which is why we tune learning rates on a log grid.
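
As a numerical sanity check of the change-of-variables result (the parameters $\mu = 0$, $\sigma = 0.5$ and the histogram range are arbitrary choices), we can histogram samples of $e^X$ and compare against the density above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5

# Transform normal samples through exp, then compare a normalised histogram
# of Y = e^X with the log-normal density from the change-of-variables formula.
y = np.exp(rng.normal(mu, sigma, size=1_000_000))
hist, edges = np.histogram(y, bins=100, range=(0.01, 5.0), density=True)
mid = 0.5 * (edges[:-1] + edges[1:])

p = np.exp(-(np.log(mid) - mu) ** 2 / (2 * sigma**2)) / (mid * sigma * np.sqrt(2 * np.pi))
print(np.max(np.abs(hist - p)))   # small: the two curves agree
```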

The same Jacobian factor, generalised to a determinant in the multivariate case, is exactly what normalising flows track in deep generative models. By stacking invertible neural transformations and accumulating $\log|\det J|$ at each layer, you can compute exact likelihoods of complex distributions while remaining able to sample from them.
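
To make the bookkeeping concrete, here is a toy one-dimensional "flow": a stack of invertible affine maps $x = az + b$ with a standard normal base. This is a minimal sketch of the log-likelihood accumulation only, with hypothetical layer parameters; real flows use learned, nonlinear invertible layers.

```python
import numpy as np

layers = [(2.0, 1.0), (0.5, -3.0)]   # hypothetical (a, b) pairs, applied in order

def log_likelihood(x):
    """Exact log p(x): invert the stack, accumulating log|det J| of the inverse."""
    log_det = 0.0
    for a, b in reversed(layers):    # undo the last layer first
        x = (x - b) / a
        log_det += -np.log(abs(a))   # each affine inverse contributes -log|a|
    base = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)   # standard normal log-pdf
    return base + log_det

print(log_likelihood(0.0))
```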

Multivariate random variables

Most quantities we care about in machine learning are vectors. A random vector is just a tuple $\mathbf{X} = (X_1, \ldots, X_d)$ in which each component is itself a random variable on the same probability space. The joint distribution is described by a joint PMF $P(X_1 = x_1, \ldots, X_d = x_d)$ in the discrete case or a joint density $p(x_1, \ldots, x_d)$ in the continuous case, with the obvious analogue of the unit-mass requirement: a sum or integral over $\mathbb{R}^d$ that equals one.

From the joint distribution you can recover marginals by summing or integrating out the other variables, for instance $p_{X_1}(x_1) = \int p(x_1, x_2)\, dx_2$, and conditionals by dividing the joint by the relevant marginal. Independence is the special case where the joint factorises into a product of marginals. We treat all of this in detail in §4.6, and the multivariate Gaussian, which puts these ideas to work, in §4.10.
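
A small discrete example shows the mechanics; the joint table below is made up for illustration, with rows indexing $X_1$ and columns indexing $X_2$.

```python
import numpy as np

joint = np.array([[0.10, 0.25, 0.05],
                  [0.15, 0.25, 0.20]])
assert np.isclose(joint.sum(), 1.0)

p_x1 = joint.sum(axis=1)    # marginal of X1: sum out X2
p_x2 = joint.sum(axis=0)    # marginal of X2: sum out X1

cond = joint[0] / p_x1[0]   # conditional PMF of X2 given X1 = its first value
print(cond, cond.sum())     # a valid PMF: sums to one

# Independence would mean the joint equals the outer product of the marginals.
print(np.allclose(joint, np.outer(p_x1, p_x2)))   # False: these are dependent
```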

What you should take away

  1. A random variable is a function from outcomes to numbers. The randomness comes from which outcome occurs; once it does, the number is determined.
  2. Discrete variables are described by a PMF: non-negative numbers that sum to one. Continuous variables are described by a PDF: a non-negative function whose integral is one. Densities are not probabilities.
  3. The CDF $F(x) = P(X \le x)$ unifies the two cases and is the right object when you sample, when you want quantiles, or when distributions mix continuous and atomic parts.
  4. Transformations of random variables follow the change-of-variables formula, with a Jacobian factor in the continuous case. This is the engine behind normalising flows and behind every reparameterisation trick.
  5. Random vectors generalise everything from numbers to tuples; marginals come from summing or integrating out, conditionals from dividing by marginals, and independence is the case where the joint factorises.
