4.8 Inequalities

In probability, an inequality is a statement that pins down how much probability mass can sit in a particular place, without us having to know the exact distribution. Most of the time we cannot compute a tail probability such as $P(X \ge 100)$ exactly, because we do not know the full distribution of $X$. But if we know just one or two simple summaries (the mean, perhaps the variance, perhaps that the values are bounded), an inequality gives us a guaranteed upper bound on that tail. The bound may be loose, but it is always true. That guarantee is what makes inequalities the workhorse of theory: they convert vague phrases such as "with high probability" into concrete, finite-sample statements like "the empirical mean of a thousand independent samples differs from the true mean by more than 0.05 with probability at most 1.3%".

A second use is concentration. When we average many independent random quantities, the average tends to cluster tightly around its expectation. The inequalities below quantify how tight that cluster is: for instance, the probability that the average of $n$ independent samples wanders more than $t$ from the true mean falls off exponentially in $n$. Concentration is the mathematical reason that statistical learning works: a finite training set is enough to estimate something true about the underlying distribution.

In §4.7 we introduced expectation and variance. This section (§4.8) uses those summaries to bound probabilities. In §4.9 we will let the sample size grow and recover the law of large numbers and the central limit theorem as limiting consequences.

Symbols Used Here
  • $\mathbb{E}[X]$: expectation
  • $\text{Var}(X)$: variance
  • $|X|$: absolute value
  • $P(\cdot)$: probability

Markov's inequality

For any non-negative random variable $X \ge 0$ and any positive threshold $a > 0$: $$P(X \ge a) \le \frac{\mathbb{E}[X]}{a}.$$

In words: the probability that $X$ is at least as large as $a$ cannot exceed the mean of $X$ divided by $a$. It is the most basic of the probability inequalities, and the other tail bounds in this section (Chebyshev and Hoeffding) are built out of it.

The proof is one line. Because $X$ is non-negative, $X \ge a \cdot \mathbb{1}\{X \ge a\}$, where the indicator is 1 when the event holds and 0 otherwise. Take expectations of both sides: $\mathbb{E}[X] \ge a \cdot P(X \ge a)$. Rearrange.

Worked example. Suppose the average exam score across a cohort is 70 out of 100. What is the probability that a randomly chosen student scored at least 90? Markov gives $P(X \ge 90) \le 70/90 \approx 0.78$. That is a very weak bound: it allows for the possibility that 78% of the class scored 90 or above, and in reality the number is almost certainly far smaller. But the bound came for free: we used only the mean, assuming nothing about the distribution's shape, its variance, or its support. Markov is a blunt instrument, but it always applies when $X \ge 0$, and it costs nothing.
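To see the slack concretely, here is a minimal simulation sketch. The score distribution below (a Gaussian with mean 70 and standard deviation 10, clipped to the exam range) is invented for illustration; the text specifies only the mean. Whatever distribution we pick, the empirical tail must sit below the Markov bound.

```python
import random

# Hypothetical score distribution (not specified in the text): Gaussian with
# mean 70 and standard deviation 10, clipped to the exam range [0, 100].
random.seed(0)
scores = [min(100.0, max(0.0, random.gauss(70, 10))) for _ in range(100_000)]

mean = sum(scores) / len(scores)
tail = sum(s >= 90 for s in scores) / len(scores)
bound = mean / 90  # Markov: P(X >= 90) <= E[X] / 90

print(f"mean score          : {mean:.2f}")
print(f"P(X >= 90) empirical: {tail:.4f}")   # around 0.02 for this choice
print(f"Markov upper bound  : {bound:.4f}")  # around 0.78
# Markov holds exactly for the empirical distribution, so this never fires.
assert tail <= bound
```

For this particular (made-up) distribution the true tail is roughly 0.02, some forty times below the 0.78 guarantee; the guarantee is weak precisely because it must survive every distribution with mean 70.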

The inequality fails for random variables that can go negative: a quantity with mean zero can have a huge tail in either direction without contradiction. To handle two-sided tails we apply Markov to a non-negative function of $X$, such as $(X - \mu)^2$. That is the trick that produces Chebyshev.

Chebyshev's inequality

For any random variable with finite mean $\mu$ and finite variance $\sigma^2$, and any $k > 0$: $$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$

In words: the probability of being more than $k$ standard deviations from the mean is at most $1/k^2$, regardless of the distribution.

The derivation is short. Apply Markov's inequality to the non-negative variable $(X - \mu)^2$ at the threshold $a = k^2 \sigma^2$: $$P((X - \mu)^2 \ge k^2 \sigma^2) \le \frac{\mathbb{E}[(X - \mu)^2]}{k^2 \sigma^2} = \frac{\sigma^2}{k^2 \sigma^2} = \frac{1}{k^2}.$$

The event $(X - \mu)^2 \ge k^2 \sigma^2$ is the same as $|X - \mu| \ge k \sigma$, which gives the Chebyshev bound.

Worked example. Set $k = 3$. Chebyshev says that for any distribution with finite variance, no more than $1/9 \approx 0.111$ of the probability mass can lie more than three standard deviations from the mean. For a Gaussian, the actual three-sigma tail is $\approx 0.0027$, about forty times smaller. So Chebyshev is loose by a wide margin when the distribution has thin tails, but it makes no Gaussian assumption. It works for a heavy-tailed distribution, a discrete distribution, or any wild distribution we care to construct, as long as $\sigma^2$ is finite.
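A quick numerical comparison, using nothing beyond the standard library: for a standard Gaussian the exact two-sided tail $P(|Z| \ge k)$ is $\operatorname{erfc}(k/\sqrt{2})$, which we can set against Chebyshev's distribution-free $1/k^2$.

```python
import math

# Chebyshev's distribution-free bound vs. the exact Gaussian two-sided tail.
# For a standard Gaussian, P(|Z| >= k) = erfc(k / sqrt(2)).
for k in (2, 3, 4):
    chebyshev = 1 / k**2
    gaussian = math.erfc(k / math.sqrt(2))
    print(f"k={k}: Chebyshev bound {chebyshev:.4f}, "
          f"Gaussian tail {gaussian:.6f}, slack {chebyshev / gaussian:,.0f}x")
```

At $k = 3$ the printout reproduces the roughly forty-fold gap quoted above, and the slack widens rapidly as $k$ grows, because the Gaussian tail decays exponentially while Chebyshev decays only polynomially.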

That distribution-freeness is the point. In machine learning we frequently bound the deviation of an empirical estimator from its true value without knowing the estimator's distribution. Chebyshev is enough to do that. It is also the easiest inequality to use as a sanity check: if a method's variance is bounded, Chebyshev immediately gives a polynomial-in-$1/\delta$ confidence interval. Setting the right-hand side equal to a target $\delta$ and solving gives $k = 1/\sqrt{\delta}$, so a 95% interval ($\delta = 0.05$) covers $\mu \pm \sqrt{20}\,\sigma \approx \mu \pm 4.47\,\sigma$. Conservative, yes; but always valid.

Jensen's inequality

For any convex function $f$ and any random variable $X$ for which both expectations exist: $$f(\mathbb{E}[X]) \le \mathbb{E}[f(X)].$$

If $f$ is concave, the inequality reverses: $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$.

Geometrically, a convex function lies below every chord of its graph, so applying $f$ first and then averaging picks up the curvature, while averaging first and then applying $f$ does not. The gap between the two sides grows with the curvature of $f$ and the dispersion of $X$; for $f(x) = x^2$ it is exactly the variance.

Worked example. Take $f(x) = x^2$, which is convex. Jensen gives $(\mathbb{E}[X])^2 \le \mathbb{E}[X^2]$. Rearrange to $\mathbb{E}[X^2] - (\mathbb{E}[X])^2 \ge 0$, and the left-hand side is exactly $\text{Var}(X)$. So Jensen recovers the elementary fact that variance is non-negative, with equality only when $X$ is a constant.
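A short sketch makes the gap tangible. The exponential distribution below is an arbitrary choice (any distribution with finite variance would do): the difference between $\mathbb{E}[X^2]$ and $(\mathbb{E}[X])^2$ is the variance, and it is never negative.

```python
import random

# Any distribution with finite variance works; Exp(1) is an arbitrary choice,
# with mean 1, E[X^2] = 2, and therefore Var(X) = 1.
random.seed(1)
xs = [random.expovariate(1.0) for _ in range(200_000)]

mean = sum(xs) / len(xs)
mean_sq = sum(x * x for x in xs) / len(xs)

# Jensen with convex f(x) = x^2: f(E[X]) <= E[f(X)], and the gap is Var(X).
print(f"f(E[X]) = {mean**2:.4f}")            # ~1 for Exp(1)
print(f"E[f(X)] = {mean_sq:.4f}")            # ~2 for Exp(1)
print(f"gap     = {mean_sq - mean**2:.4f}")  # ~1 = Var(X)
```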

Where this matters in AI:

  • Evidence lower bound (ELBO). Variational inference writes $\log p(\mathbf{x}) = \log \int q(\mathbf{z})\, \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})}\, d\mathbf{z}$ for a variational distribution $q(\mathbf{z})$ and applies Jensen to the concave $\log$ to bound the marginal likelihood from below by a tractable expectation. Maximising that lower bound, the ELBO, over $q(\mathbf{z})$ is the foundational manoeuvre behind the variational autoencoder.
  • KL divergence is non-negative. $\text{KL}(p \| q) = \mathbb{E}_p[\log(p/q)] = -\mathbb{E}_p[\log(q/p)] \ge -\log \mathbb{E}_p[q/p] = -\log 1 = 0$, where the inequality is Jensen applied to the concave $\log$. KL is therefore a valid (asymmetric) measure of distance between distributions; a numerical check follows this list.
  • Log-sum-exp tricks. The convexity of $\exp$ underwrites the numerical stabilisation tricks used to compute softmax and cross-entropy without overflow.
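As promised, a numerical check of the KL bullet. The dimension (10) and the number of trials (5) are arbitrary choices; Jensen guarantees every printed value is non-negative.

```python
import math
import random

random.seed(2)

def random_dist(n):
    """A random probability vector of length n (entries strictly positive)."""
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# KL(p || q) = sum_i p_i * log(p_i / q_i); Jensen guarantees it is >= 0,
# with equality only when p and q coincide.
for trial in range(5):
    p, q = random_dist(10), random_dist(10)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    print(f"trial {trial}: KL(p || q) = {kl:.4f}")
    assert kl >= 0
```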

Jensen connects pure curvature to information-theoretic identities, and its consequences are everywhere in modern probabilistic deep learning.

Hoeffding's inequality

For independent bounded random variables $X_1, \ldots, X_n$ with $a_i \le X_i \le b_i$, and the sample mean $\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$: $$P\!\left(|\bar X - \mathbb{E}[\bar X]| \ge t\right) \le 2 \exp\!\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right).$$

This is the workhorse concentration inequality. Where Chebyshev gives a polynomial $1/k^2$ tail, Hoeffding gives an exponential one, provided the summands are bounded. The bound is the foundation of generalisation theory in machine learning: it lets you say that the difference between training error and true error is, with high probability, no more than a quantity that shrinks as $1/\sqrt{n}$.

For the special case where each $X_i \in [0,1]$, the formula simplifies to $$P(|\bar X - \mu| \ge t) \le 2 \exp(-2 n t^2),$$ which is the form most often seen in practice.

Worked example. Flip a coin $n$ times and use $\bar X$ to estimate the unknown probability $p$ of heads. We want the estimate to be within $\pm 0.05$ of the truth.

At $n = 100$: $P(|\bar X - p| \ge 0.05) \le 2 \exp(-2 \cdot 100 \cdot 0.05^2) = 2 \exp(-0.5) \approx 1.21$. The bound exceeds 1, so it is vacuous: Hoeffding tells us nothing useful at this sample size.

At $n = 1000$: $P(|\bar X - p| \ge 0.05) \le 2 \exp(-2 \cdot 1000 \cdot 0.05^2) = 2 \exp(-5) \approx 0.013$. Now the bound is meaningful: with probability at least 98.7%, a thousand flips put us within five percentage points of the true probability. The exponential decay is why sample-size requirements in learning theory are so much milder than Chebyshev would suggest.
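The same two sample sizes can be checked by simulation. The sketch below fixes a fair coin ($p = 0.5$, an assumption; the text leaves $p$ unknown) and estimates the deviation probability over repeated experiments; the empirical frequency must sit below the Hoeffding bound.

```python
import math
import random

random.seed(3)
p, t, trials = 0.5, 0.05, 10_000  # p = 0.5 is an assumed bias for the demo

for n in (100, 1000):
    bound = 2 * math.exp(-2 * n * t * t)  # Hoeffding for X_i in [0, 1]
    misses = sum(
        abs(sum(random.random() < p for _ in range(n)) / n - p) >= t
        for _ in range(trials)
    )
    print(f"n={n:4d}: empirical {misses / trials:.4f}  vs  bound {bound:.4f}")
```

For a fair coin the empirical deviation probability should come out around 0.37 at $n = 100$ (well below the vacuous 1.21) and around 0.002 at $n = 1000$ (comfortably inside the 0.013 guarantee): the bound is conservative, but it is distribution-free over everything bounded in $[0, 1]$.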

Hoeffding underwrites PAC learning bounds, the upper-confidence-bound (UCB) family of bandit algorithms, and most non-asymptotic confidence intervals in modern statistics.

Cauchy-Schwarz inequality

For any random variables $X$ and $Y$ with finite second moments: $$|\mathbb{E}[XY]| \le \sqrt{\mathbb{E}[X^2]\, \mathbb{E}[Y^2]}.$$

Apply this to the centred variables $X - \mu_X$ and $Y - \mu_Y$ and divide by $\sigma_X \sigma_Y$, and you obtain $|\rho_{XY}| \le 1$: the correlation coefficient is always between $-1$ and $+1$. The same inequality bounds inner products in any inner-product space, which is why it appears whenever angles, projections, or kernel similarities are at stake. In AI, it gives the immediate fact that cosine similarity is bounded, and it shows up in the analysis of attention, kernel methods, and any moment-matching estimator.
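A minimal sketch with an invented correlated pair (the coefficients 0.6 and 0.8 are chosen so that $Y$ has unit variance and a true correlation of 0.6 with $X$): the empirical inner product can never exceed the Cauchy-Schwarz ceiling.

```python
import math
import random

# An invented correlated pair: Y = 0.6 X + 0.8 Z with X, Z independent
# standard Gaussians, so Var(Y) = 0.36 + 0.64 = 1 and corr(X, Y) = 0.6.
random.seed(4)
n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.6 * x + 0.8 * random.gauss(0, 1) for x in xs]

e_xy = sum(x * y for x, y in zip(xs, ys)) / n
e_x2 = sum(x * x for x in xs) / n
e_y2 = sum(y * y for y in ys) / n

print(f"|E[XY]|             = {abs(e_xy):.4f}")               # ~0.6
print(f"sqrt(E[X^2] E[Y^2]) = {math.sqrt(e_x2 * e_y2):.4f}")  # ~1.0
# Both variables are centred, so this ratio is the correlation coefficient.
print(f"correlation         = {e_xy / math.sqrt(e_x2 * e_y2):.4f}")
```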

Where inequalities appear in AI

  • Generalisation bounds (PAC learning). Hoeffding and union bounds turn finite-sample empirical risk into a meaningful estimate of true risk over a hypothesis class.
  • Concentration of stochastic gradients. SGD analyses use Hoeffding-type or martingale concentration to show that a mini-batch gradient is close to the full-data gradient with high probability.
  • Martingale methods in optimisation. Azuma-Hoeffding and Freedman's inequality extend concentration to dependent sequences and underpin the convergence proofs for adaptive optimisers.
  • KL $\ge 0$ from Jensen. Variational inference, mutual information bounds, and the non-negativity of cross-entropy minus entropy all flow from a single application of Jensen to $\log$.
  • Convergence proofs for SGD. Convex analysis combines Jensen-type inequalities with Markov to give explicit rates of convergence in expectation and in probability.

These tools are not a separate branch of probability; they are how almost every theoretical guarantee in modern machine learning is actually proved.

What you should take away

  1. Markov gives a free bound when only the mean is known: $P(X \ge a) \le \mathbb{E}[X]/a$ for non-negative $X$. Loose, but it always holds.
  2. Chebyshev gives a $1/k^2$ tail for any distribution with finite variance. Distribution-free; a polynomial bound.
  3. Jensen relates curvature to expectation: for convex $f$, $f(\mathbb{E}[X]) \le \mathbb{E}[f(X)]$; for concave $f$ the inequality reverses. This single fact underwrites the ELBO and KL $\ge 0$.
  4. Hoeffding gives exponential concentration when the summands are bounded. This is the engine of generalisation theory and PAC learning.
  5. Cauchy-Schwarz bounds inner products and correlations; it is why cosine similarity lives in $[-1, 1]$ and why moment-matching estimators behave well.
