5.8 Hypothesis Testing

Hypothesis testing is the formal procedure for asking a single, blunt question: is the difference I am looking at real, or could it plausibly be noise? Every clinical trial that compares a new drug against a placebo, every A/B test that pits a redesigned checkout button against the old one, and every benchmark table that claims one model beats another leans on this framework. It gives a disciplined way to convert a vague "looks better to me" into a number that other people can argue with.

The framework is also one of the most widely misused tools in modern science. Whole subfields have had to rerun studies because researchers misread what a p-value means, or kept testing until something looked significant. This section gives the mechanics so that you can read a results table without being bluffed, and it flags the most common pitfalls so you do not commit them yourself.

Where confidence intervals (§5.7) ask "what values are plausible?", hypothesis testing asks "is this particular value plausible?". The two are linked: a confidence interval that excludes a hypothesised value is equivalent to rejecting that value at the matching significance level.

Symbols Used Here
$H_0$ : null hypothesis
$H_1$ : alternative hypothesis
$\alpha$ : significance level (typically 0.05)
$\beta$ : Type II error rate
$1 - \beta$ : power
$p$-value : probability under $H_0$ of seeing data at least as extreme
$T$ : test statistic

The framework

The procedure has four steps and they always run in the same order.

  1. State $H_0$ and $H_1$. By convention, $H_0$ is the boring claim: "no effect", "no difference", "the new model is no better than the old one". $H_1$ is the interesting claim that the researcher is hoping to support. The asymmetry is deliberate: the framework starts by assuming nothing is going on and only abandons that assumption if the evidence is strong enough.
  2. Compute a test statistic $T$. This is a single number summarising how far the data have moved from what $H_0$ would predict. A common shape is "observed minus expected, divided by standard error", which measures the distance in noise units.
  3. Compute the $p$-value. The $p$-value is $P(T \ge t_{\text{observed}} \mid H_0)$, the probability of seeing a test statistic at least as extreme as the one you got, assuming the null is true; for a two-sided test, "at least as extreme" means $|T| \ge |t_{\text{observed}}|$. Small $p$ means the data would be surprising if the null were correct.
  4. Compare to $\alpha$. If $p < \alpha$, reject $H_0$. The conventional choice $\alpha = 0.05$ is a historical accident, not a law of nature; some fields use 0.01 or 0.001.
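
Here is a minimal sketch of the four steps in Python (using NumPy, which the rest of this section does not otherwise assume). All numbers are invented, and to keep the code short the noise scale is treated as known, so this is the z-test flavour of the procedure; the p-value is approximated by simulating the statistic under $H_0$ rather than by looking it up in a table.

```python
# A minimal sketch of the four-step ritual on invented data.
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 1.0, 25          # noise scale assumed known, for simplicity

# Step 1: state the hypotheses.  H0: the true mean is 5.  H1: it is not.
mu0 = 5.0

# Observed data.  (Drawn here with true mean 5.4, so H1 happens to be true,
# but the test does not get to know that.)
data = rng.normal(loc=5.4, scale=sigma, size=n)

# Step 2: the test statistic -- observed minus expected, in noise units.
se = sigma / np.sqrt(n)
t_obs = (data.mean() - mu0) / se

# Step 3: the p-value -- simulate the statistic's distribution under H0 and
# ask how often it is at least as extreme as the observed one (two-sided).
null_stats = (rng.normal(loc=mu0, scale=se, size=100_000) - mu0) / se
p_value = np.mean(np.abs(null_stats) >= abs(t_obs))

# Step 4: compare to alpha.
alpha = 0.05
print(f"T = {t_obs:.2f}, p = {p_value:.4f}, reject H0: {p_value < alpha}")
```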

A worked example. You run an A/B test on a checkout page. The control group, with 1000 visitors, converts at 4%. The treatment group, also 1000 visitors, converts at 5%. Your manager wants to know whether the treatment is genuinely better, or whether the gap is the kind of wobble you would see between any two random samples of the same population.

The null hypothesis $H_0$ is that the two true conversion rates are equal; the apparent gap is sampling noise. The alternative $H_1$ is that the treatment really is better. The test statistic compares the two observed proportions, scaled by their standard error. We will compute it precisely later in the section. For now, observe that even before any arithmetic, two questions matter: how big is the gap, and how noisy is the measurement? A one percentage point gap could be very convincing if the noise floor is tiny, and entirely unconvincing if the noise floor is large. The whole machinery of hypothesis testing is just a careful way of working out how large the noise floor is for your specific sample size, and asking whether the observed gap is comfortably above it.

A subtle but important point: failing to reject $H_0$ is not the same as proving it true. The framework only ever says "the data are not surprising enough to abandon the null"; it never says "the null is correct". This asymmetry matches the spirit of scientific scepticism (extraordinary claims require extraordinary evidence), but it routinely confuses people who report a non-significant result as "no effect".

Two types of error

Any binary decision based on noisy data can go wrong in two ways. The framework names them.

|            | Reject $H_0$ | Fail to reject $H_0$ |
|------------|--------------|----------------------|
| $H_0$ true | Type I error (false positive) | Correct |
| $H_1$ true | Correct (true positive) | Type II error (false negative) |

A Type I error is a false alarm: you declare an effect when there is none. The significance level $\alpha$ is, by construction, the probability of a Type I error when $H_0$ is true. Setting $\alpha = 0.05$ is a promise: if I run many tests on data where nothing is really happening, I will incorrectly cry "effect!" about 5% of the time.

A Type II error is a missed signal: a real effect exists, but the test failed to find it. Its probability is $\beta$. The complement, $1 - \beta$, is power, the probability of correctly detecting a real effect.

Power depends on four things: the size of the true effect, the sample size, the variability of the measurement, and the chosen $\alpha$. Bigger effects, larger samples, less noise, and a more permissive threshold all push power up.

There is an unavoidable tradeoff. If you tighten $\alpha$ from 0.05 to 0.001 to be stricter about false alarms, you will catch fewer of the real effects too: power drops. If you loosen $\alpha$, you catch more real effects but also raise more false alarms. There is no setting that makes both error types small without also growing the sample size.

This is why power analysis matters before you collect data. Conventional practice is to aim for power of at least 0.80, an 80% chance of detecting the effect you care about, if it exists. Underpowered studies waste effort: they fail to find effects that are really there, leaving the researcher unable to distinguish "no effect" from "I did not look hard enough". In machine learning, where it is tempting to compare models on a handful of seeds, this trap is everywhere. A new model that is genuinely 0.3% better on a benchmark with a 0.5% seed-to-seed wobble will fail to reach significance with three runs, no matter how careful the rest of the pipeline is.
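
To make the seed example concrete, here is a rough power simulation in Python (NumPy and SciPy; the score scale, the 0.3 point gain, and the 0.5 point seed-to-seed spread are the illustrative numbers from the paragraph above, not real benchmark data).

```python
# Power of a seed-to-seed model comparison, estimated by simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(n_seeds, true_gain=0.3, seed_sd=0.5, alpha=0.05, n_sims=10_000):
    hits = 0
    for _ in range(n_sims):
        old = rng.normal(70.0, seed_sd, size=n_seeds)               # baseline runs
        new = rng.normal(70.0 + true_gain, seed_sd, size=n_seeds)   # genuinely better
        _, p = stats.ttest_ind(new, old, equal_var=False)           # Welch's t-test
        hits += p < alpha
    return hits / n_sims

for n in (3, 10, 30, 100):
    print(f"{n:>3} seeds: power = {power(n):.2f}")
```

Running it shows that a handful of seeds gives power far below the conventional 0.80 target, while the same comparison with dozens of runs per model detects the gain reliably.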

The asymmetry between $\alpha$ and $\beta$ is also worth dwelling on. Convention treats Type I errors as the more serious offence (a regulator does not want to approve a useless drug), and so $\alpha$ is fixed in advance at a small value while $\beta$ is allowed to float. In settings where missing a real effect is the costlier mistake (screening tests, safety-critical alerts), this default deserves to be questioned.

One-sample tests

The simplest setting is comparing a single sample mean to a hypothesised value $\mu_0$.

When the population variance $\sigma^2$ is known (rare, but it happens with calibrated instruments), the z-test applies:

$$T = \frac{\bar X - \mu_0}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1) \text{ under } H_0.$$

When the variance is estimated from the data, the usual case, you must use the t-test:

$$T = \frac{\bar X - \mu_0}{\hat\sigma / \sqrt{n}} \sim t_{n-1} \text{ under } H_0.$$

The denominator is the standard error of the mean. The t-distribution has heavier tails than the normal, reflecting the extra uncertainty introduced by estimating $\sigma$. As $n$ grows, the t-distribution converges to the normal, and for $n > 30$ or so the two give nearly identical answers.

Worked example. You have $n = 25$ measurements with sample mean $\bar X = 5.2$ and sample standard deviation $\hat\sigma = 1.0$. You want to test $H_0: \mu = 5$ against the two-sided alternative $H_1: \mu \neq 5$.

The standard error is $\hat\sigma / \sqrt{n} = 1.0 / 5 = 0.2$. The test statistic is

$$T = \frac{5.2 - 5}{0.2} = \frac{0.2}{0.2} = 1.0.$$

With 24 degrees of freedom, the two-sided p-value for $T = 1.0$ is approximately $0.327$. That is much larger than 0.05, so we fail to reject $H_0$. The data are entirely consistent with $\mu = 5$; the gap between 5.0 and 5.2 is exactly the kind of wobble we would expect from a sample of 25 with this much spread.

This example also illustrates why effect size and sample size both matter. The same $\bar X = 5.2$ and $\hat\sigma = 1.0$ with $n = 400$ would give $T = 4.0$ and a tiny p-value: the difference would now be far above the noise floor. The data have not changed; only our certainty about them has.
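
The same arithmetic in Python, working from the summary statistics alone (SciPy's `scipy.stats.t` provides the tail probability; `ttest_1samp` would do the same job if the raw measurements were available):

```python
# One-sample t-test from summary statistics.
import numpy as np
from scipy import stats

n, xbar, s, mu0 = 25, 5.2, 1.0, 5.0

se = s / np.sqrt(n)                               # standard error = 0.2
t_stat = (xbar - mu0) / se                        # = 1.0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-sided, df = 24

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")     # t = 1.00, p roughly 0.33

# The same effect size with n = 400 sits far above the noise floor:
se_400 = s / np.sqrt(400)
t_400 = (xbar - mu0) / se_400                     # = 4.0
print(f"t = {t_400:.1f}, p = {2 * stats.t.sf(t_400, df=399):.5f}")
```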

Two-sample tests

Most real applications compare two groups, not a single group against a fixed value. The two-sample t-test asks whether two means differ:

$$T = \frac{\bar X_1 - \bar X_2}{\text{SE}(\bar X_1 - \bar X_2)}.$$

Under $H_0$ (equal means), $T$ is approximately t-distributed. The standard error in the denominator combines the variability of the two groups. A common form, the pooled standard error, assumes both groups have the same true variance and combines them; Welch's t-test drops that assumption and is the safer default when group variances may differ.

For comparing two proportions (as in A/B testing), the formula simplifies. With sample sizes $n_A, n_B$ and observed proportions $\hat p_A, \hat p_B$, the test statistic is

$$T = \frac{\hat p_A - \hat p_B}{\sqrt{\hat p (1 - \hat p) (1/n_A + 1/n_B)}},$$

where $\hat p$ is the pooled proportion across both groups.

Returning to the A/B test from earlier: $n_A = n_B = 1000$, $\hat p_A = 0.05$, $\hat p_B = 0.04$. The pooled proportion is $\hat p = 0.045$. The standard error is

$$\sqrt{0.045 \times 0.955 \times (1/1000 + 1/1000)} = \sqrt{0.045 \times 0.955 \times 0.002} \approx 0.00927.$$

The test statistic is $T = 0.01 / 0.00927 \approx 1.08$, which corresponds to a two-sided p-value of about 0.28. The one percentage point difference is not statistically significant at $n = 1000$. The data are consistent with the null, and a wise team would either run the test for longer or accept that the design change has no detectable effect.
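
Here is the same calculation as a short Python sketch (NumPy and SciPy assumed; the conversion counts 50/1000 and 40/1000 are just the percentages above turned into whole numbers):

```python
# Pooled two-proportion z-test for the A/B example.
import numpy as np
from scipy import stats

n_a, x_a = 1000, 50    # treatment: 5% conversion
n_b, x_b = 1000, 40    # control:   4% conversion

p_a, p_b = x_a / n_a, x_b / n_b
p_pool = (x_a + x_b) / (n_a + n_b)                            # 0.045

se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))     # about 0.00927
z = (p_a - p_b) / se                                          # about 1.08
p_value = 2 * stats.norm.sf(abs(z))                           # about 0.28

print(f"z = {z:.2f}, p = {p_value:.2f}")
```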

When the same experimental units are measured twice (the same patients before and after, or the same benchmark items scored by two models), use a paired test. Pairing controls for unit-level variation and dramatically increases power; running an unpaired test on paired data wastes information.
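
A small simulation makes the point; the numbers are invented, with large item-to-item difficulty swings and a small, consistent gain for the second model. SciPy's `ttest_rel` (paired) and `ttest_ind` (unpaired) are used side by side.

```python
# Paired vs unpaired comparison of two models on the same benchmark items.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

difficulty = rng.normal(0.0, 2.0, size=30)                     # big item-to-item variation
model_a = difficulty + rng.normal(0.0, 0.2, size=30)
model_b = difficulty + 0.15 + rng.normal(0.0, 0.2, size=30)    # small, consistent gain

_, p_unpaired = stats.ttest_ind(model_b, model_a)   # ignores the pairing
_, p_paired = stats.ttest_rel(model_b, model_a)     # uses it

print(f"unpaired p = {p_unpaired:.3f}, paired p = {p_paired:.4f}")
```

The paired test works on the per-item differences, which removes the item-difficulty term entirely, so the same small gain stands out far more clearly than in the unpaired comparison.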

Multiple testing correction

Suppose you test 100 independent hypotheses, all at $\alpha = 0.05$, in a setting where every null is true. By construction, you expect about 5 of them to come out "significant" purely by chance. Run 1000 such tests and you should expect about 50 false positives. This is the multiple testing problem: the more comparisons you run, the more spurious findings you collect.
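
You can watch this happen in a few lines of Python (NumPy and SciPy assumed; each "study" here is a one-sample t-test on pure noise, so every rejection is a false positive):

```python
# 1000 tests where every null is true: roughly 5% come out "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

false_positives = 0
for _ in range(1000):
    noise = rng.normal(0.0, 1.0, size=30)          # H0 is true: the mean really is 0
    _, p = stats.ttest_1samp(noise, popmean=0.0)
    false_positives += p < 0.05

print(f"{false_positives} spurious 'discoveries' out of 1000")   # roughly 50
```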

Several corrections control this.

The Bonferroni correction is the bluntest: divide $\alpha$ by the number of tests $m$, then declare each $H_i$ significant only if $p_i < \alpha / m$. This controls the family-wise error rate, the probability of any false positive across the family, at $\alpha$. It is conservative; with thousands of tests it can become so strict that real effects get missed.

The Benjamini–Hochberg procedure controls the false discovery rate: the expected fraction of declared findings that are wrong. Rank the $p$-values from smallest to largest, find the largest rank $k$ for which $p_{(k)} \le k \alpha / m$, and reject every hypothesis up to and including that rank. FDR control allows more discoveries than Bonferroni, accepting that a small fraction will be false, which is sensible in exploratory settings.

Holm–Bonferroni is a sequential refinement: it applies Bonferroni-style thresholds in rank order, gaining a little power without losing family-wise error control.
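
Below is a sketch of the first two corrections applied to a made-up batch of p-values; in practice `statsmodels.stats.multitest.multipletests` implements all three (and more), but the logic fits in a few lines.

```python
# Bonferroni and Benjamini-Hochberg on a toy batch of p-values.
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i < alpha / m (controls the family-wise error rate)."""
    pvals = np.asarray(pvals)
    return pvals < alpha / len(pvals)

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with
    p_(k) <= k * alpha / m (controls the false discovery rate)."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = np.arange(1, m + 1) * alpha / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])       # largest rank passing its threshold
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.012, 0.041, 0.049, 0.2, 0.6]
print("Bonferroni:", bonferroni(pvals))      # only the smallest survives alpha/m
print("BH:        ", benjamini_hochberg(pvals))   # the three smallest survive
```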

These corrections are everywhere. Genome-wide association studies test millions of variants and would otherwise be a sea of false positives. Large clinical trials with secondary endpoints adjust to avoid claiming effects that arose by chance. A/B testing platforms running hundreds of simultaneous experiments need FDR control to keep the discovery stream calibrated. In machine learning, multiple testing arises whenever you run a hyperparameter sweep, evaluate per-class accuracy across many classes, or compare a new method against many baselines: every additional comparison is another chance to declare a winner that does not generalise.

A practical consequence: keep a final test set untouched until the very end, fix your evaluation protocol before looking at the data, and report the family-wise or false-discovery-rate adjusted significance whenever you make many comparisons. None of this prevents discovery; it just keeps the rate of fake discoveries under control.

Common pitfalls

The framework is robust on paper and fragile in practice. The first three items below are misuses that come up again and again; the last two are the habits that guard against them.

$p$-hacking. Running many tests, fishing for one that crosses the threshold, then reporting only that one. Variants include trying many subgroups, many outcome definitions, or many model specifications until something works. Every degree of researcher freedom inflates the false-positive rate beyond the nominal $\alpha$.

Confusing $p$ with $P(H_0 \mid \text{data})$. A p-value is the probability of the data under the null; it is not the probability that the null is true given the data. The American Statistical Association's 2016 guidance is unusually direct on this point: p-values do not measure the probability of a hypothesis. Confusing the two is a category error that has propagated through textbooks and press releases for decades.

Statistical versus practical significance. With a large enough sample, vanishingly small effects become "significant". A drug that lowers blood pressure by 0.1 mmHg in a trial of a million people will produce an emphatic p-value. Whether anyone should care is a separate question. Always pair a p-value with an effect size and a confidence interval.

Pre-registration. Declare your hypotheses, sample size, and analysis plan before you collect data. This stops the unconscious slide from "exploring" to "confirming" within a single dataset, which is the engine room of $p$-hacking.

Power before testing. Aim for at least 80% power against the smallest effect that would matter. An underpowered study cannot conclude "there is no effect"; it can only conclude "I could not see one". Many famous "failed replications" are simply studies that lacked the power to find the effect they set out to test.

Bayesian alternatives

What people usually want from a hypothesis test is a statement like "given the data, how probable is each hypothesis?". That is a Bayesian question, and frequentist p-values do not answer it.

The Bayesian alternative is the Bayes factor, the ratio of marginal likelihoods under the two hypotheses:

$$\operatorname{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)}.$$

A Bayes factor of 10 means the data are ten times more likely under $H_1$ than under $H_0$. Combined with prior odds, it gives posterior odds, the answer to the question that p-values are routinely misread as answering. Bayes factors can also support the null directly: a value much less than 1 is positive evidence for $H_0$, something a p-value can never provide.

Bayes factors penalise complexity automatically through the marginal likelihood, which averages over the prior: flexible models that could fit anything pay for that flexibility. This is Occam's razor built into the arithmetic: a model that hedges its bets across a wide prior cannot also concentrate prediction mass on the data, and so loses to a tighter model that gets the data right. Some clinical-trial regulators now accept Bayesian designs, and they are spreading in machine learning model comparison, particularly in Gaussian processes and Bayesian deep learning where evidence-based selection is becoming standard practice. The cost is computational: marginal likelihoods are usually intractable and require Monte Carlo or variational approximation. But for small, principled comparisons, the answer they give is the one practitioners actually wanted from the start.
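
For one concrete case the arithmetic is fully tractable: a point null on a coin (or a conversion rate) against a uniform prior on the alternative. The example below is illustrative, not from the text; with a Beta(1, 1) prior the marginal likelihood has a closed form, so SciPy's binomial pmf is all that is needed.

```python
# Bayes factor for a point null: H0 says p = 0.5, H1 puts a Beta(1, 1) prior on p.
from scipy import stats

n, k = 100, 65        # invented data: 65 heads in 100 flips

# Marginal likelihood under H1: the integral of Binom(k | n, p) over the
# uniform prior, which works out to 1 / (n + 1) for a Beta(1, 1) prior.
m1 = 1.0 / (n + 1)

# Likelihood under the point null H0, with p fixed at 0.5.
m0 = stats.binom.pmf(k, n, 0.5)

bf_10 = m1 / m0
print(f"BF_10 = {bf_10:.1f}")   # roughly 11: the data favour H1 by about 11 to 1
```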

What you should take away

  1. Hypothesis testing is a four-step ritual: state $H_0$ and $H_1$, compute a test statistic, compute a $p$-value, compare to $\alpha$.
  2. The $p$-value is $P(T \ge t \mid H_0)$, not the probability that the null is true.
  3. There are two error types, false positives ($\alpha$) and false negatives ($\beta$), and you cannot shrink both without growing the sample.
  4. With many tests, false positives accumulate; correct with Bonferroni or Benjamini–Hochberg.
  5. Statistical significance is not practical significance; always pair p-values with effect sizes and confidence intervals.
