P-value, Glossary, Textbook of AI

The p-value is the probability, under an assumed null hypothesis $H_0$, of obtaining a test statistic at least as extreme as the one actually observed:

$$p \;=\; \Pr\bigl(T \geq t_{\text{obs}} \,\big|\, H_0\bigr)$$

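for a one-sided (upper-tail) test. A two-sided test counts extreme values in both tails, for example $\Pr\bigl(|T| \geq |t_{\text{obs}}| \,\big|\, H_0\bigr)$ when the null distribution of $T$ is symmetric about zero. A small p-value indicates that data this extreme would be unlikely if $H_0$ were true, and is therefore taken as evidence against it. Conventionally, a p-value below the significance level $\alpha$ (typically 0.05) leads to rejection of the null.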
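As a minimal sketch of the definition above, the following computes one-sided and two-sided p-values for the special case where the test statistic is standard normal under $H_0$ (a z-test). That distributional choice is an assumption made for illustration, and the function names are hypothetical; other tests use a different null distribution in the same way.

```python
import math

def normal_upper_tail(z: float) -> float:
    """Pr(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def p_value_one_sided(t_obs: float) -> float:
    """Upper-tail p-value: Pr(T >= t_obs | H0), assuming T ~ N(0, 1) under H0."""
    return normal_upper_tail(t_obs)

def p_value_two_sided(t_obs: float) -> float:
    """Two-sided p-value: Pr(|T| >= |t_obs| | H0), using symmetry of N(0, 1)."""
    return 2.0 * normal_upper_tail(abs(t_obs))

p_one = p_value_one_sided(1.96)
p_two = p_value_two_sided(1.96)
print(f"one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```

For an observed statistic of 1.96 the two-sided p-value sits almost exactly at the conventional 0.05 threshold, while the one-sided value is half of that.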

History

The concept dates to Ronald A. Fisher's Statistical Methods for Research Workers (1925), where the 0.05 threshold was first proposed, explicitly as a convention rather than a law of nature. Fisher saw the p-value as a measure of evidence against the null in a single experiment. The competing Neyman–Pearson framework (1933) reinterpreted hypothesis testing as a decision procedure with controlled long-run Type I and Type II error rates. The hybrid taught in modern statistics courses awkwardly combines both.

Common misinterpretations

The p-value is one of the most misinterpreted concepts in all of statistics. Crucially, the p-value is not:

The probability that the null hypothesis is true. That is the posterior probability $\Pr(H_0 \mid \text{data})$, which requires a prior and Bayes' theorem.
The probability that the observed result is due to chance. Under $H_0$, all of the result is "due to chance" by definition.
A measure of effect size. A p-value below 0.05 says nothing about whether the effect is large, small, or trivial.
Strong evidence on its own. When many tests are run, some low p-values will occur by chance alone. This is the multiple-comparisons problem, and its less visible form is the garden of forking paths.
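The multiple-comparisons point in the list above can be made concrete with a short calculation. The sketch below assumes m independent tests in which every null hypothesis is true, so each p-value is uniform on [0, 1]. Independence is a simplifying assumption, and the function names are hypothetical.

```python
import random

def familywise_error_rate(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one p < alpha among m independent true nulls."""
    return 1.0 - (1.0 - alpha) ** m

def simulate_fwer(m: int, alpha: float = 0.05,
                  n_experiments: int = 20_000, seed: int = 0) -> float:
    """Monte Carlo check: draw m uniform p-values per experiment and count
    the experiments in which at least one falls below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        if any(rng.random() < alpha for _ in range(m)):
            hits += 1
    return hits / n_experiments

for m in (1, 5, 20, 100):
    print(f"m = {m:3d}  analytic = {familywise_error_rate(m):.3f}  "
          f"simulated = {simulate_fwer(m):.3f}")
```

Under these assumptions, running 20 tests at the 0.05 level gives roughly a 64% chance of at least one "significant" result even though no real effect exists.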

Statistical significance does not imply practical significance: with sufficiently large sample sizes, even a trivially small effect yields a vanishingly small p-value. A p-value of 0.04 is not meaningfully different from a p-value of 0.06, despite the former being "significant" and the latter not.
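To illustrate the sample-size effect, the sketch below holds a trivially small effect fixed and increases n. It assumes a two-sided z-test on a mean with known standard deviation, and it idealises by setting the observed mean exactly equal to the true effect; the effect size of 0.01 standard deviations is a hypothetical value chosen for illustration.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

effect = 0.01   # hypothetical mean shift, in units of sigma
sigma = 1.0     # assumed known standard deviation
sample_sizes = [100, 10_000, 1_000_000, 100_000_000]

p_values = []
for n in sample_sizes:
    # z statistic when the observed mean equals the true effect exactly
    z = effect * math.sqrt(n) / sigma
    p = two_sided_p(z)
    p_values.append(p)
    print(f"n = {n:>11,d}  z = {z:6.1f}  p = {p:.3g}")
```

The effect never changes, yet the p-value moves from clearly non-significant to effectively zero as n grows, which is why an effect size or interval estimate needs to be reported alongside it.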

The 2016 ASA statement

In response to growing concern about misuse, the American Statistical Association issued a formal statement in 2016, the first such pronouncement in the ASA's history, warning against mechanical use of p-values and calling for richer reporting practices. Six principles were articulated, including: "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."

In machine learning and AI

The machine-learning community has increasingly moved away from rigid p-value thresholds in favour of reporting effect sizes, confidence intervals, bootstrap distributions, and Bayesian posterior probabilities. The replication crisis in psychology and the social sciences (Open Science Collaboration, 2015) has reinforced this shift. Within ML benchmarking, papers are increasingly expected to report results across multiple seeds with confidence intervals rather than single point estimates.
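One way to report multi-seed results with an interval rather than a point estimate is a percentile bootstrap over the per-seed scores. The sketch below is one such approach, not a prescribed standard; the accuracy values are made up purely for illustration, and the function name and defaults are hypothetical.

```python
import random
import statistics

def bootstrap_mean_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical test accuracies from five training runs with different seeds.
accuracies = [0.812, 0.798, 0.825, 0.807, 0.819]

mean_acc = statistics.fmean(accuracies)
ci_low, ci_high = bootstrap_mean_ci(accuracies)
print(f"accuracy = {mean_acc:.3f} (95% bootstrap CI {ci_low:.3f} to {ci_high:.3f})")
```

With only a handful of seeds the bootstrap interval is itself rough, so it is best read as an indication of run-to-run variability rather than a precise bound.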

Nonetheless, p-values remain prevalent in clinical AI and pharmacovigilance, where regulatory frameworks (FDA, EMA) mandate randomised controlled trials with pre-specified Type I error rates and confirmatory hypothesis tests. Understanding both the mechanics and the limitations of p-values is therefore essential for any practitioner who must interpret or communicate statistical results to clinicians, regulators, or the public.

[Interactive figure: rejection region of a hypothesis test. Under the null hypothesis the test statistic has a known distribution; extreme values lie in a small tail, and the null is rejected when the observed statistic falls in that tail.]

Discussed in:

Chapter 6: ML Fundamentals, Statistical Inference
