The p-value is the probability, under an assumed null hypothesis $H_0$, of obtaining a test statistic at least as extreme as the one actually observed:
$$p \;=\; \Pr\bigl(T \geq t_{\text{obs}} \,\big|\, H_0\bigr)$$
for a one-sided (upper-tail) test, or $p = \Pr\bigl(|T| \geq |t_{\text{obs}}| \,\big|\, H_0\bigr)$ for a two-sided test whose null distribution is symmetric. A small p-value indicates that data at least this extreme would be unlikely if $H_0$ were true, and so counts as evidence against it. Conventionally, a p-value below the significance level $\alpha$ (typically 0.05) leads to rejection of the null.
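As an illustration of the definition, the following minimal sketch computes one-sided and two-sided p-values for a z-test of a mean. It assumes a normally distributed test statistic with known standard deviation; the function names and the numbers in the usage line are hypothetical and chosen only for the example.

```python
import math


def normal_sf(z: float) -> float:
    """Upper-tail probability Pr(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def z_test_p_values(sample_mean: float, mu0: float, sigma: float, n: int):
    """Return (z, one-sided p, two-sided p) for H0: mean == mu0, sigma known."""
    # Standardised test statistic: distance from mu0 in standard-error units.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    p_one_sided = normal_sf(z)             # Pr(T >= t_obs | H0)
    p_two_sided = 2.0 * normal_sf(abs(z))  # Pr(|T| >= |t_obs| | H0)
    return z, p_one_sided, p_two_sided


# Hypothetical data: 25 observations, sample mean 10.4, null mean 10.0, sigma 1.0.
z, p_one, p_two = z_test_p_values(sample_mean=10.4, mu0=10.0, sigma=1.0, n=25)
print(f"z = {z:.2f}, one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
```

Here the statistic lies two standard errors above the null mean, so the one-sided p-value is roughly 0.023 and the two-sided value is twice that.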
History
The p-value was popularised by Ronald A. Fisher's Statistical Methods for Research Workers (1925), which proposed the 0.05 threshold explicitly as a convention, not a law of nature. Fisher saw the p-value as a measure of evidence against the null in a single experiment. The competing Neyman–Pearson framework (1933) reinterpreted hypothesis testing as a decision procedure with controlled long-run Type I and Type II error rates. The hybrid taught in modern statistics courses awkwardly combines both.
Common misinterpretations
The p-value is one of the most misinterpreted concepts in all of statistics. Crucially, the p-value is not:
- The probability that the null hypothesis is true. That is the posterior probability $\Pr(H_0 \mid \text{data})$, which requires a prior and Bayes' theorem.
- The probability that the observed result is due to chance. Under $H_0$, all of the result is "due to chance" by definition.
- A measure of effect size. A p-value below 0.05 says nothing about whether the effect is large, small, or trivial.
- Strong evidence on its own. When many tests are run, some low p-values will occur by chance alone. This is the multiple-comparisons problem, and it also underlies the "garden of forking paths".
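The multiple-comparisons point in the last item can be made concrete with a short sketch. It assumes $m$ independent tests with every null hypothesis true, so that each p-value is uniform on $[0, 1]$, and compares the analytic chance of at least one false positive, $1 - (1 - \alpha)^m$, with a simulation. The function names are hypothetical, and the Bonferroni threshold $\alpha / m$ is shown only as one common correction, not as the recommended one.

```python
import random


def family_wise_error_rate(alpha: float, m: int) -> float:
    """Pr(at least one p < alpha) across m independent tests with all nulls true."""
    return 1.0 - (1.0 - alpha) ** m


def simulate_any_false_positive(alpha: float, m: int, n_experiments: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability.

    Under a true null with a continuous test statistic, each p-value is
    uniform on [0, 1], so it is drawn directly rather than simulating data.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        if any(rng.random() < alpha for _ in range(m)):
            hits += 1
    return hits / n_experiments


alpha, m = 0.05, 20
analytic = family_wise_error_rate(alpha, m)
simulated = simulate_any_false_positive(alpha, m, n_experiments=20_000)
bonferroni = family_wise_error_rate(alpha / m, m)  # per-test threshold alpha / m

print(f"uncorrected, analytic:  {analytic:.3f}")
print(f"uncorrected, simulated: {simulated:.3f}")
print(f"Bonferroni-corrected:   {bonferroni:.3f}")
```

With 20 tests at $\alpha = 0.05$, the chance of at least one spurious "significant" result is about 64%, which is why an isolated low p-value from a large batch of tests carries little weight.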
Statistical significance does not imply practical significance: with sufficiently large sample sizes, even a trivially small effect yields a vanishingly small p-value. A p-value of 0.04 is not meaningfully different from a p-value of 0.06, despite the former being "significant" and the latter not.
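The dependence on sample size can be sketched numerically. The example below assumes a two-sided z-test with known unit variance and a fixed, deliberately trivial true effect of 0.01 standard deviations, and for simplicity treats the observed mean as exactly equal to that effect. The effect size and the sample sizes are hypothetical.

```python
import math


def two_sided_p(effect_in_sd: float, n: int) -> float:
    """Two-sided z-test p-value when the observed mean equals effect_in_sd (sigma = 1)."""
    z = effect_in_sd * math.sqrt(n)             # statistic grows with sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * Pr(Z >= |z|)


effect = 0.01  # hypothetical effect: one hundredth of a standard deviation
sample_sizes = [100, 10_000, 100_000, 1_000_000]
p_values = [two_sided_p(effect, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"n = {n:>9,d}  p = {p:.3g}")
```

The effect size never changes, yet the p-value falls from clearly non-significant at $n = 100$ to vanishingly small at $n = 10^6$, which is why a p-value should be reported alongside an effect size rather than in place of one.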
The 2016 ASA statement
In response to growing concern about misuse, the American Statistical Association issued a formal statement in 2016, the first such pronouncement in the ASA's history, warning against mechanical use of p-values and calling for richer reporting practices. Six principles were articulated, including: "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold."
In machine learning and AI
The machine-learning community has increasingly moved away from rigid p-value thresholds in favour of reporting effect sizes, confidence intervals, bootstrap distributions, and Bayesian posterior probabilities. The replication crisis in psychology and the social sciences (Open Science Collaboration, 2015) has reinforced this shift. Within ML benchmarking, papers are increasingly expected to report results across multiple seeds with confidence intervals rather than single point estimates.
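One way to report results across seeds with an interval rather than a point estimate is a percentile bootstrap, sketched below. The accuracy values are made up for illustration, the function name is hypothetical, and the percentile method is only one of several bootstrap interval constructions. With very few seeds the interval is itself rough.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - confidence) / 2.0 * n_resamples)
    hi_idx = int((1.0 + confidence) / 2.0 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical test accuracies from eight training runs with different seeds.
scores = [0.812, 0.798, 0.805, 0.821, 0.790, 0.809, 0.815, 0.801]
mean_score = statistics.fmean(scores)
lo, hi = bootstrap_ci(scores)
print(f"mean accuracy = {mean_score:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

Reporting the mean together with the interval shows how much of a claimed improvement could be explained by seed-to-seed variation alone.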
Nonetheless, p-values remain prevalent in clinical AI and pharmacovigilance, where regulatory frameworks (FDA, EMA) mandate randomised controlled trials with pre-specified Type I error rates and confirmatory hypothesis tests. Understanding both the mechanics and the limitations of p-values is therefore essential for any practitioner who must interpret or communicate statistical results to clinicians, regulators, or the public.
Related terms: Hypothesis Testing, Confidence Interval, Bayesian Inference
Discussed in:
- Chapter 6: ML Fundamentals, Statistical Inference