Glossary

Hypothesis Testing

Hypothesis Testing provides a formal framework for making binary decisions under uncertainty. The framework begins with a null hypothesis $H_0$—typically the status quo or absence of effect—and an alternative hypothesis $H_1$ representing the claim to be evaluated. A test statistic is computed from the data and compared to its distribution under $H_0$. If the observed value falls in the rejection region—a pre-specified set of extreme values—we reject $H_0$ in favour of $H_1$; otherwise we fail to reject (which is weaker than accepting $H_0$).
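The mechanics above can be sketched with a two-sample t-test. This is an illustrative example, not drawn from the glossary: the data are synthetic, and it assumes NumPy and SciPy are available.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=50)  # group consistent with H0: mean 0
treated = rng.normal(loc=0.5, scale=1.0, size=50)  # synthetic true effect of 0.5

alpha = 0.05  # significance level: the Type I error rate we accept
t_stat, p_value = stats.ttest_ind(treated, control)

# The p-value locates the observed statistic in its distribution under H0;
# p < alpha means the statistic fell in the rejection region.
if p_value < alpha:
    print(f"reject H0 (t = {t_stat:.2f}, p = {p_value:.4f})")
else:
    print(f"fail to reject H0 (t = {t_stat:.2f}, p = {p_value:.4f})")
```

The same decision could be phrased by comparing `t_stat` to a critical value; comparing the p-value to $\alpha$ is equivalent and is what most libraries report.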

Two types of error are possible. A Type I error (false positive) rejects a true null hypothesis; its probability is the significance level $\alpha$, conventionally 0.05. A Type II error (false negative) fails to reject a false null hypothesis; its probability is $\beta$. The power of a test, $1 - \beta$, is the probability of correctly rejecting a false null. These concepts map directly onto the confusion matrix for ML classifiers.
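The error rates $\alpha$ and $1 - \beta$ can be estimated by simulation. The following sketch assumes a one-sided z-test with known $\sigma = 1$, $n = 25$, and a hypothetical true effect of 0.5; all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 25, 1.0
z_crit = 1.645  # one-sided 5% critical value of the standard normal

def rejection_rate(true_mean, trials=20_000):
    """Fraction of simulated datasets in which H0: mu = 0 is rejected."""
    samples = rng.normal(true_mean, sigma, size=(trials, n))
    z = samples.mean(axis=1) / (sigma / np.sqrt(n))
    return float((z > z_crit).mean())

type_I = rejection_rate(0.0)  # H0 true: rate should approximate alpha = 0.05
power = rejection_rate(0.5)   # H0 false: rate approximates 1 - beta
print(f"estimated Type I rate = {type_I:.3f}, estimated power = {power:.3f}")
```

When $H_0$ is true, rejections are false positives, so the first rate converges to $\alpha$; when the alternative holds, rejections are true positives, mirroring the confusion-matrix reading in the text.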

Common tests include the t-test, chi-squared test, F-test, and non-parametric alternatives such as the Wilcoxon rank-sum test and permutation tests. The multiple-testing problem arises when many hypotheses are tested simultaneously; corrections like Bonferroni (family-wise error rate) and Benjamini–Hochberg (false discovery rate) control inflated error rates. In AI, hypothesis testing is used to assess whether the performance difference between two models is statistically significant, and permutation tests are increasingly preferred for their minimal distributional assumptions.
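A paired permutation test for comparing two models can be sketched as follows. The per-example correctness arrays here are synthetic stand-ins for real evaluation results; under $H_0$ (no performance difference) the two models' outcomes on each test example are exchangeable, which is the only assumption the test needs.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
# 0/1 correctness of each model on the same n test examples (synthetic).
model_a = (rng.random(n) < 0.82).astype(int)
model_b = (rng.random(n) < 0.75).astype(int)

observed = model_a.mean() - model_b.mean()

# Under H0 the labels "model A" / "model B" are exchangeable per example,
# so we randomly swap each pair's outcomes and recompute the difference.
n_perm = 10_000
swaps = rng.random((n_perm, n)) < 0.5
diffs = np.where(swaps, model_b - model_a, model_a - model_b).mean(axis=1)

# Two-sided p-value: fraction of permuted differences at least as extreme
# as the one actually observed.
p_value = float((np.abs(diffs) >= abs(observed)).mean())
print(f"observed accuracy diff = {observed:.3f}, permutation p = {p_value:.4f}")
```

Because the null distribution is built from the data itself rather than from a parametric formula, this test makes no normality assumption, which is why permutation tests are attractive for model comparison.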

Related terms: P-value, Confidence Interval

Discussed in:

Also defined in: Textbook of AI