5.14 Causal Inference (Preview)

Statistical association is not causation. The mantra is so familiar that it is easy to forget how often it is violated, including by ML systems that appear to be making predictions but are silently relying on confounding structures that will not transfer.

Observational vs experimental

In an experimental study, the experimenter assigns the treatment (ideally at random); in an observational study, the analyst merely observes which units happened to receive which treatment. Random assignment is the gold standard because it severs the link between the treatment and any pre-existing confounders.

In ML, A/B tests are randomised experiments; offline analysis of historical logs is observational. The two require different tools.

Confounding

A confounder is a variable that influences both the treatment and the outcome. If $Z$ is a confounder of $T \to Y$, then the observational association between $T$ and $Y$ mixes the causal effect with the spurious association induced by $Z$. The standard fix is to adjust for $Z$ (by stratification, regression, matching, inverse probability weighting, or a structural causal model), but adjustment requires that you have measured $Z$ and that it is a genuine confounder, not a mediator or collider. Misidentifying the role of a covariate can introduce bias rather than remove it.
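As a sketch of one of these fixes, here is inverse probability weighting on synthetic data. All numbers (the confounder strength, the propensities, the true effect of 2.0) are invented for illustration; the propensity $P(T=1 \mid Z)$ is taken as known, whereas in practice it would be estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Confounder Z raises both the chance of treatment and the outcome.
z = rng.binomial(1, 0.5, n)
t = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = 2.0 * t + 3.0 * z + rng.normal(0, 1, n)   # true causal effect of T: 2.0

# Naive contrast: confounded, overstates the effect (about 3.8 here).
naive = y[t == 1].mean() - y[t == 0].mean()

# IPW: weight each unit by 1 / P(T = observed t | Z), which rebalances
# the confounder across the two treatment arms.
p = np.where(z == 1, 0.8, 0.2)                # known propensity P(T=1 | Z)
w = np.where(t == 1, 1 / p, 1 / (1 - p))
ipw = (np.average(y[t == 1], weights=w[t == 1])
       - np.average(y[t == 0], weights=w[t == 0]))
```

With this sample size the IPW estimate lands close to the true effect of 2.0, while the naive contrast does not.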

Simpson's paradox, worked example

A hospital evaluates two treatments, A and B, for kidney stones (Charig et al., 1986).

                  Treatment A         Treatment B
  Small stones    81/87    (93%)      234/270  (87%)
  Large stones    192/263  (73%)      55/80    (69%)
  Combined        273/350  (78%)      289/350  (83%)

Within each subgroup of stone size, A has a higher success rate than B. Combined, B has a higher success rate than A. This is Simpson's paradox, and its resolution is that severity (stone size) is a confounder: A is preferentially used on the harder cases (large stones) and B on the easier ones, so the combined comparison conflates treatment effect with case difficulty.
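The reversal can be checked directly from the counts in the table:

```python
# Success counts (successes, total) from the kidney-stone table above
# (Charig et al., 1986).
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Within each stratum of stone size, A beats B...
per_stratum = {
    size: {t: rate(*data[t][size]) for t in data}
    for size in ("small", "large")
}

# ...but pooling the strata reverses the ordering.
pooled = {
    t: rate(
        sum(s for s, _ in data[t].values()),
        sum(n for _, n in data[t].values()),
    )
    for t in data
}
```

Here `per_stratum` shows A ahead in both subgroups while `pooled` shows B ahead overall, which is exactly the pattern the text describes.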

In ML, Simpson's paradox arises every time a model is evaluated overall but deployed across heterogeneous subpopulations: a credit model with high overall accuracy may be markedly less accurate on a minority subgroup, an undesirable property masked by pooling. Subgroup analysis is essential for fair and reliable evaluation.

From statistics to causality

Modern causal inference, building on Pearl's structural causal models and the Neyman–Rubin potential outcomes framework, gives a formal language for asking and answering counterfactual questions: if we had assigned treatment differently, what would the outcome have been? Identifying causal effects from observational data requires assumptions (no unmeasured confounding, SUTVA, positivity) that must be argued for, not merely asserted.

A growing body of ML work (causal feature selection, double machine learning, causal forests, doubly robust estimators) bridges the two fields. The glossary entry on causal inference is the entry point; for full treatment see Hernán and Robins's Causal Inference: What If, and Pearl, Glymour, and Jewell's Causal Inference in Statistics: A Primer.

The takeaway for AI: a model trained to predict $Y$ from $X$ on observational data will rely on whatever associations make prediction easier, including spurious confounder-induced ones. Distribution shift breaks these associations, and the model degrades. Building robust, transferable systems requires causal thinking even when the deployed model is purely predictive.
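A minimal numpy sketch of this failure mode, with an invented data-generating process: $Y$ is caused by $x_1$ alone, while $x_2$ is merely associated with $Y$ in the training logs. Ordinary least squares happily leans on $x_2$, and when the spurious association breaks at deployment, the error explodes.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Causal structure: Y depends only on x1.
x1 = rng.normal(0, 1, n)
y = 2.0 * x1 + rng.normal(0, 1, n)

# x2 is spurious: strongly associated with Y in the training data
# (e.g. via a confounder), but not a cause of it.
x2 = y + rng.normal(0, 0.3, n)

# Fit ordinary least squares on both features.
X = np.column_stack([x1, x2, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
train_mse = np.mean((X @ coef - y) ** 2)

# Distribution shift: shuffle x2, preserving its marginal distribution
# but destroying its association with Y.
X_shift = np.column_stack([x1, rng.permutation(x2), np.ones(n)])
test_mse = np.mean((X_shift @ coef - y) ** 2)
```

The model achieves a tiny training error precisely because it exploits the spurious feature, and its test error under shift is an order of magnitude worse than a model that had used only the causal feature could achieve.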
