16.14 Bias and fairness

A loan-approval model trained on a decade of historical lending decisions does not learn the abstract concept of creditworthiness; it learns to predict which applicants the bank used to approve. If the bank's past decisions were shaped by redlining, by loan officers who weighted ethnic-sounding surnames against applicants, or by branch managers whose intuitions tracked their personal social circle, then the model inherits those preferences as its target. A face-recognition system trained on a corpus that is 80% lighter-skinned and 70% male will, with statistical inevitability, perform worse on darker-skinned women, because the loss function it minimised never required it to do otherwise. A hospital readmission risk model trained on prior healthcare spending will systematically underestimate the needs of patients who have historically had less access to care, because the proxy it uses for need is itself a measure of access. None of these failures requires malice from the engineer. Each falls out of fitting a model to data that records the world as it was rather than as we would like it to be.

This is the matter §16.14 takes up. §16.1 framed alignment as the general problem of getting a system to do what we want; bias and fairness are the specific instance in which the gap between what the system optimises and what we want has measurable demographic consequences. The framework here is older and tighter than the rest of the alignment literature: it inherits decades of work from statistics, employment law and credit-scoring regulation, and the mathematics, once stated, is unusually clean. The hard part, as the sections below will show, is not the algebra. It is that the algebra forces a choice the engineer would prefer not to make.

Symbols used here

  • $Y$: true label
  • $\hat Y$: prediction
  • $S$: risk score
  • $A$: protected attribute (race, gender, age)

Group fairness criteria

The simplest fairness criteria all have the same shape: pick a quantity that the model assigns to individuals, and require its conditional distribution to be invariant under the protected attribute. The choice of which quantity, and which conditioning, gives three definitions that look superficially similar and turn out to be deeply different.

Demographic parity asks that the rate at which the model issues positive decisions be the same across groups: $P(\hat Y = 1 \mid A = a) = P(\hat Y = 1 \mid A = b)$. If the model approves loans for 30% of white applicants, it should approve loans for 30% of black applicants. The criterion is intuitive (equal selection rates) and aligns with the legal concept of disparate impact in US employment law (the "four-fifths rule"). Its weakness is that it ignores the ground truth entirely. If the base rates of qualified applicants genuinely differ across groups (perhaps because of historical disinvestment that the model is not the right tool to fix), demographic parity forces the model to make worse decisions to satisfy the constraint.

Equal opportunity restricts the comparison to the qualified: $P(\hat Y = 1 \mid Y = 1, A = a) = P(\hat Y = 1 \mid Y = 1, A = b)$. The true-positive rate is equal across groups. Among applicants who would in fact repay the loan, the same proportion get approved regardless of group. This concedes the existence of group-level differences in $Y$ and asks only that the model not compound them by missing qualified candidates more often in one group than another. It is an attractive compromise when the cost of false negatives is the dominant fairness concern.

Equalised odds strengthens equal opportunity by requiring equal false-positive rates as well: the model should make the same kinds of mistakes at the same rates in each group. This matters when false positives have a real cost to the individual (pretrial risk assessment, fraud screening, child-welfare investigations), and it is the criterion most often used in the criminal-justice literature.

A fourth criterion, calibration, looks at the score rather than the decision: among individuals to whom the model assigns risk score $s$, the actual rate of positive outcomes should be the same across groups, $P(Y = 1 \mid S = s, A = a) = P(Y = 1 \mid S = s, A = b)$. A "70% chance of repayment" should mean 70% in every group. Calibration is what statisticians and actuaries usually meant by a fair score before the modern fairness literature began, and it is what most commercial vendors mean when they say their model is "validated".

It is worth noticing that these definitions disagree about whose perspective fairness is being defined from. Demographic parity speaks from the perspective of the population: equal selection across groups regardless of underlying outcomes. Equal opportunity and equalised odds speak from the perspective of the qualified or the unqualified: equal treatment among those who would have done well, or equal treatment among those who would not. Calibration speaks from the perspective of the individual receiving the score: my score should mean what the model says it means, regardless of which group I am in. Each definition encodes a different theory of what fair treatment is, and the formal mathematics that follows is downstream of that disagreement, not upstream of it.
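
To see how the criteria can disagree on the very same predictions, here is a minimal sketch on synthetic data; the group labels, base rates, scoring rule and threshold are all invented for illustration, not drawn from any real system.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
A = rng.integers(0, 2, n)                     # protected attribute: group 0 or 1
p = np.where(A == 0, 0.40, 0.25)              # differing base rates (illustrative)
Y = rng.random(n) < p                         # true label
S = np.clip(p + 0.5 * (Y - 0.5) + rng.normal(0, 0.15, n), 0, 1)  # noisy risk score
Yhat = S > 0.45                               # one shared decision threshold

for a in (0, 1):
    g = A == a
    print(f"group {a}: "
          f"selection={Yhat[g].mean():.2f} "   # demographic parity compares these
          f"TPR={Yhat[g & Y].mean():.2f} "     # equal opportunity compares these
          f"FPR={Yhat[g & ~Y].mean():.2f} "    # equalised odds adds these
          f"PPV={Y[g & Yhat].mean():.2f}")     # a coarse calibration check
```

A single shared threshold produces different selection rates, true-positive rates and false-positive rates in the two groups, and moving that one threshold cannot close every gap at once.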

The Kleinberg–Chouldechova impossibility

Two papers in 2016–17, Kleinberg, Mullainathan and Raghavan's Inherent Trade-Offs and Chouldechova's Fair Prediction with Disparate Impact, proved, in different formulations, that calibration and equalised odds cannot both hold whenever the base rate of $Y$ differs between groups and the model is imperfect. The two stated conditions are very weak: real-world base rates almost always differ between groups (because the world is structured by history), and no model is perfect. So the impossibility bites in nearly every realistic deployment.

Chouldechova's proof fits in a paragraph. Calibration fixes the positive predictive value across groups: among those the model flagged, the same fraction are actually positive. The standard relationship between PPV, base rate $p$, true-positive rate (TPR) and false-positive rate (FPR) is $\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\mathrm{TPR}$. If PPV and TPR are equal across groups but $p$ is not, then FPR cannot be equal. The same algebra forces FNR to differ. There is no escape inside the model: changing the threshold changes both sides of the equality at once.
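
The identity can be checked in a few lines. Hold PPV and TPR at the same values in two groups whose base rates differ (the numbers below are illustrative, not taken from any dataset) and the implied FPRs cannot agree:

```python
def fpr(base_rate, ppv, tpr):
    # FPR = p/(1-p) * (1-PPV)/PPV * TPR, from the relation above
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr

ppv, tpr = 0.6, 0.7              # held equal across both groups
for p in (0.3, 0.5):             # differing base rates
    print(f"base rate {p:.1f} -> FPR {fpr(p, ppv, tpr):.3f}")
# base rate 0.3 -> FPR 0.200
# base rate 0.5 -> FPR 0.467
```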

The practical implication is that the engineer must choose. Calibration, equalised odds and demographic parity are mutually inconsistent under realistic conditions, and there is no neutral default. Choosing calibration says, in effect, that the score should mean the same thing to each individual regardless of group; choosing equalised odds says that the model's errors should be distributed equally across groups; choosing demographic parity says that the social cost of unequal selection rates outweighs both. These are normative claims, not technical ones. The fairness literature's most uncomfortable contribution is the demonstration that the choice cannot be evaded by being more careful with the maths.

Worked example: COMPAS

ProPublica's Machine Bias investigation in 2016 examined COMPAS, a recidivism risk score used by US courts to inform pretrial detention decisions. The journalists obtained risk scores and two-year reoffence outcomes for around 7,000 defendants in Broward County, Florida, and compared error rates across racial groups. They found that black defendants who did not reoffend were classified as high risk at roughly twice the rate of white defendants who did not reoffend (an FPR disparity), and that white defendants who did reoffend were classified as low risk at roughly twice the rate of black defendants who did reoffend (an FNR disparity). The headline conclusion was that COMPAS was biased against black defendants.

Northpointe, the vendor, replied with a different statistic. Among defendants the model assigned a given risk score, the actual reoffence rate was approximately equal across races: a "high-risk" black defendant and a "high-risk" white defendant had roughly the same probability of reoffending. By the calibration criterion, the model was fair. The vendor's defence and the journalists' critique were both arithmetically correct. They were measuring different things, and as Chouldechova's paper made explicit shortly afterwards, the things they were measuring could not both be equalised across groups when the base rates of reoffence differed between them, which they did.

COMPAS thus became the textbook example of fairness as a value choice rather than a technical specification. Neither side was lying with statistics; each was insisting on a definition that mapped to a different moral intuition. ProPublica was implicitly arguing that the model should not make worse mistakes against one group; Northpointe was implicitly arguing that the score should mean the same thing across groups. The case still divides courts, criminologists and statisticians, and that disagreement is the point. A practitioner reading the case for the first time is often tempted to look for the resolving statistic that proves one side correct; the impossibility theorem is a polite way of saying that no such statistic exists.

Sources of bias

Locating the source of a fairness gap is a precondition for fixing it, because a mitigation designed for one source rarely transfers to another.

  • Sampling bias: the training set under-represents the deployment population. Buolamwini and Gebru's 2018 Gender Shades audit showed commercial face-classification systems with error rates of 0.8% on lighter-skinned men and 34.7% on darker-skinned women, traceable directly to the demographic skew of the training corpora. The fix is in dataset composition, not in the algorithm.
  • Label bias: the ground truth itself records historical discrimination. Amazon's CV-screening tool, scrapped in 2018, learned to penalise CVs containing the token "women's" because the historical hiring decisions it was trained on had mostly gone against the women who applied. The labels were not noise around a fair signal; they were the record of a biased process. No reweighting can recover labels that were never produced.
  • Feature bias: ostensibly neutral features correlate with protected attributes. Postcode encodes race and class in most countries; vocabulary encodes gender and age; medical-spending history encodes prior access to healthcare (Obermeyer et al. 2019). Removing the protected attribute does nothing if its proxies remain.
  • Model bias: the chosen architecture or loss differentially penalises some subpopulations. Aggregation bias (fitting one model across heterogeneous groups whose underlying relationships differ) is the canonical instance, and the fix is per-group or hierarchical models.
  • Deployment bias: a model accurate in development fails when the population, intent, or interface in production differs from the test set. A clinical risk score validated in academic teaching hospitals will perform differently in rural community clinics; the gap is rarely benign across demographics.

Suresh and Guttag's 2021 framework gives a fuller taxonomy. The lesson for the practitioner is that "the model is biased" is not a diagnosis. The diagnosis is which of these mechanisms is operating, because the mitigations differ.
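
In practice the diagnosis starts with a disaggregated audit: computing selection and error rates per subgroup rather than in aggregate. A minimal sketch, assuming a pandas frame with hypothetical `y`, `yhat` and `group` columns:

```python
import pandas as pd

def audit(df: pd.DataFrame) -> pd.DataFrame:
    """Per-group selection rate, TPR and FPR; a small n means unstable estimates."""
    rows = []
    for g, sub in df.groupby("group"):
        pos, neg = sub[sub.y == 1], sub[sub.y == 0]
        rows.append({
            "group": g,
            "n": len(sub),                # report n: tiny subgroups need caution
            "selection_rate": sub.yhat.mean(),
            "tpr": pos.yhat.mean() if len(pos) else float("nan"),
            "fpr": neg.yhat.mean() if len(neg) else float("nan"),
        })
    return pd.DataFrame(rows)
```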

Mitigation strategies

Three algorithmic families plus one procedural family cover most of the deployed work.

  • Pre-processing transforms or reweights the training data so that protected attributes are decorrelated from the label. Kamiran and Calders's reweighting scheme is the canonical example. The advantage is operational simplicity (the downstream training pipeline is unchanged), but information that genuinely predicts $Y$ in non-discriminatory ways can be lost in the transformation, hurting overall accuracy.
  • In-processing adds a fairness term to the loss function. Zafar and colleagues showed how to express an equalised-odds constraint in a form tractable for logistic regression, yielding a workable optimisation. The control over the metric is direct, but the constraint is wired into training, so changing the criterion later requires retraining.
  • Post-processing adjusts decision thresholds per group to equalise the chosen metric. The Hardt–Price–Srebro construction shows how to derive the optimal per-group thresholds for equalised odds from the predicted score distributions; a sketch of the threshold search follows this list. The technique is operationally cheap and works on top of an opaque base model, but it requires the protected attribute at inference time, which is often unavailable for legal reasons.
  • Process: stakeholder participation in metric selection, external audits with held-out subgroup data, model cards and datasheets that document training-set demographics, and post-deployment monitoring. None of these is a substitute for the algorithmic work, but the algorithmic work tends to fail silently in their absence.
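
A minimal sketch of the post-processing idea referenced above: choose a per-group score threshold so that each group hits roughly the same true-positive rate (the equal-opportunity case; the full Hardt–Price–Srebro construction for equalised odds additionally randomises between two thresholds per group, which is omitted here). The function names and target TPR are invented for illustration:

```python
import numpy as np

def threshold_for_tpr(scores, labels, target_tpr):
    """Threshold at which roughly `target_tpr` of this group's positives score above it."""
    positives = np.sort(scores[labels == 1])
    # The (1 - target_tpr) quantile of positive scores leaves about
    # target_tpr of the positives above the threshold.
    return np.quantile(positives, 1 - target_tpr)

def per_group_thresholds(scores, labels, groups, target_tpr=0.8):
    return {g: threshold_for_tpr(scores[groups == g], labels[groups == g], target_tpr)
            for g in np.unique(groups)}

# Usage sketch, with S, Y, A as score, label and group arrays:
#   thresholds = per_group_thresholds(S, Y, A, target_tpr=0.8)
#   decisions  = S >= np.array([thresholds[g] for g in A])
```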

What fairness can't fix

Group fairness metrics are statements about conditional probabilities, and they leave the causal question untouched. The natural fairness question ("would this individual have received the loan had they been a different race, holding everything else equal?") is counterfactual, and counterfactuals are not estimable from observational data without further assumptions. Kusner, Loftus, Russell and Silva's 2017 framework of counterfactual fairness, and the related causal-inference work of Kilbertus and colleagues, formalise this by requiring a structural causal model: a directed graph of the variables and the mechanisms that link them. Fairness is then defined as invariance of the prediction under intervention on the protected attribute and the variables it causes.
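
A toy structural causal model makes the intervention idea concrete. Everything below is invented for illustration: $A$ causally shifts a feature $X$, the outcome depends only on exogenous "merit" $U$, and a model fitted on $X$ alone still changes its prediction when we intervene on $A$ and push the change through the mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
A = rng.integers(0, 2, n)             # protected attribute
U = rng.normal(size=n)                # exogenous "merit", held fixed across worlds
X = 1.5 * A + U                       # assumed mechanism: A causally shifts X
Y = (U > 0).astype(float)             # outcome depends on U only, never on A

# "Fairness through unawareness": least-squares fit on X alone, never seeing A.
beta = np.linalg.lstsq(np.c_[np.ones(n), X], Y, rcond=None)[0]

def predict(x):
    return np.c_[np.ones(len(x)), x] @ beta

# Counterfactual world: intervene do(A := 1 - A), recompute X, hold U fixed.
X_cf = 1.5 * (1 - A) + U
gap = np.abs(predict(X) - predict(X_cf)).mean()
print(f"mean prediction change under intervention on A: {gap:.3f}")  # > 0: not fair
```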

The framework is illuminating. It shows precisely why "fairness through unawareness" (simply dropping the protected attribute from the feature set) is inadequate whenever the attribute exerts influence through other observed variables, which is almost always. It also shows why two systems with identical group-level statistics can be very different from the perspective of an individual. But the price is a structural causal model that few real deployments have, and on which reasonable experts disagree. Counterfactual fairness sharpens our thinking before it reliably guides our engineering.

There are also concerns the fairness literature handles awkwardly because they sit outside its frame. Intersectional bias (error patterns that affect, say, darker-skinned women specifically rather than darker-skinned people or women in general) fragments the data into subgroups too small for stable group-level statistics. Long-run feedback effects, in which a model's decisions change the population it later sees (predictive policing concentrates patrols and so concentrates recorded crime), are dynamic phenomena that static fairness criteria are not designed to capture. Distribution shift between training and deployment can quietly invert which group is advantaged. Fairness criteria are useful, necessary, and not sufficient.

What you should take away

  1. Bias in deployed models is the predictable result of fitting a model to historical data; it does not require a malicious engineer, and it does not go away on its own.
  2. The three workhorse group-fairness criteria (demographic parity, equal opportunity, equalised odds), plus calibration, are mutually inconsistent under realistic base rates and imperfect models; the engineer must choose, and the choice is normative.
  3. The COMPAS controversy is not resolved by better statistics. ProPublica and Northpointe were measuring different, equally valid, and provably incompatible quantities.
  4. Diagnose the source of bias (sampling, label, feature, model, deployment) before reaching for a mitigation; the mitigations are not interchangeable.
  5. Fairness metrics measure what the model does on average to groups. They do not answer the individual counterfactual question, and they do not substitute for participation, audit and ongoing monitoring of the deployed system.
