6.17 Putting it all together: a worked project

Suppose we are building a screening tool to predict which kidney-transplant candidates are at high risk of acute rejection in the first month. We have a retrospective dataset of 8,000 transplants, 600 of which experienced rejection (a 7.5% base rate). For each transplant we have pre-transplant features: HLA mismatch, recipient age, donor age, cold ischaemia time, panel reactive antibody, prior transplant, and 15 routine lab values. The label is binary: rejection within 30 days.

Step 1: define $T$, $P$, $E$

  • $T$: predict $p(\text{rejection within 30 days} \mid x)$ at the time the kidney is offered.
  • $P$: AUC-PR (because the class is imbalanced); secondarily, expected calibration error.
  • $E$: the 8,000 historical transplants from our centre, plus the public KIDMO registry (an additional 50,000 cases).
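
A useful reference point before modelling: the AUC-PR of a no-skill classifier equals the prevalence, so any score near 0.075 carries no signal. A minimal sketch demonstrating this with synthetic labels and random scores (not the real data):

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y = rng.random(8_000) < 0.075       # synthetic labels at the 7.5% base rate
scores = rng.random(8_000)          # uninformative "predictions"
print(average_precision_score(y, scores))  # ~0.075, the no-skill baseline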

Step 2: data audit

  • Are any features collected post-transplant? Yes, a "duration of induction immunosuppression" field, which is decided after the operation. Drop it (target leakage).
  • Are there multiple transplants per patient? Yes, about 5% of recipients have a previous transplant in the registry. Use group splitting on recipient_id so the same patient never lands in both training and test sets (see the sketch after this list).
  • Is the label complete? No, patients lost to follow-up before 30 days have an unknown outcome. Decide: label them "no rejection" (optimistic), drop them (potentially biased), or model them with survival analysis (best, but more complex).
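
A minimal sketch of these audit decisions. The column names (induction_duration_days, followup_days, rejection_30d, recipient_id) are hypothetical stand-ins for the registry's schema, and we take the "drop" option for incomplete labels:

import pandas as pd

df = pd.read_csv("transplants.csv")  # hypothetical file; one row per transplant

# target leakage: decided after the operation, so it cannot be a predictor
df = df.drop(columns=["induction_duration_days"])

# drop patients lost to follow-up before day 30 with no observed rejection
lost = (df["followup_days"] < 30) & (df["rejection_30d"] == 0)
df = df[~lost]

# groups for later splitting: all transplants from one recipient stay together
groups = df["recipient_id"]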

Step 3: hold-out

Split temporally. Use transplants from 2010–2019 for training and validation. Use 2022–2025 as the test set. This mimics the deployment setting and exposes any temporal drift in clinical practice.
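
A sketch of this split, assuming a transplant_date column in the audited DataFrame from Step 2:

import pandas as pd

year = pd.to_datetime(df["transplant_date"]).dt.year
train_val = df[year.between(2010, 2019)]   # training and validation
test = df[year.between(2022, 2025)]        # strictly later: temporal hold-out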

Step 4: pipeline

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

numeric = ["recipient_age", "donor_age", "cold_ischaemia_h", "pra",
           "creatinine", "potassium", "haemoglobin", "albumin"]  # plus 7 more
categorical = ["blood_type", "centre_id", "donor_type"]

pre = ColumnTransformer([
    # numeric: median imputation, then standardisation
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # categorical: mode imputation, then one-hot encoding; categories unseen
    # in training encode as all-zeros at test time instead of raising an error
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("ohe", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

clf = Pipeline([
    ("pre", pre),
    # class_weight="balanced" counters the 7.5% base rate, but it distorts
    # the predicted probabilities; recalibrated in Step 6
    ("logreg", LogisticRegression(class_weight="balanced",
                                  C=1.0, max_iter=2000)),
])

The imputer and the scaler live inside the pipeline so they are refit on the training portion of every CV fold; fitting them on the full dataset first would leak statistics from the validation folds into training.

Step 5: cross-validation

Use 5-fold GroupKFold on recipient_id for the inner loop and a temporal split for the outer "test." Tune C on a log-scale grid; the natural metric is AUC-PR.
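
A sketch of the inner loop, where X_train, y_train, and groups_train come from the 2010–2019 split; scoring="average_precision" is scikit-learn's estimate of AUC-PR:

import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold

search = GridSearchCV(
    clf,                                                # pipeline from Step 4
    param_grid={"logreg__C": np.logspace(-3, 2, 11)},   # log-scale grid
    scoring="average_precision",                        # AUC-PR estimate
    cv=GroupKFold(n_splits=5),
)
search.fit(X_train, y_train, groups=groups_train)
print(search.best_params_, search.best_score_)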

Step 6: calibration

Logistic regression trained with class_weight="balanced" is miscalibrated at the original 7.5% prevalence: the reweighting inflates the predicted probabilities. Fit Platt scaling on a held-out 10% of the training data (entirely separate from the test set) and report ECE on the test set.
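
A minimal Platt-scaling sketch: a one-dimensional logistic regression refit on the base model's log-odds over the calibration split. X_train, y_train, and X_test are the splits from Step 3 (with repeat recipients, a GroupShuffleSplit would be stricter here):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# hold out 10% of training data for calibration, stratified on the label
X_fit, X_cal, y_fit, y_cal = train_test_split(
    X_train, y_train, test_size=0.10, stratify=y_train, random_state=0)

clf.fit(X_fit, y_fit)

# Platt scaling: near-unregularised logistic fit on the model's log-odds
logit_cal = clf.decision_function(X_cal).reshape(-1, 1)
platt = LogisticRegression(C=1e6, max_iter=1000).fit(logit_cal, y_cal)

# calibrated probabilities for the test set
logit_test = clf.decision_function(X_test).reshape(-1, 1)
p_test = platt.predict_proba(logit_test)[:, 1]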

Step 7: comparison

Train a gradient-boosted-trees model (XGBoost) with the same preprocessing pipeline. Compare AUC-PR with paired bootstrap confidence intervals over the test set. Report both. Where they disagree, present the precision–recall curves and the calibration curves so clinicians can see the tradeoff.
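
A paired-bootstrap sketch for the difference in AUC-PR, assuming test-set labels y_test and per-model probabilities p_lr and p_xgb as NumPy arrays (p_lr is p_test from Step 6; p_xgb is hypothetical):

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = len(y_test)
diffs = []
for _ in range(2000):
    idx = rng.integers(0, n, n)   # same resample for both models: paired
    diffs.append(average_precision_score(y_test[idx], p_xgb[idx])
                 - average_precision_score(y_test[idx], p_lr[idx]))
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"AUC-PR difference (XGB minus LR), 95% CI: [{lo:.3f}, {hi:.3f}]")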

Step 8: honest reporting

Report AUC-PR, ECE, and a precision–recall curve with bootstrap bands. Note the 7.5% base rate. State the temporal split. List excluded patients and the reason. Submit code and trained model for independent reproduction.
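
ECE itself has no single canonical definition, so report the binning scheme alongside the number. A minimal equal-width-bin version, using y_test and the calibrated p_test from Step 6:

import numpy as np

def ece(y_true, p, n_bins=10):
    """Expected calibration error with equal-width probability bins."""
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # |mean predicted prob - observed event rate|, weighted by bin size
            total += mask.mean() * abs(p[mask].mean() - y_true[mask].mean())
    return total

print(f"test ECE: {ece(y_test, p_test):.3f}")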

The model is not finished; it must be evaluated prospectively on incoming patients before being deployed in the clinic. But the process to this point is the practice of honest applied ML.
