Glossary

Membership Inference Attacks

A membership inference attack (MIA) decides, given access to a trained model and a candidate record, whether that record was a member of the model's training set. A successful MIA is a direct violation of training-data privacy: confirming that Alice's medical record was in a hospital's training set discloses, at minimum, that Alice was a patient at that hospital, possibly with a particular condition implied by the dataset's purpose. The attack was formalised by Shokri, Stronati, Song and Shmatikov (2017) in Membership Inference Attacks Against Machine Learning Models.

Mechanism

Most MIAs exploit overfitting: a model assigns higher confidence (lower loss) to examples it was trained on than to similar but unseen examples. The attacker:

  1. Queries the target model on the candidate record and observes the loss or confidence score.

  2. Compares the score against a calibration reference: either a fixed empirical threshold, or the score distribution from shadow models trained on similar data with known membership.

  3. Outputs "member" if the candidate's confidence is anomalously high (equivalently, its loss anomalously low), as in the sketch below.
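
The following is a minimal sketch of steps 1-3 as a simple loss-threshold attack; the function names, the calibration data, and the quantile choice are illustrative assumptions, not the procedure of any particular paper.

  import numpy as np

  def loss_threshold_mia(target_loss_fn, candidate, nonmember_losses, alpha=0.05):
      # Step 1: query the target model and record the candidate's loss.
      candidate_loss = target_loss_fn(candidate)

      # Step 2: calibrate a threshold from losses of records known NOT to be
      # in the training set (the alpha-quantile of the non-member losses).
      threshold = np.quantile(nonmember_losses, alpha)

      # Step 3: an unusually low loss (high confidence) suggests membership.
      return "member" if candidate_loss < threshold else "non-member"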

More sophisticated variants such as LiRA (Carlini et al., 2022) fit per-example likelihood ratios and achieve much higher attack success at low false-positive rates.
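
A hedged sketch of the per-example likelihood-ratio idea, assuming the attacker has already trained shadow models and recorded the candidate's loss statistic under shadow models that included it ("in") and excluded it ("out"); the Gaussian fit and variable names here are illustrative, not the exact LiRA recipe.

  import numpy as np
  from scipy.stats import norm

  def lira_log_likelihood_ratio(observed_stat, in_stats, out_stats):
      # Fit a Gaussian to the candidate's statistic under each hypothesis.
      mu_in, sd_in = np.mean(in_stats), np.std(in_stats) + 1e-8
      mu_out, sd_out = np.mean(out_stats), np.std(out_stats) + 1e-8

      # Log-likelihood ratio: how much better the "member" hypothesis explains
      # the observed statistic than the "non-member" hypothesis.
      return norm.logpdf(observed_stat, mu_in, sd_in) - norm.logpdf(observed_stat, mu_out, sd_out)

  # Thresholding this score at different values traces out the attack's
  # true-positive / false-positive trade-off, which is how low-FPR success is reported.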

Application to LLMs

On large language models, MIAs have been demonstrated against fine-tuning datasets and, less reliably, against pre-training corpora. Shi et al. (2024) introduced the Min-K% test, which scores a candidate document by the average log-probability of its lowest-probability tokens; if even these tokens receive unusually high probability, that is a signature of memorisation. Practical attack success on frontier-scale pre-training corpora remains modest (above-random but well below ceiling), reflecting the de-duplication and limited overfitting of large training runs.
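
A rough sketch of the Min-K% idea, assuming the attacker can obtain per-token log-probabilities for the candidate document from the target model; the value of k and the decision threshold are assumptions the attacker would calibrate.

  import numpy as np

  def min_k_percent_score(token_logprobs, k=0.2):
      # Sort token log-probabilities so the least likely tokens come first.
      logprobs = np.sort(np.asarray(token_logprobs, dtype=float))
      n = max(1, int(len(logprobs) * k))
      # Average log-probability of the lowest-probability k fraction of tokens.
      return logprobs[:n].mean()

  # If even the document's least likely tokens receive unusually high probability,
  # i.e. the score exceeds a calibrated threshold, the document was plausibly
  # part of the training data.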

Defences

  • Differential privacy: training with DP-SGD provably bounds membership-inference success as a direct corollary of the DP guarantee (see the bound sketched after this list).

  • Reduced overfitting: regularisation, early stopping, and larger datasets all shrink the loss gap between members and non-members that the attack exploits.

  • Output limitation: return only the top class label rather than full confidence scores (this helps modestly but does not eliminate the attack).

  • Knowledge distillation: train a student model on the teacher's outputs; the student is generally less vulnerable than the original teacher.
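
To make the first bullet concrete: under the standard hypothesis-testing reading of (epsilon, delta)-differential privacy, any membership test's true-positive rate is bounded by exp(epsilon) times its false-positive rate, plus delta. The sketch below simply evaluates that bound; it is not tied to any particular DP-SGD implementation.

  import math

  def mia_tpr_upper_bound(epsilon, delta, fpr):
      # Any attacker's TPR at false-positive rate `fpr` is at most
      # exp(epsilon) * fpr + delta, a direct corollary of the DP guarantee.
      return min(1.0, math.exp(epsilon) * fpr + delta)

  # Example: with epsilon = 1 and delta = 1e-5, an attack operating at 0.1% FPR
  # achieves at most roughly 0.27% TPR, however it is constructed.
  print(mia_tpr_upper_bound(1.0, 1e-5, 0.001))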

Status and policy

MIAs underpin much of the legal argument that training-data privacy obligations apply to ML systems. The EU AI Act, GDPR Article 22, and HIPAA have all been invoked in cases turning on what counts as "personal data processing" by a trained model; MIA results are the most direct empirical demonstration that a model does process the personal data of identifiable individuals.

References

  • Shokri et al. (2017). Membership Inference Attacks Against Machine Learning Models.

  • Carlini et al. (2022). Membership Inference Attacks From First Principles (LiRA).

  • Shi et al. (2024). Detecting Pretraining Data from Large Language Models (Min-K%).

Related terms: Privacy in ML, Differential Privacy, Model Stealing / Distillation Attacks, Data Poisoning
