16.15 Privacy and data protection
Machine learning models trained on personal data do not merely learn statistical patterns. They also memorise. A language model trained on medical notes can reproduce a patient's discharge summary almost verbatim if prompted with a sufficiently distinctive opening phrase. A face-recognition network trained on a hospital's staff directory can leak whether a particular employee was in the training set, because the network is just slightly more confident on faces it has already seen. A credit-scoring model trained on loan histories carries within its weights enough trace of every applicant for an attacker, given query access, to reconstruct individual records with disquieting accuracy. The training data does not vanish into anonymity; it is folded into the parameters in ways that often remain partially recoverable.
This is not a hypothetical concern. Carlini and colleagues have repeatedly demonstrated that production language models reproduce verbatim chunks of training corpora, including names, addresses, telephone numbers and software keys. Healthcare deployments are subject to the Health Insurance Portability and Accountability Act in the United States, the General Data Protection Regulation in the European Union, and the Health Information Privacy Code in New Zealand. Each requires data controllers to demonstrate that personal information has not been disclosed beyond the purposes for which consent was given. A model that quietly memorises records and leaks them under query is, in a regulatory sense, a disclosure mechanism.
The previous section, on bias and fairness, examined how models can systematically wrong groups of people through skewed predictions. This section turns to a different harm: the leakage of information about specific individuals through the model itself. Differential privacy, originally proposed by Cynthia Dwork and colleagues in 2006, provides the dominant formal framework for limiting that leakage, and DP-SGD (Abadi et al. 2016) is the standard mechanism for training neural networks under such guarantees.
Differential privacy: definition
A randomised algorithm $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private if, for any two neighbouring datasets $\mathcal{D}$ and $\mathcal{D}'$ that differ in exactly one record, and for any measurable subset $S$ of the output space,
$$ P(\mathcal{M}(\mathcal{D}) \in S) \le e^\epsilon \cdot P(\mathcal{M}(\mathcal{D}') \in S) + \delta. $$
In plain English: removing or adding any single individual changes the probability of any output by at most a factor of $e^\epsilon$, with an additive slack of $\delta$. The smaller $\epsilon$ is, the more nearly identical the two output distributions must be, and the harder it becomes for any observer to tell whether your record was in the input. The parameter $\delta$ is conventionally set to a cryptographically small value, well below $1/n$ where $n$ is the dataset size, with $1/n^2$ a common choice, so that the slack is overwhelmingly unlikely to dominate.
The strength of the definition lies in what it does not assume. It makes no claim about the attacker's auxiliary knowledge, computational resources, or strategy. An adversary who knows everyone in the dataset apart from one person, has unlimited compute, and is allowed to query the model arbitrarily, still cannot reliably tell whether the one missing person was in or out. The definition gives a worst-case bound that holds against every conceivable attack.
A useful way to read $\epsilon$ is as a privacy loss budget. If $\epsilon = 0.1$, the output odds shift by at most about ten per cent depending on whether you are in or out, which is barely detectable. If $\epsilon = 1$, the odds may shift by a factor of $e \approx 2.7$, which is meaningful but bounded. If $\epsilon = 10$, the odds may shift by a factor of $e^{10} \approx 22{,}000$, which provides almost no protection in practice. Operating points of $\epsilon \in [0.5, 4]$ are typical in the literature; values above eight are sometimes reported but do not constitute strong privacy in any meaningful sense. The choice of $\epsilon$ is, in the end, a policy decision rather than a technical one.
Differential privacy composes gracefully. Running an $\epsilon_1$-DP analysis followed by an $\epsilon_2$-DP analysis on the same data yields, in the worst case, an $(\epsilon_1 + \epsilon_2)$-DP system. Tighter accountants, such as the moments accountant or Rényi differential privacy, give substantially better composition for repeated Gaussian mechanisms, which is the regime that DP-SGD inhabits.
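To see what tighter accounting buys, the following sketch compares basic composition with the advanced composition theorem of Dwork, Rothblum and Vadhan for $k$ repetitions of an $\epsilon$-DP mechanism; the Rényi accountants used in practice are tighter still. The numbers are illustrative.

```python
import numpy as np

def basic_composition(eps, k):
    """Basic composition: privacy losses simply add across k analyses."""
    return k * eps

def advanced_composition(eps, k, delta_prime):
    """Advanced composition (Dwork, Rothblum and Vadhan): k runs of an
    eps-DP mechanism are (eps', k*delta + delta_prime)-DP with this eps'."""
    return np.sqrt(2 * k * np.log(1 / delta_prime)) * eps \
        + k * eps * (np.exp(eps) - 1)

k, eps = 1000, 0.01
print(basic_composition(eps, k))                       # 10.0
print(advanced_composition(eps, k, delta_prime=1e-5))  # ~1.6
```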
Mechanisms
The two foundational mechanisms convert any deterministic function into a private one by adding calibrated noise. For a real-valued query $f$ over datasets, the relevant quantity is the sensitivity $\Delta f = \max_{\mathcal{D} \sim \mathcal{D}'} |f(\mathcal{D}) - f(\mathcal{D}')|$, where the maximum ranges over neighbouring dataset pairs: the largest amount by which the output can change when a single record is altered.
The Laplace mechanism delivers $\epsilon$-DP, the strongest form. Release $f(\mathcal{D}) + \mathrm{Lap}(\Delta f / \epsilon)$, where $\mathrm{Lap}(b)$ is a sample from the Laplace distribution with scale $b$. For counting queries with sensitivity one, this means adding noise of standard deviation $\sqrt{2}/\epsilon$, roughly one-and-a-half people's worth of fuzz at $\epsilon = 1$. For histograms, releasing each bin count with independent Laplace noise of scale $1/\epsilon$ gives a private histogram suitable for downstream analysis.
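As a concrete illustration, here is a minimal numpy sketch of the private histogram just described; the bin counts are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def private_histogram(counts, eps):
    """Release a histogram under eps-DP. Adding or removing one record
    changes one bin by one, so each bin gets independent Laplace noise
    of scale 1/eps."""
    return counts + rng.laplace(loc=0.0, scale=1.0 / eps, size=len(counts))

true_counts = np.array([120.0, 45.0, 310.0, 8.0])
print(private_histogram(true_counts, eps=1.0))
```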
The Gaussian mechanism delivers the slightly weaker $(\epsilon, \delta)$-DP but composes better and extends naturally to vector-valued queries. For a query with $\ell_2$-sensitivity $\Delta_2 f$, releasing $f(\mathcal{D}) + \mathcal{N}(0, \sigma^2 I)$ with
$$ \sigma = \frac{\Delta_2 f \sqrt{2 \ln(1.25/\delta)}}{\epsilon} $$
is $(\epsilon, \delta)$-DP. The Gaussian mechanism is the workhorse of deep-learning DP because gradients are vectors and tracking $\ell_2$ sensitivity is far more natural than tracking $\ell_1$ sensitivity component-wise.
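In code, the calibration reads as follows; note that the classic analysis behind this formula is valid for $\epsilon \le 1$, and tighter calibrations exist for larger budgets. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mechanism(value, l2_sensitivity, eps, delta):
    """Release a (possibly vector-valued) query under (eps, delta)-DP
    using the classic Gaussian-mechanism calibration; the classic
    proof requires eps <= 1."""
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return value + rng.normal(0.0, sigma, size=np.shape(value))

g = np.array([0.3, -1.2, 0.7])  # e.g. a clipped gradient with ||g||_2 <= 1
print(gaussian_mechanism(g, l2_sensitivity=1.0, eps=1.0, delta=1e-5))
```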
Both mechanisms have a common structure. Bound the per-record influence; add noise of scale proportional to that bound divided by the privacy budget. The smaller the budget, the larger the noise, the more degraded the answer. A useful exercise is to compute the noise required to release the mean systolic blood pressure of a cohort of one thousand patients at $\epsilon = 1$. With per-record sensitivity bounded by, say, 100 mmHg over the dataset size, the Laplace noise has scale 0.1, which is utterly negligible compared with the natural variability in the population. Aggregate statistics are cheap to release privately. Per-individual queries are not, and that asymmetry is the root of why DP works at all.
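The exercise in code, with a synthetic cohort; the 100 mmHg influence bound and the population parameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

n, eps = 1000, 1.0
record_range = 100.0             # assumed bound on one patient's influence, in mmHg
sensitivity = record_range / n   # one patient moves the mean by at most 0.1
scale = sensitivity / eps        # Laplace scale: 0.1

bp = rng.normal(130.0, 15.0, size=n)          # synthetic systolic readings
private_mean = bp.mean() + rng.laplace(0.0, scale)
# The added noise (std ~0.14 mmHg) is tiny next to the ~0.5 mmHg
# standard error of the sample mean itself.
print(bp.mean(), private_mean)
```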
DP-SGD
Abadi, Chu, Goodfellow and colleagues proposed differentially private stochastic gradient descent in 2016. It remains the standard private-training algorithm. Each step of DP-SGD performs the following operations (a code sketch follows the list):
- Sample a minibatch by Poisson sampling: each example is included independently with probability $q = B/n$, where $B$ is the expected batch size and $n$ is the dataset size. Poisson sampling is essential for the privacy amplification analysis; uniform random batches give weaker bounds.
- Compute per-example gradients $g_i = \nabla_\theta \ell(\theta, x_i)$ rather than batch-averaged gradients. This is the most computationally awkward part of DP-SGD and the reason for libraries such as Opacus, JAX's vmap and PyTorch's functional gradients.
- Clip each gradient to $\ell_2$-norm at most $C$, using $\tilde{g}_i = g_i \cdot \min(1, C / \|g_i\|_2)$. The clipping bound $C$ is the per-record sensitivity. Without clipping, sensitivity is unbounded and the privacy guarantee collapses.
- Sum the clipped gradients and add Gaussian noise of scale $\sigma C$, giving $\tilde{G} = \sum_i \tilde{g}_i + \mathcal{N}(0, \sigma^2 C^2 I)$. This is a single Gaussian mechanism instance with sensitivity $C$.
- Average and update: $\theta \leftarrow \theta - \eta \tilde{G}/B$.
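Putting the steps together, here is a deliberately naive PyTorch sketch of a single DP-SGD update. Poisson sampling of the batch is assumed to have happened upstream, and all names are illustrative; production code would vectorise the per-example gradient loop, as Opacus does.

```python
import torch

def dp_sgd_step(model, loss_fn, data, targets, lr, C, sigma, expected_batch):
    """One DP-SGD update: per-example gradients, l2 clipping at C,
    Gaussian noise of scale sigma * C on the summed gradient, then
    average and descend."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(data, targets):                       # per-example gradients
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, C / (norm.item() + 1e-12))       # clip to l2 norm <= C
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)

    with torch.no_grad():
        for p, s in zip(params, summed):
            noised = s + torch.randn_like(s) * sigma * C  # Gaussian mechanism
            p.sub_(noised, alpha=lr / expected_batch)     # average and update
```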
The privacy analysis tracks the cumulative budget across iterations using the moments accountant or, more commonly today, Rényi differential privacy (Mironov 2017). For a typical training run of $T$ steps with sampling rate $q$ and noise multiplier $\sigma$, the resulting guarantee scales roughly as $\epsilon = O(q \sqrt{T \ln(1/\delta)} / \sigma)$. Doubling the noise halves the privacy loss; doubling the number of steps multiplies it by about $\sqrt{2}$.
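In practice one delegates this bookkeeping to a library. A sketch assuming the RDPAccountant API of Opacus (version 1.x); the parameter values are illustrative:

```python
from opacus.accountants import RDPAccountant

q, sigma, T, delta = 0.01, 1.1, 10_000, 1e-5

accountant = RDPAccountant()
for _ in range(T):                             # one entry per DP-SGD step
    accountant.step(noise_multiplier=sigma, sample_rate=q)

print(accountant.get_epsilon(delta=delta))     # cumulative epsilon at this delta
```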
The trade-off between accuracy and privacy is real and substantial. On CIFAR-10, training a ResNet to about ninety-five per cent accuracy is routine without privacy; at $\epsilon = 8$ accuracy falls to roughly seventy-five per cent, and at $\epsilon = 1$ it can drop below sixty per cent. On language modelling, recent work has narrowed the gap considerably through pre-training on public data followed by private fine-tuning, but a five-to-fifteen-percentage-point degradation under strong DP remains the rule of thumb. Choosing larger batches, careful clipping bounds and longer schedules can recover some of the loss; the underlying tension between signal-to-noise and privacy budget cannot be eliminated.
Three practical points are worth emphasising. First, the clipping bound $C$ is a hyperparameter, not a property of the data. Setting it too small destroys signal; setting it too large bloats the noise that must be added to keep sensitivity matched. A common heuristic is to monitor the median per-example gradient norm during a non-private warm-up and set $C$ near that median. Second, per-example gradient computation is the dominant cost of DP-SGD; libraries such as Opacus exploit functorch and vectorised maps to recover most of the throughput, but a two- to four-fold slowdown remains typical. Third, the choice of optimiser matters: momentum and Adam-style methods carry private state across steps, and the accountant must include each step's contribution. Methods such as DP-Adam reset the second-moment estimates carefully to avoid leakage through the optimiser's running statistics.
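The warm-up heuristic for $C$ is easy to express in code. A sketch, reusing the per-example gradient loop from above; in a real pipeline this would run on a non-private model over a few batches:

```python
import torch

def median_grad_norm(model, loss_fn, data, targets):
    """Median per-example gradient l2 norm over a warm-up batch: a
    common heuristic starting point for the clipping bound C."""
    params = [p for p in model.parameters() if p.requires_grad]
    norms = []
    for x, y in zip(data, targets):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norms.append(torch.sqrt(sum(g.pow(2).sum() for g in grads)).item())
    return float(torch.tensor(norms).median())
```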
Membership inference
The simplest and most-studied privacy attack is membership inference. The attacker is given black-box access to a trained model and a candidate record; the task is to decide whether that record was in the training set. Even moderately overfit models leak this information, because they tend to be more confident on training examples than on held-out ones.
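A minimal version of the attack, the loss-threshold variant due to Yeom and colleagues rather than the stronger attacks discussed below, needs only per-example losses; the loss distributions here are synthetic stand-ins for an overfit model:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def loss_threshold_mia_auc(train_losses, heldout_losses):
    """AUC of the simplest membership inference attack: predict
    'member' when the model's loss on the record is low."""
    losses = np.concatenate([train_losses, heldout_losses])
    member = np.concatenate([np.ones(len(train_losses)),
                             np.zeros(len(heldout_losses))])
    return roc_auc_score(member, -losses)      # negate: low loss => member

rng = np.random.default_rng(0)
train = rng.gamma(2.0, 0.10, size=5000)        # overfit: small training losses
heldout = rng.gamma(2.0, 0.25, size=5000)      # larger held-out losses
print(loss_threshold_mia_auc(train, heldout))  # well above 0.5
```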
Shokri and colleagues demonstrated the attack in 2017 by training shadow models on shadow datasets drawn from the same distribution and learning a classifier that distinguishes training behaviour from test behaviour by output distribution. Carlini, Tramèr and colleagues showed in 2022 that even well-regularised models leak: their LiRA attack, which calibrates per-example log-likelihood ratios across many shadow models, achieves AUC values above 0.9 on undefended CIFAR-10 classifiers. For language models, the analogous attack succeeds on training sentences that are sufficiently distinctive, with extraction rates that scale alarmingly with model size.
Differential privacy provides a formal upper bound on the success probability of any membership inference adversary. Specifically, no attacker, irrespective of their auxiliary information or compute, can achieve a true-positive rate exceeding $e^\epsilon \cdot \mathrm{FPR} + \delta$ at any chosen false-positive rate. At $\epsilon = 1$, an attacker operating at a false-positive rate of one per cent can achieve a true-positive rate of no more than about $2.7$ per cent, only modestly above the chance level of one per cent. The bound is uniform across all records and all attack strategies; it is the strongest guarantee any defence can offer.
In practice, membership inference is now used as an empirical privacy audit. Even when DP-SGD is applied, running a state-of-the-art membership inference attack against the final model and reporting its AUC gives a concrete, empirically testable number that complements the formal $\epsilon$.
Federated learning
Federated learning is a complementary architectural choice rather than a privacy mechanism per se. Instead of centralising raw data on a server, training is distributed across clients (phones, hospitals, banks), each of which computes gradient updates on its local data and transmits only those updates. A coordinating server aggregates updates into a global model, which is then redistributed for the next round. McMahan and colleagues introduced federated averaging in 2017; it now powers Google's Gboard keyboard prediction and several Apple on-device features.
Federated learning alone is not private. Geiping and colleagues showed in 2020 that an honest-but-curious server can reconstruct individual training images, often pixel-perfectly, from a single gradient update on small models. The intuition is straightforward: the gradient is a deterministic function of the training example, and for a fully connected layer the weight gradient is an outer product of the back-propagated error with the layer input, so on small networks the mapping from example to gradient is often invertible. To obtain meaningful privacy, federated systems must combine federated averaging with secure aggregation (Bonawitz et al. 2017), a cryptographic protocol ensuring that the server learns only the sum of client updates, and with DP noise injected at the client side under DP-FedAvg (McMahan et al. 2018). The combination yields user-level differential privacy, in which a single client's entire dataset is the unit of protection.
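The client-side recipe is compact. A sketch of one DP-FedAvg-style round under assumed names; a real deployment would wrap the summation in secure aggregation so the server never sees individual updates:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_fedavg_round(global_weights, client_updates, clip, sigma):
    """One aggregation round with client-level DP: clip each client's
    whole update to l2 norm <= clip, sum, add Gaussian noise calibrated
    to the clip bound, then average into the global model."""
    clipped = [u * min(1.0, clip / (np.linalg.norm(u) + 1e-12))
               for u in client_updates]
    total = np.sum(clipped, axis=0)
    total += rng.normal(0.0, sigma * clip, size=total.shape)
    return global_weights + total / len(client_updates)
```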
Federated learning is most appropriate where data cannot be centralised for legal, contractual or product reasons. It is not a substitute for differential privacy; it is a deployment pattern that reduces centralisation risk and pairs naturally with DP.
Where privacy matters
- Healthcare: HIPAA in the United States, GDPR in the European Union, and the Health Information Privacy Code in New Zealand all impose strict obligations on data controllers. A model deployed on hospital records is, in practice, a release of information derived from those records, and regulators have begun treating model outputs as disclosures. Differential privacy gives a defensible technical answer.
- Financial services: PCI-DSS for payment data and the Basel-aligned model risk frameworks for credit decisions both require that models do not leak customer-identifying information. Membership inference on a fraud-detection model can reveal whether a specific transaction was flagged for review.
- Large language models: training corpora include personal correspondence, medical notes, government filings and source code containing secrets. Carlini and colleagues have repeatedly demonstrated that frontier models reproduce such content verbatim under appropriate prompts. Differentially private fine-tuning on user data is now an active production concern at the major labs.
- Public-sector and regulated AI: the EU AI Act's high-risk categories, including law enforcement, immigration and education, attach data-protection requirements to model deployment. Auditable privacy guarantees are increasingly part of conformity assessment.
What you should take away
- Models memorise. Training on personal data without explicit privacy controls leaves individually identifiable traces in the parameters that can be recovered through query access.
- Differential privacy is the dominant formal framework. An $(\epsilon, \delta)$-DP algorithm bounds, by $e^\epsilon$ multiplicatively and $\delta$ additively, how much any single record can change the output distribution; the bound holds against any attacker with any auxiliary knowledge.
- DP-SGD is the standard private training procedure. Its five steps (Poisson sampling, per-example gradients, $\ell_2$ clipping at $C$, Gaussian noise of scale $\sigma C$, average and update) must all be present, and the privacy budget is tracked across iterations by an accountant such as RDP.
- The accuracy cost is real but manageable. Strong DP at $\epsilon \approx 1$ typically degrades accuracy by five to fifteen percentage points on standard benchmarks. Public pre-training followed by private fine-tuning narrows the gap considerably.
- Federated learning and secure aggregation reduce centralisation risk but do not by themselves provide privacy. To obtain formal guarantees in distributed settings, combine federated averaging with client-side DP noise. Privacy is a design property of the whole pipeline, not a feature toggled on at the end.