Differential privacy (DP), introduced by Dwork, McSherry, Nissim and Smith (2006), is a mathematical definition of privacy that bounds how much an individual's presence or absence in a dataset can change the output of an analysis. It is the de-facto standard adopted by the U.S. Census Bureau (2020), Apple, Google, Meta, and used in privacy-preserving deep learning via the DP-SGD algorithm (Abadi et al., 2016).
Definition. A randomised mechanism $\mathcal{M}: \mathcal{X}^n \to \mathcal{Y}$ is $(\epsilon, \delta)$-differentially private if for any two neighbouring datasets $D, D'$ (differing in a single record) and any measurable $S \subseteq \mathcal{Y}$,
$$\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta.$$
The privacy budget $\epsilon$ controls how much the output distribution can shift; smaller is more private. $\delta$ is the failure probability and is typically set to $\delta \ll 1/n$. Pure $\epsilon$-DP corresponds to $\delta = 0$.
Key properties.
- Post-processing. Any function of the output is at least as private: if $f$ is data-independent then $f(\mathcal{M}(D))$ is also $(\epsilon, \delta)$-DP.
- Composition. Releasing $k$ independent $(\epsilon_i, \delta_i)$-DP outputs gives $(\sum_i \epsilon_i, \sum_i \delta_i)$-DP via basic composition; tighter bounds come from advanced composition or Rényi DP.
- Group privacy. $(\epsilon, \delta)$-DP for one record extends to $(k\epsilon, k e^{(k-1)\epsilon} \delta)$ for $k$ correlated records.
Standard mechanisms.
- Laplace mechanism. For a query $f$ with $\ell_1$ sensitivity $\Delta_1 f = \max_{D \sim D'} \|f(D) - f(D')\|_1$, release $f(D) + \mathrm{Lap}(\Delta_1 f / \epsilon)^d$ to achieve $\epsilon$-DP.
- Gaussian mechanism. With $\ell_2$ sensitivity $\Delta_2 f$, release $f(D) + \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I})$ with $\sigma \ge \Delta_2 f \cdot \sqrt{2 \ln(1.25/\delta)} / \epsilon$ to achieve $(\epsilon, \delta)$-DP.
DP-SGD for deep learning. A single SGD step's sensitivity is unbounded because gradients can be arbitrarily large. Abadi et al. (2016) clip per-example gradients then add noise:
- Sample. Take a Poisson-sampled mini-batch with rate $q = B/n$.
- Per-example gradient. Compute $\mathbf{g}_i = \nabla_\theta \mathcal{L}_i(\theta)$ for each sample $i$.
- Clip. Project each gradient to a $\ell_2$-ball of radius $C$: $$\bar{\mathbf{g}}_i = \mathbf{g}_i \cdot \min\!\left(1, \; \frac{C}{\|\mathbf{g}_i\|_2}\right).$$
- Aggregate and noise. $$\tilde{\mathbf{g}} = \frac{1}{B}\!\left( \sum_{i} \bar{\mathbf{g}}_i \;+\; \mathcal{N}\!\big(\mathbf{0}, \sigma^2 C^2 \mathbf{I}\big) \right).$$
- Update. $\theta \leftarrow \theta - \eta \tilde{\mathbf{g}}$.
The moments accountant (or its tighter successor Rényi DP accountant) tracks the cumulative privacy loss across training steps, exploiting privacy amplification by sub-sampling: a $q$-rate sub-sample of an $\epsilon$-DP mechanism is roughly $q\epsilon$-DP for small $\epsilon$.
Privacy–utility trade-off. Stronger privacy (smaller $\epsilon$, larger $\sigma$) means noisier gradients and lower accuracy. Practical recommendations:
- $\epsilon \le 1$: strong, often $\sim 5$–$15$ percentage points accuracy loss versus non-private training on ImageNet.
- $\epsilon \approx 3$–$10$: moderate, common in production.
- $\epsilon > 10$: weak guarantees, mostly heuristic.
Local DP. Noise is added on the client before any data leaves it, giving stronger guarantees but at the cost of more accuracy. Used in Apple's keyboard typing telemetry and Google's RAPPOR.
Why it matters. DP gives a principled, composable, worst-case guarantee against arbitrary post-hoc inference attacks, including membership-inference and training-data extraction attacks demonstrated against undefended LLMs. Combined with federated learning it forms a defensible privacy stack for sensitive domains such as healthcare and finance.
Related terms: Federated Learning, Gaussian Distribution, Gradient Descent
Discussed in:
- Chapter 15: Modern AI, Privacy