Adversarial training is the most reliable empirical defence against adversarial examples. Formalised by Madry et al. (2018), it casts robust learning as a saddle-point problem: minimise the worst-case loss over a perturbation budget rather than the average loss on clean inputs.
Min-max objective. Given a model $f_\theta$, training distribution $\mathcal{D}$, perturbation set $\mathcal{S} = \{\boldsymbol{\delta} : \|\boldsymbol{\delta}\|_p \le \epsilon\}$, and loss $\mathcal{L}$,
$$\min_\theta\; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \!\left[\, \max_{\boldsymbol{\delta} \in \mathcal{S}} \mathcal{L}\big(\theta, \mathbf{x} + \boldsymbol{\delta}, y\big) \right].$$
The inner maximisation has no closed-form solution but is well approximated by a strong attack, typically PGD (projected gradient descent).
Algorithm.
- Sample mini-batch $\{(\mathbf{x}_i, y_i)\}$.
- For each $\mathbf{x}_i$, run PGD-$K$ to produce an adversarial example $\mathbf{x}_i' = \mathbf{x}_i + \boldsymbol{\delta}_i^*$.
- Compute the loss on adversarial inputs: $\frac{1}{B} \sum_i \mathcal{L}(\theta, \mathbf{x}_i', y_i)$.
- Back-propagate and update $\theta$ via SGD or Adam.
The cost is roughly $K{+}1$ forward/backward passes per training step, where $K \approx 7$ on CIFAR-10 or $K \approx 3$ on ImageNet to keep training tractable.
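A minimal PyTorch sketch of one such training step under the $\ell_\infty$ threat model; the names (`pgd_attack`, `adversarial_training_step`, `epsilon`, `alpha`, `K`) are illustrative rather than taken from a specific implementation:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, K):
    """PGD-K under the l_inf threat model: K signed-gradient ascent steps of
    size alpha, each projected back onto the epsilon-ball around the clean x."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)   # random start
    delta.requires_grad_(True)
    for _ in range(K):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
        delta = (x + delta).clamp(0, 1) - x                   # keep pixels in [0, 1]
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y,
                              epsilon=8/255, alpha=2/255, K=7):
    """Outer minimisation: one optimiser update on the adversarial batch."""
    x_adv = pgd_attack(model, x, y, epsilon, alpha, K)        # inner maximisation
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)                   # loss on adversarial inputs only
    loss.backward()
    optimizer.step()
    return loss.item()
```

Each call to `pgd_attack` costs $K$ forward/backward passes and the outer update one more, matching the $K{+}1$ figure above.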
Empirical findings.
- On CIFAR-10 with $\epsilon = 8/255$ ($\ell_\infty$), adversarially trained ResNets achieve $\sim 87\%$ clean accuracy and $\sim 47\%$ robust accuracy, versus $0\%$ robust accuracy for standard training.
- There is a clean–robust trade-off: robust models give up some clean accuracy. Tsipras et al. (2019) argue this trade-off is fundamental: a robust model must ignore predictive but non-robust features, effectively shrinking the usable hypothesis class.
- Sample complexity is higher: robust generalisation requires substantially more data, motivating semi-supervised approaches that leverage unlabelled data (Carmon et al., 2019).
Variants.
TRADES. Decomposes the loss into a natural-error term and a robustness regulariser: $$\mathcal{L}_{\text{TRADES}} = \mathcal{L}(f_\theta(\mathbf{x}), y) + \beta \cdot \mathrm{KL}\!\big(f_\theta(\mathbf{x}) \,\|\, f_\theta(\mathbf{x}')\big),$$ tuning the trade-off via $\beta$ (Zhang et al., 2019).
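A sketch of the TRADES objective in the same style, assuming `model` returns logits and `x_adv` has already been generated by maximising the KL term inside the $\epsilon$-ball (names are illustrative):

```python
import torch.nn.functional as F

def trades_loss(model, x, x_adv, y, beta):
    """Natural cross-entropy plus a beta-weighted KL divergence between the
    predictive distributions on clean and adversarial inputs."""
    logits_clean = model(x)
    logits_adv = model(x_adv)
    natural = F.cross_entropy(logits_clean, y)
    # KL(f(x) || f(x')): target = clean distribution, input = adversarial log-probs
    robust = F.kl_div(F.log_softmax(logits_adv, dim=1),
                      F.softmax(logits_clean, dim=1),
                      reduction="batchmean")
    return natural + beta * robust
```

Larger $\beta$ shifts the balance towards robustness at the expense of clean accuracy.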
Free adversarial training. Reuses gradient computations across the inner attack and the outer update to amortise cost (Shafahi et al., 2019).
Fast adversarial training. Uses single-step FGSM with random initialisation and a step size slightly larger than $\epsilon$; this recovers much of the robustness of PGD training at a fraction of the cost, but is prone to "catastrophic overfitting", where robustness to multi-step attacks suddenly collapses (Wong et al., 2020).
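A sketch of this single-step attack in the same style, again under an $\ell_\infty$ budget with illustrative names:

```python
import torch
import torch.nn.functional as F

def fgsm_rs_attack(model, x, y, epsilon, alpha):
    """FGSM with random start: one signed-gradient step of size alpha from a
    uniformly random point in the epsilon-ball, projected back onto the ball."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
    delta.requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad, = torch.autograd.grad(loss, delta)
    delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
    return (x + delta).clamp(0, 1).detach()
```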
Curriculum adversarial training. Slowly grows $\epsilon$ during training to ease optimisation.
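One simple curriculum is a linear ramp on the budget; the schedule below is an illustrative choice, not one prescribed by a particular paper:

```python
def epsilon_schedule(epoch, warmup_epochs, epsilon_max):
    """Grow the perturbation budget linearly from 0 to epsilon_max over the
    warm-up epochs, then hold it fixed for the remainder of training."""
    return epsilon_max * min(1.0, epoch / warmup_epochs)
```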
Limitations.
- Robustness is specific to the threat model: a model robust to $\ell_\infty$ perturbations may fail under $\ell_2$ perturbations, rotations, or natural distribution shift.
- Adversarial training does not certify robustness; formal guarantees require certified methods such as randomised smoothing or interval-bound propagation.
- It is roughly $K$ times slower than standard training, making it expensive for large models.
Adversarial training nonetheless remains the de facto baseline against which all empirical defences are compared, and forms the foundation for certifiably robust training pipelines.
Related terms: Adversarial Examples, PGD Attack, KL Divergence, Gradient Descent
Discussed in:
- Chapter 12: Sequence Models, Robustness