Adversarial training is the most reliable empirical defence against adversarial examples. Formalised by Madry et al. (2018), it casts robust learning as a saddle-point problem: minimise the worst-case loss over a perturbation budget rather than the average loss on clean inputs.
Min-max objective. Given a model $f_\theta$, training distribution $\mathcal{D}$, perturbation set $\mathcal{S} = \{\boldsymbol{\delta} : \|\boldsymbol{\delta}\|_p \le \epsilon\}$, and loss $\mathcal{L}$,
$$\min_\theta\; \mathbb{E}_{(\mathbf{x}, y) \sim \mathcal{D}} \!\left[\, \max_{\boldsymbol{\delta} \in \mathcal{S}} \mathcal{L}\big(\theta, \mathbf{x} + \boldsymbol{\delta}, y\big) \right].$$
The inner maximisation has no closed-form solution but is well approximated by a strong attack, typically PGD (projected gradient descent).
Algorithm.
- Sample mini-batch $\{(\mathbf{x}_i, y_i)\}$.
- For each $\mathbf{x}_i$, run PGD-$K$ to produce an adversarial example $\mathbf{x}_i' = \mathbf{x}_i + \boldsymbol{\delta}_i^*$.
- Compute the loss on adversarial inputs: $\frac{1}{B} \sum_i \mathcal{L}(\theta, \mathbf{x}_i', y_i)$.
- Back-propagate and update $\theta$ via SGD or Adam.
The cost is roughly $K{+}1$ forward/backward passes per training step, where $K \approx 7$ on CIFAR-10 or $K \approx 3$ on ImageNet to keep training tractable.
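A minimal PyTorch sketch of one such training step under the $\ell_\infty$ threat model; the names (`pgd_attack`, `adversarial_training_step`, `epsilon`, `alpha`, `K`) are illustrative rather than taken from a specific implementation:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon, alpha, K):
    """PGD-K under the l_inf threat model: K signed-gradient ascent steps of
    size alpha, each projected back onto the epsilon-ball around the clean x."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)   # random start
    delta.requires_grad_(True)
    for _ in range(K):
        loss = F.cross_entropy(model(x + delta), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
        delta = (x + delta).clamp(0, 1) - x                   # keep pixels in [0, 1]
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

def adversarial_training_step(model, optimizer, x, y,
                              epsilon=8/255, alpha=2/255, K=7):
    """Outer minimisation: one optimiser update on the adversarial batch."""
    x_adv = pgd_attack(model, x, y, epsilon, alpha, K)        # inner maximisation
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)                   # loss on adversarial inputs only
    loss.backward()
    optimizer.step()
    return loss.item()
```

Each call to `pgd_attack` costs $K$ forward/backward passes and the outer update one more, matching the $K{+}1$ figure above.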
Empirical findings.
- On CIFAR-10 with $\epsilon = 8/255$ ($\ell_\infty$), adversarially trained ResNets achieve $\sim 87\%$ clean accuracy and $\sim 47\%$ robust accuracy, versus $0\%$ robust accuracy for standard training.
- There is a clean–robust trade-off: robust models give up some clean accuracy. Tsipras et al. (2019) argue this trade-off is fundamental: a robust model must ignore predictive but non-robust features, effectively shrinking the usable hypothesis class.
- Sample complexity is higher: robust generalisation requires substantially more data, motivating semi-supervised approaches that leverage unlabelled data (Carmon et al., 2019).
Variants.
TRADES. Decomposes the loss into a natural-error term and a robustness regulariser: $$\mathcal{L}_{\text{TRADES}} = \mathcal{L}(f_\theta(\mathbf{x}), y) + \beta \cdot \mathrm{KL}\!\big(f_\theta(\mathbf{x}) \,\|\, f_\theta(\mathbf{x}')\big),$$ tuning the trade-off via $\beta$ (Zhang et al., 2019).
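A sketch of the TRADES objective in the same style, assuming `model` returns logits and `x_adv` has already been generated by maximising the KL term inside the $\epsilon$-ball (names are illustrative):

```python
import torch.nn.functional as F

def trades_loss(model, x, x_adv, y, beta):
    """Natural cross-entropy plus a beta-weighted KL divergence between the
    predictive distributions on clean and adversarial inputs."""
    logits_clean = model(x)
    logits_adv = model(x_adv)
    natural = F.cross_entropy(logits_clean, y)
    # KL(f(x) || f(x')): target = clean distribution, input = adversarial log-probs
    robust = F.kl_div(F.log_softmax(logits_adv, dim=1),
                      F.softmax(logits_clean, dim=1),
                      reduction="batchmean")
    return natural + beta * robust
```

Larger $\beta$ shifts the balance towards robustness at the expense of clean accuracy.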
Free adversarial training. Reuses gradient computations across the inner attack and the outer update to amortise cost (Shafahi et al., 2019).
Fast adversarial training. Uses single-step FGSM with random initialisation and a step size slightly larger than $\epsilon$; this recovers much of the robustness of PGD training at a fraction of the cost, but is prone to "catastrophic overfitting", where robustness to multi-step attacks suddenly collapses (Wong et al., 2020).
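A sketch of this single-step attack in the same style, again under an $\ell_\infty$ budget with illustrative names:

```python
import torch
import torch.nn.functional as F

def fgsm_rs_attack(model, x, y, epsilon, alpha):
    """FGSM with random start: one signed-gradient step of size alpha from a
    uniformly random point in the epsilon-ball, projected back onto the ball."""
    delta = torch.empty_like(x).uniform_(-epsilon, epsilon)
    delta.requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad, = torch.autograd.grad(loss, delta)
    delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon)
    return (x + delta).clamp(0, 1).detach()
```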
Curriculum adversarial training. Slowly grows $\epsilon$ during training to ease optimisation.
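One simple curriculum is a linear ramp on the budget; the schedule below is an illustrative choice, not one prescribed by a particular paper:

```python
def epsilon_schedule(epoch, warmup_epochs, epsilon_max):
    """Grow the perturbation budget linearly from 0 to epsilon_max over the
    warm-up epochs, then hold it fixed for the remainder of training."""
    return epsilon_max * min(1.0, epoch / warmup_epochs)
```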
Limitations.
- Robustness is specific to the threat model: a model robust to $\ell_\infty$ perturbations may fail under $\ell_2$ perturbations, rotations, or natural distribution shift.
- Adversarial training does not certify robustness; formal guarantees require certified methods such as randomised smoothing or interval-bound propagation.
- It is roughly $K$ times slower than standard training, making it expensive for large models.
Adversarial training nonetheless remains the de facto baseline against which all empirical defences are compared, and forms the foundation for certifiably robust training pipelines.
Related terms: Adversarial Examples, PGD Attack, KL Divergence, Gradient Descent
Discussed in:
- Chapter 12: Sequence Models, Robustness