Projected Gradient Descent (PGD) is the most widely used iterative adversarial attack, introduced by Madry et al. (2018) as the adversary that adversarial training should defend against. It is the standard benchmark for empirical robustness evaluation under $\ell_\infty$ and $\ell_2$ constraints.
Definition. Given a clean input $\mathbf{x}$ with label $y$, a loss $\mathcal{L}$ (typically cross-entropy), and a perturbation budget $\epsilon$ in some $\ell_p$ norm, PGD iterates
$$\mathbf{x}_{t+1} = \Pi_{\mathcal{S}}\!\Big(\mathbf{x}_t + \alpha \cdot \mathrm{sign}\!\big(\nabla_{\mathbf{x}} \mathcal{L}(\theta, \mathbf{x}_t, y)\big)\Big),$$
where $\mathcal{S} = \{\mathbf{x}' : \|\mathbf{x}' - \mathbf{x}\|_\infty \le \epsilon\}$ and $\Pi_{\mathcal{S}}$ is the Euclidean projection onto $\mathcal{S}$ (for $\ell_\infty$ this is element-wise clipping). The step size $\alpha$ is typically set to $\epsilon / 4$ and the attack runs for $T = 10$–$100$ iterations.
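A minimal PyTorch sketch of the $\ell_\infty$ update above. The function name `pgd_linf`, the default hyperparameters, and the final clamp to a $[0, 1]$ pixel range are illustrative assumptions rather than part of the definition; `x_init` lets a caller supply a starting point inside the ball (used for the random starts discussed below).

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8/255, alpha=2/255, steps=20, x_init=None):
    """L_inf PGD: signed-gradient ascent steps projected back onto the
    eps-ball around x. `x_init` optionally supplies a starting point
    inside the ball; by default the attack starts at x itself."""
    x_adv = (x if x_init is None else x_init).clone().detach()
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascent step in the sign of the gradient.
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection onto S: element-wise clipping to the eps-ball around x,
        # then a clamp to [0, 1] so iterates stay valid images (an assumption
        # about the input range, not part of the definition above).
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```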
For the $\ell_2$ variant the update direction is $\nabla_{\mathbf{x}} \mathcal{L} / \|\nabla_{\mathbf{x}} \mathcal{L}\|_2$, and the projection rescales any iterate outside the ball back to the surface.
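A sketch of the corresponding $\ell_2$ step and projection under the same assumptions (batched image tensors; the helper name `l2_step` is hypothetical): move along the normalised gradient, then rescale the perturbation back onto the ball surface whenever the iterate has left the ball.

```python
import torch

def l2_step(x, x_adv, grad, eps, alpha):
    """One L2 PGD step: normalised-gradient ascent, then rescale the
    perturbation onto the eps-ball surface if the iterate left the ball."""
    view = (-1,) + (1,) * (grad.dim() - 1)            # broadcast per-example norms
    g_norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(view)
    x_adv = x_adv + alpha * grad / g_norm             # normalised ascent direction
    delta = x_adv - x
    d_norm = delta.flatten(1).norm(dim=1).clamp_min(1e-12).view(view)
    return x + delta * (eps / d_norm).clamp(max=1.0)  # shrink only if outside the ball
```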
Initialisation. PGD usually starts from a random point in $\mathcal{S}$:
$$\mathbf{x}_0 = \mathbf{x} + \mathbf{u}, \quad \mathbf{u} \sim \mathrm{Uniform}([-\epsilon, \epsilon]^d).$$
Random restarts (running PGD from several random initialisations and keeping the strongest adversarial example) noticeably increase attack success rates, because a single run can stall in a poor local optimum of the loss surface, the same place where simpler attacks like FGSM stop.
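A sketch of random initialisation with restarts, reusing the `pgd_linf` sketch above; the per-example bookkeeping and the `restarts=10` default mirror the evaluation protocol listed below but are otherwise assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_restarts(model, x, y, eps=8/255, alpha=2/255, steps=20, restarts=10):
    """Run pgd_linf from several uniform-random starts inside the eps-ball
    and keep, per example, the iterate with the highest cross-entropy loss."""
    best_adv = x.clone().detach()
    best_loss = torch.full((x.shape[0],), -float("inf"), device=x.device)
    for _ in range(restarts):
        # x_0 = x + u,  u ~ Uniform([-eps, eps]^d), clipped to valid pixels.
        x0 = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
        x_adv = pgd_linf(model, x, y, eps=eps, alpha=alpha, steps=steps, x_init=x0)
        with torch.no_grad():
            per_example = F.cross_entropy(model(x_adv), y, reduction="none")
        better = per_example > best_loss
        best_loss = torch.where(better, per_example, best_loss)
        best_adv[better] = x_adv[better]
    return best_adv
```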
Why iterative. FGSM is a single PGD step with $\alpha = \epsilon$ and no random start; it is fast but relies on a local linear approximation of the loss, which often breaks down and leads the attack to underestimate true robustness. PGD's smaller steps trace the loss surface more accurately, and the projection keeps each iterate inside the threat model.
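For comparison, FGSM written as that single signed-gradient step, under the same assumptions as the sketches above:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    """FGSM = one PGD step with alpha = eps and no random start."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x.detach() + eps * grad.sign()).clamp(0.0, 1.0)
```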
Universal first-order adversary. Madry et al. argue that PGD is, empirically, the strongest first-order $\ell_\infty$ attack: a model robust against PGD with sufficient steps and restarts is, in practice, robust against all other gradient-based attacks under the same threat model. This conjecture has held up well across benchmarks (CIFAR-10, ImageNet), although it only covers first-order attacks: black-box attacks and stronger ensembles such as AutoAttack can still uncover weaknesses PGD misses, and certified defences provide guarantees that no empirical attack can.
Use in evaluation. A standard robustness report on CIFAR-10 includes:
- Clean accuracy.
- Accuracy under $\ell_\infty$ PGD-20 with $\epsilon = 8/255$, $\alpha = 2/255$, 10 random restarts (a minimal evaluation loop is sketched after this list).
- Accuracy under stronger ensembles like AutoAttack.
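A minimal sketch of such an evaluation, assuming the `pgd_restarts` sketch above and a standard PyTorch test loader; the function name `robust_accuracy`, the `test_loader`, and the device handling are illustrative.

```python
import torch

def robust_accuracy(model, loader, attack, device="cuda"):
    """Clean and adversarial accuracy over a test loader. `attack` is any
    callable (model, x, y) -> x_adv, e.g. pgd_restarts from the sketch above."""
    model.eval()
    clean, adv, total = 0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            clean += (model(x).argmax(dim=1) == y).sum().item()
        x_adv = attack(model, x, y)        # gradients are needed inside the attack
        with torch.no_grad():
            adv += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.shape[0]
    return clean / total, adv / total

# PGD-20 with eps = 8/255, alpha = 2/255 and 10 random restarts:
# clean_acc, pgd_acc = robust_accuracy(
#     model, test_loader,
#     lambda m, xb, yb: pgd_restarts(m, xb, yb, eps=8/255, alpha=2/255,
#                                    steps=20, restarts=10))
```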
Use in adversarial training. PGD is also the inner-loop oracle in adversarial training: at each training step, generate adversarial examples for the current mini-batch using PGD, then update the model parameters on the loss evaluated at those worst-case points. This min-max formulation remains among the strongest empirical defences against $\ell_\infty$ perturbations.
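A sketch of one epoch of PGD adversarial training under the same assumptions (it reuses `pgd_linf` from above); real recipes differ in details such as step counts, learning-rate schedules, and how batch-norm statistics are handled during the attack.

```python
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer,
                               eps=8/255, alpha=2/255, steps=10, device="cuda"):
    """One epoch of PGD adversarial training: approximately solve the inner
    maximisation with PGD, then take an optimiser step on the worst-case loss."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # Inner maximisation: PGD from a random start inside the eps-ball.
        x0 = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
        x_adv = pgd_linf(model, x, y, eps=eps, alpha=alpha, steps=steps, x_init=x0)
        # Outer minimisation: update parameters on the adversarial batch.
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        optimizer.step()
```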
Caveats.
- PGD inherits the limitations of gradient-based attacks: defences that mask or obfuscate gradients can make models appear robust without truly being so. Athalye, Carlini and Wagner (2018) broke multiple published defences that had resisted PGD by using backward-pass differentiable approximation (BPDA) and related techniques.
- Robustness against $\ell_\infty$ perturbations does not generally transfer to other threat models such as semantic perturbations (rotations, colour shifts).
Related terms: Adversarial Examples, Adversarial Training, Gradient Descent, Cross-Entropy Loss
Discussed in:
- Chapter 12: Sequence Models, Robustness