16.8 Adversarial attacks
Deep networks are brittle in a peculiar way. A model that classifies a panda image with 99% confidence can be made to classify the same image, altered by a noise pattern that no human eye can detect, as a gibbon, with 99% confidence. The same brittleness afflicts language models: a fluent, helpful assistant that politely declines to explain how to manufacture a nerve agent will, given the right twenty tokens of gibberish appended to the request, comply in detail. These two phenomena, pixel-space adversarial examples and token-space jailbreaks, are mathematically distinct but conceptually one. In each case, an adversary with a small budget exploits the model's high-dimensional input space to find a point that the model classifies, or generates, very differently from what the surrounding training distribution suggests it should.
Robustness is the sub-field of AI safety that studies these failures. It matters for two reasons. First, it is a security concern: a self-driving car that misreads a stop sign defaced with four small stickers, a malware classifier that lets through a binary with three flipped bits, and a content moderator that approves a hate-speech post with a homoglyph substitution are all exploitable in deployment. Second, adversarial examples are diagnostic: they are evidence that the geometry of a neural network's decision boundary, in input space, is nothing like the smooth, locally faithful surface that humans intuitively expect. Understanding why the boundary looks the way it does is part of understanding why deep learning works at all.
Adversarial examples
The phenomenon was first reported by Szegedy et al. in 2013 as a curiosity: small perturbations, found by gradient-based optimisation, could change the output of a deep network. Goodfellow, Shlens and Szegedy 2015 turned the curiosity into a research programme. They showed that the perturbations were not rare pathological points; they were dense and predictable, and a one-step linearised attack found them just as readily as expensive optimisation. The canonical illustration shows a panda image, classified correctly with 57.7% confidence; an $\ell_\infty$ perturbation of magnitude $0.007$ (so small that on an 8-bit image it changes each pixel by at most two grey-levels) added; and the resulting image classified as a gibbon with 99.3% confidence. To a human, the two images are indistinguishable.
The right way to think about adversarial examples is geometric. A modern image classifier maps a 224-by-224 RGB image, a point in roughly 150,000-dimensional space, to one of a thousand classes. The decision boundary between any two classes is a hypersurface in that very high-dimensional space. Around any given training point, there is a great deal of room: an $\ell_\infty$ ball of radius $\epsilon$ has volume $(2\epsilon)^d$, a vanishing fraction of the unit cube, yet at $d \approx 150{,}000$ it still contains an astronomical number of distinct images; at $\epsilon = 8/255$, each 8-bit pixel can independently take any of 17 values, giving $17^d$ candidates. Within that region, the decision boundary is close enough to be reached by a single gradient step. This is the "high-dimensional geometry" intuition for why adversarial examples are unavoidable for any model that is not constructed to be smooth.
Two further empirical features make the picture sharper. Adversarial examples transfer: a perturbation crafted against one model often fools a different model trained on the same task, even one with a different architecture. This is what enables black-box attacks: an adversary with no access to the deployed model can train a surrogate, attack it, and reuse the perturbation. And adversarial examples are universal in a weaker sense: a single perturbation pattern can be added to many different inputs and degrade the classifier on most of them, suggesting that the directions of adversarial vulnerability are shared across inputs, not specific to each one.
Adversarial examples are not confined to images. Audio classifiers can be fooled by inaudible high-frequency additions; speech-to-text systems can be made to transcribe one sentence as another by perturbations that sound like background hiss. Malware classifiers can be evaded by binary-level edits that preserve program behaviour. In the physical world, adversarial patches, small printed stickers or patterns on glasses frames, can fool face recognition and traffic-sign classifiers, and this is the threat model that has driven most of the safety-critical attention to robustness. The phenomenon is therefore not an artefact of pixel-space; it is a property of high-dimensional learned classifiers in general.
Attack methods
The standard taxonomy distinguishes white-box attacks (the adversary has $\nabla_\mathbf{x}\mathcal{L}$) from black-box attacks (the adversary sees only the model's outputs), and targeted attacks (force a specific wrong class) from untargeted ones (any wrong class will do). Orthogonal to both is the perturbation threat model: usually $\ell_\infty$ for vision, though $\ell_2$, $\ell_0$ and Wasserstein budgets all appear.
FGSM, the Fast Gradient Sign Method of Goodfellow et al. 2015, is the simplest white-box attack. Take a single step in the direction of the sign of the input gradient, scaled to the budget: $$\mathbf{x}' = \mathbf{x} + \epsilon \cdot \text{sign}\bigl(\nabla_\mathbf{x} \mathcal{L}(\theta, \mathbf{x}, y)\bigr).$$ The argument for why a one-step attack works is the linearity hypothesis: locally, the loss is well-approximated by its first-order Taylor expansion, $\mathcal{L}(\mathbf{x} + \boldsymbol{\delta}) \approx \mathcal{L}(\mathbf{x}) + \boldsymbol{\delta}^\top \nabla_\mathbf{x}\mathcal{L}$, and over an $\ell_\infty$ ball that linearised expression is maximised by $\boldsymbol{\delta} = \epsilon\,\text{sign}(\nabla_\mathbf{x}\mathcal{L})$. FGSM is fast, it transfers, and it works.
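In code, FGSM is a few lines. The sketch below assumes a PyTorch classifier `model` that returns logits and an input batch `x` scaled to $[0,1]$; both names are placeholders, not a specific library API.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """One-step FGSM: ascend the loss along the sign of the input gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # The linearised maximiser over the l_inf ball: eps * sign(grad).
    x_adv = x + eps * x.grad.sign()
    # Keep the result a valid image.
    return x_adv.clamp(0.0, 1.0).detach()
```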
PGD, the Projected Gradient Descent attack of Madry et al. 2017, is the multi-step strengthening. Iterate $$\mathbf{x}^{(k+1)} = \Pi_{B_\epsilon(\mathbf{x})}\bigl(\mathbf{x}^{(k)} + \alpha\,\text{sign}(\nabla_\mathbf{x}\mathcal{L}(\theta, \mathbf{x}^{(k)}, y))\bigr),$$ projecting back into the $\ell_\infty$ ball after each step. With $\alpha \approx \epsilon / 4$, 7–20 steps and a small random initialisation inside the ball, PGD finds points that are dramatically harder for the model than FGSM does. PGD is the standard first-order attack: any defence that is not robust to PGD is not robust at all, and "robust accuracy under PGD" is the headline metric in the field.
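A matching PGD sketch, under the same assumptions as the FGSM snippet above: random start inside the ball, signed steps, projection after each step.

```python
def pgd(model, x, y, eps, alpha=None, steps=20):
    """Multi-step l_inf PGD: FGSM-style steps of size alpha, each followed by
    projection back into the eps-ball around x and the valid pixel range."""
    alpha = alpha if alpha is not None else eps / 4
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # Project: clip coordinate-wise to [x - eps, x + eps], then to [0, 1].
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```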
Carlini–Wagner 2017 is the optimisation-based attack that minimises the perturbation norm subject to misclassification, using a smooth surrogate of the constraint and tuned solver schedules. It is expensive but usually finds the smallest perturbation, and is the attack of choice for evaluating certified defences and for breaking defences that look robust under PGD.
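In its $\ell_2$ form the attack solves, for a target class $t$ with logits $Z$ (the standard formulation from the paper, stated without the box-constraint change of variables): $$\min_{\boldsymbol{\delta}}\; \|\boldsymbol{\delta}\|_2^2 + c \cdot \max\Bigl(\max_{i \neq t} Z(\mathbf{x}+\boldsymbol{\delta})_i - Z(\mathbf{x}+\boldsymbol{\delta})_t,\, -\kappa\Bigr),$$ where $\kappa$ is a confidence margin and $c$ is found by binary search; the smooth surrogate in the second term is what makes the misclassification constraint optimisable by gradient descent, and a change of variables keeps $\mathbf{x}+\boldsymbol{\delta}$ inside the valid pixel box.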
Black-box attacks dispense with gradients. Square Attack and related random-search methods query the model with structured perturbations and accept those that increase the loss; ZOO and NES estimate gradients from finite differences in random directions. AutoAttack (Croce and Hein 2020) is the standard ensemble: a parameter-free combination of two PGD variants (APGD), the FAB minimum-norm attack and Square Attack, which together give a tight upper bound on a defence's true robust accuracy and have replaced ad-hoc evaluation suites in the literature.
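The finite-difference idea is simple enough to sketch. The snippet below is an NES-style estimator under the same PyTorch assumptions as before; `loss_fn` is a placeholder for whatever scalar loss the attacker can compute from the model's outputs alone.

```python
def nes_gradient(loss_fn, x, sigma=1e-3, n=50):
    """Black-box gradient estimate from 2*n loss queries: antithetic
    Gaussian probes, no access to model internals required."""
    g = torch.zeros_like(x)
    for _ in range(n):
        u = torch.randn_like(x)
        g += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return g / (2 * n * sigma)
```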
Defences
The defensive literature is a graveyard. A canonical paper, Athalye, Carlini and Wagner 2018, showed that nearly every defence published in the previous two years had been broken by adaptive attacks: the defences worked only because the evaluators had not tried hard enough. The lesson, that robustness has to be measured under a strong adaptive attack rather than against the weak attacks the defender chose, has reshaped how the field reports results.
Adversarial training (Madry et al. 2017) is the dominant defence. Replace the empirical-risk objective with a min-max: $$\min_\theta \mathbb{E}_{(\mathbf{x},y) \sim D}\Bigl[\max_{\boldsymbol{\delta}: \|\boldsymbol{\delta}\|_p \le \epsilon} \mathcal{L}(\theta, \mathbf{x}+\boldsymbol{\delta}, y)\Bigr].$$ The inner maximisation is approximated by PGD; the outer minimisation is standard SGD over the resulting adversarial examples. Adversarial training trades clean accuracy for robust accuracy: on CIFAR-10 with $\epsilon = 8/255$, a standard ResNet gets ~95% clean and ~0% robust accuracy, while a Madry-trained model gets ~87% clean and ~50% robust. It is computationally expensive, each step costs a PGD attack plus an SGD update, but, unlike most heuristic defences, it has not been broken.
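One training step, reusing the `pgd` sketch above; the 7-step inner attack follows the text, and `model` and `optimizer` are the usual placeholders.

```python
def adversarial_training_step(model, optimizer, x, y, eps):
    """Min-max step: inner maximisation by PGD, outer SGD on the result."""
    model.eval()   # freeze batch-norm statistics while crafting the attack
    x_adv = pgd(model, x, y, eps, alpha=eps / 4, steps=7)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)  # train on adversarial points only
    loss.backward()
    optimizer.step()
    return loss.item()
```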
Randomised smoothing (Cohen et al. 2019) gives a certified guarantee rather than an empirical one. Build a smoothed classifier $g(\mathbf{x}) = \arg\max_c \Pr_{\boldsymbol{\delta} \sim \mathcal{N}(0, \sigma^2 I)}[f(\mathbf{x} + \boldsymbol{\delta}) = c]$, the most likely class of the base classifier $f$ under Gaussian noise. The Neyman–Pearson lemma then implies that if $f$ returns the top class $c_A$ with probability $p_A$ under that noise and the runner-up class has probability $p_B$, the prediction of $g$ is invariant under any $\ell_2$ perturbation of norm at most $\sigma\bigl(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\bigr)/2$, where $\Phi$ is the Gaussian CDF. The radius is provable, not measured. In exchange, you accept Monte-Carlo sampling at inference and a smaller robust radius than empirical methods report.
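A minimal sketch of prediction and certification, using plug-in estimates of $p_A$ and $p_B$ for brevity; the paper's procedure replaces these with one-sided confidence bounds, which matters in practice. Here `f` is a placeholder base classifier returning a class index.

```python
from scipy.stats import norm

def smooth_and_certify(f, x, sigma, n=1000, num_classes=10):
    """Vote over Gaussian-noised copies of x, then convert the top-two
    vote fractions into a certified l_2 radius for the smoothed classifier."""
    counts = torch.zeros(num_classes)
    for _ in range(n):
        counts[f(x + sigma * torch.randn_like(x))] += 1
    top = counts.topk(2)
    p_a, p_b = (top.values / n).tolist()
    # Note: plug-in estimates can give p_a = 1.0 and an infinite radius;
    # the confidence-bound version in the paper avoids this.
    radius = sigma * (norm.ppf(p_a) - norm.ppf(p_b)) / 2 if p_a > p_b else 0.0
    return int(top.indices[0]), radius
```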
Input preprocessing (JPEG compression, total-variation denoising, median filters, bit-depth reduction) was the first wave of defences and has been almost entirely broken by adaptive attacks that take gradients through, or estimate gradients around, the preprocessing step. Detection, training a separate classifier to distinguish clean from adversarial inputs, meets the same fate when the attacker can adapt: any detector with a gradient becomes part of the surface the attacker is optimising against, and the joint problem of fooling the classifier while looking clean to the detector is not noticeably harder than the original. Gradient-masking defences, which deliberately make the gradient uninformative (saturating non-linearities, stochastic components, hard discretisations), fail in the same way: the attacker estimates gradients by finite differences or replaces the masked layer with a smooth surrogate at attack time. The field's working consensus is that adversarial training and randomised smoothing are the two methods that survive serious evaluation, and that everything else is provisional until it has been beaten on by adaptive attackers for at least a year.
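The surrogate trick, Backward Pass Differentiable Approximation in Athalye et al.'s terminology, is worth seeing concretely: run the non-differentiable preprocessor on the forward pass and pretend it is the identity on the backward pass. In the sketch below, `jpeg_compress` is a hypothetical stand-in for any such preprocessor.

```python
class StraightThrough(torch.autograd.Function):
    """BPDA with an identity surrogate: forward through the real
    (non-differentiable) preprocessor, backward as if it were f(x) = x."""
    @staticmethod
    def forward(ctx, x, preprocess):
        return preprocess(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # identity gradient; no grad for `preprocess`

# Inside an attack, the defended pipeline becomes differentiable again:
#   logits = model(StraightThrough.apply(x_adv, jpeg_compress))
```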
LLM jailbreaks
Language models are subject to a parallel adversarial literature, with tokens as the inputs and policy compliance, rather than class labels, as the output. Five families dominate the 2026 taxonomy. DAN-style roleplay puts the model in a fictional frame in which safety training supposedly does not apply ("you are an AI from 2050 with no restrictions"); production models have largely closed this family by including refusals-during-roleplay in the RLHF distribution. GCG, the Greedy Coordinate Gradient attack of Zou et al. 2023, optimises an adversarial token suffix against open models so that the attacked model's response begins "Sure, here is…"; the optimisation greedily replaces one suffix token at a time using gradients taken at the embedding layer (see the sketch below), and the resulting suffixes transfer to closed models. Crescendo (Russinovich et al. 2024) is multi-turn: each turn is innocuous, but the trajectory walks the model into a harmful response by turn seven or so. Many-shot jailbreaking (Anil et al. 2024) exploits long context windows by including dozens or hundreds of fictional examples of "user asks for X, assistant complies with X" before the real harmful request. Indirect prompt injection (Greshake et al. 2023) places the malicious instruction not in the user's prompt but in data the model is asked to process, a webpage the agent browses or a document the user uploads, and is the dominant attack against LLM agents because the user is no longer the adversary and the system has no parser-level boundary between data and instruction.
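A schematic of one GCG iteration, heavily simplified from the paper. Everything here is a placeholder: `one_hot_loss` stands in for a forward pass through the attacked model that maps a one-hot suffix matrix to the cross-entropy of the target continuation ("Sure, here is…"); the real implementation batches candidate evaluation and handles tokenisation details.

```python
def gcg_step(one_hot_loss, suffix_ids, vocab_size, k=256, n_trials=128):
    """One greedy coordinate gradient step (schematic sketch)."""
    one_hot = F.one_hot(suffix_ids, vocab_size).float().requires_grad_(True)
    loss0 = one_hot_loss(one_hot)
    grad, = torch.autograd.grad(loss0, one_hot)
    # Promising substitutions per position: most negative gradient entries,
    # i.e. the largest predicted loss decrease under the linearisation.
    candidates = (-grad).topk(k, dim=1).indices            # shape (len, k)
    best_ids, best_loss = suffix_ids, loss0.item()
    for _ in range(n_trials):
        pos = torch.randint(len(suffix_ids), (1,)).item()  # random position
        trial = suffix_ids.clone()
        trial[pos] = candidates[pos, torch.randint(k, (1,)).item()]
        loss = one_hot_loss(F.one_hot(trial, vocab_size).float()).item()
        if loss < best_loss:                               # greedy: keep best swap
            best_ids, best_loss = trial, loss
    return best_ids
```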
Why robustness is hard
Three structural reasons explain why, more than a decade after Szegedy's original paper, the field has not converged on a robust deep classifier. First, the input space is enormous. The number of distinct inputs inside an $\ell_\infty$ ball of radius $\epsilon$ grows exponentially with the dimension $d$; for a 224-by-224 RGB image, even small budgets give the adversary an effectively infinite search space, and the model only ever sees a vanishing fraction of any such ball during training. Second, generalisation is on-distribution, not on $\epsilon$-balls. Empirical risk minimisation gives guarantees on points drawn from the training distribution; it gives no guarantees on every point within an $\epsilon$-ball of every training point. Robustness is a different objective, and adversarial training optimises it directly, at a cost in clean accuracy, sample complexity and compute. Third, practical networks have huge Lipschitz constants. The product of the operator norms of the weight matrices through a deep network can be many orders of magnitude larger than one, so a small input perturbation can become a large change in the logits. Constraining the Lipschitz constant, by spectral normalisation or by architecturally enforced 1-Lipschitz layers, is one of the few approaches with theoretical bite, but it constrains expressivity at the same time and tends to give clean accuracies several points below an unconstrained model on the same dataset. The crude product bound is easy to compute, as the sketch below shows.
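A minimal sketch of that product bound for a fully connected PyTorch network; convolutions and normalisation layers need more care, so this is illustrative only.

```python
def lipschitz_upper_bound(model):
    """Upper-bound the network's Lipschitz constant by the product of the
    spectral norms of its weight matrices (assumes 1-Lipschitz activations
    such as ReLU; typically loose by orders of magnitude, but already huge)."""
    bound = 1.0
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            bound *= torch.linalg.matrix_norm(module.weight, ord=2).item()
    return bound
```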
There is a fourth, deeper reason that surfaces when one asks what robustness is for. The $\ell_p$ threat model is a mathematical convenience: it gives clean budgets, smooth optimisation and a neat geometric picture. But the threat model that matters in deployment is rarely an $\ell_\infty$ ball. A real attacker rotates the image, occludes part of it, prints a sticker, photographs from a different angle, or in the language case rephrases the request. Robustness to $\epsilon = 8/255$ in $\ell_\infty$ is necessary but nowhere near sufficient for any of those, and the literature on natural distribution shift (different lighting, different sensors, different demographics) is a separate strand that interacts with adversarial robustness only weakly. The field's standing position is that adversarial robustness is a real cost that has to be paid in capacity, in compute or in clean accuracy, and that pretending otherwise is what produces the broken-defence literature.
What you should take away
- Adversarial examples are tiny, gradient-crafted perturbations that flip predictions with high confidence; in vision they are imperceptible to humans, and they transfer between models.
- FGSM is the one-step linearised attack; PGD is the iterated, projected version and is the standard against which defences must be measured.
- Adversarial training, the min-max formulation that includes PGD-found examples in the training loop, is the dominant defence; randomised smoothing gives a weaker but certified guarantee.
- Most empirical defences proposed before 2018 were broken by adaptive attacks: robustness must be evaluated under a strong attacker who knows the defence, not under the defender's chosen suite.
- LLM jailbreaks, GCG suffixes, Crescendo, many-shot, indirect prompt injection, are the language-model analogue of input-space adversarial attacks, and the consensus in 2026 is that they cannot be eliminated, only mitigated through capability-bounded tools, role-conditioned trust and human-confirmed high-stakes actions.