Adversarial Examples, Glossary, Textbook of AI

Adversarial examples are inputs $\mathbf{x}'$ that are visually or semantically indistinguishable from a correctly classified input $\mathbf{x}$, yet cause a trained model to output a wrong label, often with high confidence. They were first reported for deep image classifiers by Szegedy et al. (2014) and have since been demonstrated for speech, text, and reinforcement-learning agents.

Threat model. An adversary chooses a perturbation $\boldsymbol{\delta}$ subject to a norm constraint $\|\boldsymbol{\delta}\|_p \le \epsilon$, where $p$ is typically one of the following:

$\ell_\infty$: each pixel changes by at most $\epsilon$ (e.g. $\epsilon = 8/255$).
$\ell_2$: the Euclidean norm of the perturbation (its total pixel energy) is at most $\epsilon$.
$\ell_0$: at most $\epsilon$ pixels change, each by any amount.

The adversarial example is $\mathbf{x}' = \mathbf{x} + \boldsymbol{\delta}$. Writing $y$ for the true label of $\mathbf{x}$, the goal is to make the classifier $f_\theta$ output $f_\theta(\mathbf{x}') \ne y$ (untargeted) or $f_\theta(\mathbf{x}') = y_{\text{target}}$ for a chosen $y_{\text{target}} \ne y$ (targeted).
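
As a rough illustration of the three budgets, the sketch below measures a perturbation under each norm and clips it back into an $\ell_\infty$ ball. The helper names `lp_norm` and `project_linf` and the four-pixel toy perturbation are assumptions made for this example, not part of any library.

```python
def lp_norm(delta, p):
    """Return the l_p norm of a flat perturbation for p in {0, 2, inf}."""
    if p == 0:
        # l_0 counts how many coordinates were changed at all.
        return sum(1 for d in delta if d != 0)
    if p == 2:
        return sum(d * d for d in delta) ** 0.5
    if p == float("inf"):
        return max(abs(d) for d in delta)
    raise ValueError("only p in {0, 2, inf} is handled in this sketch")


def project_linf(delta, eps):
    """Clip every coordinate into [-eps, eps] (projection onto the l_inf ball)."""
    return [max(-eps, min(eps, d)) for d in delta]


eps = 8 / 255
delta = [0.05, -0.01, 0.0, 0.02]      # hypothetical perturbation on four pixels
clipped = project_linf(delta, eps)    # first coordinate is reduced to eps

print(lp_norm(delta, 0), lp_norm(delta, 2), lp_norm(delta, float("inf")))
print(lp_norm(clipped, float("inf")) <= eps)
```

Clipping is the exact projection only for the $\ell_\infty$ ball; the $\ell_2$ and $\ell_0$ constraints need different projections, which are not shown here.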

Fast Gradient Sign Method (FGSM). Goodfellow, Shlens and Szegedy (2015) gave the canonical white-box attack:

$$\mathbf{x}' = \mathbf{x} + \epsilon \cdot \mathrm{sign}\!\left(\nabla_{\mathbf{x}} \mathcal{L}(\theta, \mathbf{x}, y)\right),$$

where $\mathcal{L}$ is the cross-entropy loss. This single gradient step linearises the loss around $\mathbf{x}$, and it is often enough to break undefended classifiers under $\ell_\infty$ budgets that humans cannot perceive. In the paper's well-known example, a perturbation with $\epsilon = 0.007$ flips the predicted class of a panda image to "gibbon" with $99.3\%$ confidence.
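
The update can be sketched on a model small enough to differentiate by hand. The example below assumes a binary logistic-regression classifier, for which the gradient of the cross-entropy loss with respect to the input is $(p - y)\,\mathbf{w}$; the weights, the input, and the function names are hypothetical, and no clipping to a valid pixel range is applied.

```python
import math


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(w, b, x, y):
    """Gradient of the binary cross-entropy loss w.r.t. x, which is (p - y) * w."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: move each coordinate by eps in the direction that raises the loss."""
    grad = input_gradient(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy model and input; the true label is y = 1.
w, b = [2.0, -3.0, 1.0, -1.5], 0.0
x, y = [0.7, 0.3, 0.5, 0.3], 1

x_adv = fgsm(w, b, x, y, eps=0.1)
print(predict_proba(w, b, x))      # above 0.5: classified as class 1
print(predict_proba(w, b, x_adv))  # below 0.5: prediction flips to class 0
```

For a deep network the same step would use a gradient obtained by automatic differentiation rather than a closed form.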

Why they exist. The dominant explanation is the linearity hypothesis: deep networks behave approximately linearly in the input over small neighbourhoods, so tiny perturbations that are coordinated across thousands of dimensions accumulate into a large change in the pre-softmax logits. As a result, decision boundaries lie surprisingly close to natural data points in the high-dimensional input space.
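
The accumulation can be made concrete under the simplifying assumption that the model's score is exactly linear, $\mathbf{w}^\top \mathbf{x}$, with $n$ input dimensions. Choosing the $\ell_\infty$-bounded perturbation $\boldsymbol{\delta} = \epsilon \cdot \mathrm{sign}(\mathbf{w})$ gives

$$\mathbf{w}^\top \mathbf{x}' = \mathbf{w}^\top \mathbf{x} + \epsilon\, \mathbf{w}^\top \mathrm{sign}(\mathbf{w}) = \mathbf{w}^\top \mathbf{x} + \epsilon \|\mathbf{w}\|_1 .$$

If the weights have average magnitude $m$, then $\|\mathbf{w}\|_1 = m n$ and the score shifts by $\epsilon m n$, which grows linearly with $n$ while $\|\boldsymbol{\delta}\|_\infty = \epsilon$ stays fixed. With illustrative values that are assumptions rather than measurements, $\epsilon = 8/255$, $m = 0.01$, and $n = 224 \times 224 \times 3 = 150528$, the shift is about $47.2$.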

Transferability. A perturbation crafted for one model often fools another model with a different architecture or training data. This enables black-box attacks in which the adversary trains a surrogate model, crafts examples against it, and transfers them to the target.
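
A minimal sketch of the transfer setting, assuming two hypothetical linear classifiers with similar but not identical weights: the perturbation is computed from the surrogate's gradient only, and the target is queried just for its prediction. Real transfer attacks involve deep networks, and their success rate varies; this toy case only shows the mechanics.

```python
import math


def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Class 1 if the logit is positive, else class 0."""
    return 1 if logit(w, b, x) > 0 else 0


def fgsm_linear(w, b, x, y, eps):
    """FGSM step for a logistic model, using the closed-form gradient (p - y) * w."""
    p = 1.0 / (1.0 + math.exp(-logit(w, b, x)))
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


# Hypothetical models: the attacker can differentiate the surrogate but not the target.
surrogate = ([2.0, -3.0, 1.0, -1.5], 0.0)
target = ([1.8, -2.6, 1.2, -1.6], 0.0)

x, y = [0.7, 0.3, 0.5, 0.3], 1
x_adv = fgsm_linear(*surrogate, x, y, eps=0.1)  # crafted on the surrogate only

clean_pred = predict(*target, x)      # target is correct on the clean input
adv_pred = predict(*target, x_adv)    # target is fooled by the transferred example
print(clean_pred, adv_pred)
```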

Beyond images.

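Text. Synonym substitutions and character flips alter classifier decisions in sentiment analysis and natural language inference (NLI). Because text is discrete, a gradient cannot be followed directly; attacks such as HotFlip instead use gradients to rank candidate character or token substitutions.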
Audio. Perturbations inaudible to humans cause speech-to-text systems to transcribe arbitrary phrases.
Physical world. Adversarial stickers on stop signs cause object detectors to misread them; adversarial T-shirts evade pedestrian detection.
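
For the text case, a greedy synonym-substitution attack can be sketched against a toy bag-of-words sentiment scorer. The word weights, the synonym table, and the function names below are all hypothetical, and the search is a plain greedy loop over score queries rather than the gradient-guided procedure used by HotFlip.

```python
# Hypothetical per-word sentiment weights; unknown words contribute 0.
WEIGHTS = {"great": 2.0, "good": 0.3, "fine": -0.2, "boring": -1.5}
# Hypothetical table of substitutions assumed to preserve meaning.
SYNONYMS = {"great": ["good", "fine"]}


def score(tokens):
    """Bag-of-words sentiment score; positive means 'positive sentiment'."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)


def greedy_substitute(tokens, max_swaps):
    """Repeatedly apply the single synonym swap that lowers the score the most."""
    tokens = list(tokens)
    for _ in range(max_swaps):
        if score(tokens) <= 0:
            break  # prediction already flipped
        best = None
        for i, tok in enumerate(tokens):
            for syn in SYNONYMS.get(tok, []):
                cand = tokens[:i] + [syn] + tokens[i + 1:]
                if best is None or score(cand) < score(best):
                    best = cand
        if best is None or score(best) >= score(tokens):
            break  # no swap helps
        tokens = best
    return tokens


original = ["great", "plot", "but", "boring"]
adversarial = greedy_substitute(original, max_swaps=1)
print(score(original), adversarial, score(adversarial))
```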

Implications.

Safety-critical deployment. Medical imaging models, autonomous driving, and content moderation cannot be assumed robust under adversarial conditions without explicit defences.
Evaluation standards. Reporting only clean accuracy is insufficient; robust accuracy under a defined threat model is required.
Theoretical interest. The phenomenon links to high-dimensional geometry, generalisation theory, and the alignment of model and human perception.
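
The gap between clean and robust accuracy can be computed exactly for a linear classifier, because the worst-case $\ell_\infty$ perturbation reduces the signed margin by exactly $\epsilon \|\mathbf{w}\|_1$. The sketch below relies on that closed form; the model, the four data points, and the function names are hypothetical. For deep networks no such closed form exists, and running an attack yields only an upper bound on robust accuracy.

```python
def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def clean_accuracy(w, b, data):
    """Fraction of (x, y) pairs classified correctly, with class 1 iff score > 0."""
    correct = sum(1 for x, y in data if (score(w, b, x) > 0) == (y == 1))
    return correct / len(data)


def robust_accuracy(w, b, data, eps):
    """Fraction of points still correct under the worst-case l_inf perturbation."""
    l1 = sum(abs(wi) for wi in w)
    correct = 0
    for x, y in data:
        # Signed margin: positive when the clean prediction is correct.
        margin = score(w, b, x) * (1 if y == 1 else -1)
        if margin - eps * l1 > 0:
            correct += 1
    return correct / len(data)


# Hypothetical linear model and labelled points.
w, b = [1.0, -1.0], 0.0
data = [([0.9, 0.1], 1), ([0.6, 0.5], 1), ([0.2, 0.8], 0), ([0.45, 0.5], 0)]

print(clean_accuracy(w, b, data))            # every point is correct when unperturbed
print(robust_accuracy(w, b, data, eps=0.1))  # points near the boundary are lost
```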

Defences include adversarial training, certified defences (randomised smoothing), input pre-processing, and detection-then-reject schemes.
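
The prediction step of randomised smoothing can be sketched as a majority vote over Gaussian-noised copies of the input. The base classifier, the noise level, and the sample count below are assumptions for illustration. The sketch omits the statistical confidence bound that a real certificate requires, so it shows only how the smoothed prediction is formed.

```python
import random
from collections import Counter


def base_classifier(x):
    """Hypothetical base model: class 1 if x[0] > x[1], else class 0."""
    return 1 if x[0] > x[1] else 0


def smoothed_predict(classify, x, sigma, n_samples, seed=0):
    """Majority vote of `classify` over copies of x with N(0, sigma^2) noise added."""
    rng = random.Random(seed)  # seeded so that repeated runs agree
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classify(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


label, vote_share = smoothed_predict(base_classifier, [2.0, 0.0], sigma=0.25, n_samples=200)
print(label, vote_share)
```

A larger `sigma` trades clean accuracy for stability of the vote, which is the main design choice when tuning this kind of defence.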

Interactive figure: an adversarial example in which a tiny perturbation flips the prediction. After an imperceptible pattern is added to a panda image, the network sees a gibbon with high confidence.

Discussed in:

Chapter 12: Sequence Models, Robustness
