Add an imperceptible pattern to a panda image; the network now sees a gibbon with high confidence.
From Chapter 16: Ethics & Safety
Glossary: adversarial example, adversarial attack
Transcript
A photograph of a panda. The neural network classifies it correctly with 57 percent confidence.
Generate a small, carefully crafted perturbation. Each pixel changes by at most one part in a hundred. To human eyes, the image looks identical.
Add the perturbation to the original image. Pass the result through the network.
The network now classifies it as a gibbon, with 99 percent confidence.
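In code, the whole demonstration is a few lines. A minimal PyTorch sketch, assuming a pretrained ResNet and a random stand-in for the panda photograph; class index 388 is ImageNet's giant panda:

```python
# A minimal sketch of the demonstration above. The pretrained ResNet and
# the random stand-in image are assumptions; any differentiable
# classifier works the same way.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2").eval()

image = torch.rand(1, 3, 224, 224)   # stand-in for the panda photograph
label = torch.tensor([388])          # ImageNet class 388: giant panda
epsilon = 0.01                       # at most one part in a hundred per pixel

# Gradient of the loss with respect to the input pixels.
image.requires_grad_(True)
loss = F.cross_entropy(model(image), label)
loss.backward()

# One small step that increases the loss, clipped to valid pixel values.
adversarial = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

probs = model(adversarial).softmax(dim=1)
print(probs.max().item(), probs.argmax().item())   # new confidence, new class
```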
How? The perturbation follows the sign of the gradient of the loss with respect to the input, scaled to a tiny step. It walks uphill on the loss, against the network's confidence in the panda class. Even tiny per-pixel changes accumulate across the hundreds of thousands of input dimensions in an image.
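Written out, with θ the weights, x the image, y the true label, J the loss, and ε the per-pixel budget:

```latex
x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\bigl(\nabla_x J(\theta, x, y)\bigr)
```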
The demonstration above is the fast gradient sign method, which takes this step once; its iterated cousin, the projected gradient descent attack, repeats the step and projects back into a small ball around the original image. Given white-box access to a model's gradients, both reliably fool standard classifiers.
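A minimal PGD sketch, reusing the model, image, and label from the snippet above; the step size and iteration count are illustrative choices:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, image, label, epsilon=0.01, alpha=0.0025, steps=10):
    original = image.detach()
    adv = original.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), label)
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                             # one FGSM-style step
            adv = original + (adv - original).clamp(-epsilon, epsilon)  # project into the ball
            adv = adv.clamp(0, 1)                                       # stay a valid image
    return adv.detach()
```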
Adversarial examples also transfer between models: a perturbation crafted on one network often fools many others, which enables black-box attacks that never see the target's gradients. Adversarial examples are not idiosyncratic quirks of one model; they exploit blind spots that models share.
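A transfer test is short to sketch: craft on one model, check on another. The specific pair of architectures here is an assumption; the point is only that they differ.

```python
from torchvision import models

surrogate = models.resnet50(weights="IMAGENET1K_V2").eval()
victim = models.vgg16(weights="IMAGENET1K_V1").eval()

adv = pgd_attack(surrogate, image, label)       # white-box craft on the surrogate
fooled = victim(adv).argmax(dim=1) != label     # black-box check on the victim
print("transferred:", fooled.item())
```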
The defence: adversarial training. Augment the training set with perturbed copies of each example. The network becomes more robust, but rarely matches a standard classifier's accuracy on clean inputs. There is a fundamental robustness-accuracy trade-off.
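One training step of that augmentation might look like the following sketch; the optimiser and the batch of images and labels are assumptions, and the attack is the PGD sketch above:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels):
    model.eval()                                   # fix batch-norm stats while attacking
    adv_images = pgd_attack(model, images, labels)
    model.train()

    optimizer.zero_grad()
    # Train on clean and perturbed copies together.
    batch = torch.cat([images, adv_images])
    targets = torch.cat([labels, labels])
    loss = F.cross_entropy(model(batch), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```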
Why this matters. Self-driving cars must classify stop signs correctly even when stickers are added. Medical models must not be tricked by imaging artefacts. Adversarial examples reveal that high benchmark accuracy does not imply human-level perception.
The frontier. Provable robustness guarantees, certified bounds, randomised smoothing, and the broader study of adversarial robustness.
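As a taste of that last direction, randomised smoothing classifies many Gaussian-noised copies of the input and takes a majority vote. A real certificate (Cohen et al., 2019) adds a statistical test on the votes, which this sketch omits; sigma and n are illustrative.

```python
import torch

def smoothed_predict(model, image, sigma=0.25, n=100):
    with torch.no_grad():
        # n noised copies of the single input image, classified in one batch.
        noisy = image + sigma * torch.randn(n, *image.shape[1:])
        votes = model(noisy).argmax(dim=1)
        return votes.mode().values.item()          # the majority class
```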