Glossary

Adversarial Example

An Adversarial Example is an input deliberately crafted with small, often imperceptible perturbations that cause a machine learning model to produce incorrect outputs, often with high confidence. The phenomenon was first highlighted by Szegedy et al. (2013), who showed that image classifiers could be fooled by perturbations so subtle that the modified images were indistinguishable from the originals to human observers. A stop sign with a few carefully placed stickers might be classified as a 45 mph speed-limit sign; a panda with added noise becomes a gibbon.
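The classic gradient-based attack behind the panda-to-gibbon example, the fast gradient sign method (FGSM), can be sketched on a toy logistic classifier. The model, data, and function names below are illustrative assumptions, not taken from any cited paper:

```python
import numpy as np

# FGSM sketch on a toy logistic classifier: nudge every input coordinate
# by epsilon in the direction that increases the loss. Model and data
# here are illustrative, not from the glossary.
def fgsm(x, w, b, y, epsilon):
    """Return x + epsilon * sign(dL/dx) for the logistic cross-entropy loss."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # predicted P(class 1)
    grad_x = (p - y) * w                    # gradient of the loss w.r.t. x
    return x + epsilon * np.sign(grad_x)

rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.0
x, y = rng.normal(size=20), 1.0             # input with true label 1

x_adv = fgsm(x, w, b, y, epsilon=0.1)
# each coordinate moves by at most epsilon, yet the logit shifts by
# epsilon * sum(|w_i|), pushing the prediction toward the wrong class
print(np.abs(x_adv - x).max())              # L-infinity size of perturbation
print(w @ x + b, "->", w @ x_adv + b)       # logit before vs. after attack
```

The key point the sketch illustrates is the mismatch of norms: a perturbation that is tiny per coordinate (small L-infinity norm) accumulates across many dimensions into a large change in the model's output.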

Adversarial examples pose serious safety and security concerns. They can be made physically realisable—adversarial stickers on road signs, adversarial patches on glasses, adversarial patterns on clothing that defeat pedestrian detection. They often transfer between models, meaning an attacker who can query one model can craft examples that fool another independently trained model. This suggests adversarial vulnerability is not an artefact of any particular architecture but a deep property of high-dimensional classifiers.

Defences include adversarial training (augmenting training data with adversarial examples), certified defences based on randomised smoothing or Lipschitz constraints, and input preprocessing. None has proved robust against adaptive attacks, in which the attacker tailors the attack to the defence. Adversarial robustness remains an active research area, raising both practical concerns (can we deploy AI in security-critical settings?) and theoretical questions (what does the vulnerability tell us about how neural networks actually generalise?). The contrast between human and machine perception of adversarial examples highlights that, despite superficial similarities, deep networks process inputs in ways fundamentally unlike biological vision.
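Adversarial training, the first defence listed, folds the attack into the training loop: at each step the current model is attacked, and the resulting adversarial inputs are trained on alongside the clean ones. A minimal sketch on a toy logistic model, using an FGSM-style inner attack (all names and parameters here are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adv_train_step(w, x, y, lr=0.5, epsilon=0.1):
    """One adversarial-training update on a toy logistic model (sketch)."""
    # inner step: craft an FGSM example against the *current* weights
    p = sigmoid(w @ x)
    x_adv = x + epsilon * np.sign((p - y) * w)
    # outer step: gradient descent on clean + adversarial cross-entropy
    g_clean = (sigmoid(w @ x) - y) * x
    g_adv = (sigmoid(w @ x_adv) - y) * x_adv
    return w - lr * (g_clean + g_adv)

# tiny demo: repeated updates on one labelled point drive the clean
# prediction toward the true label while enlarging the decision margin
rng = np.random.default_rng(1)
w = rng.normal(size=5) * 0.01
x, y = rng.normal(size=5), 1.0
for _ in range(50):
    w = adv_train_step(w, x, y)
```

Because the attack is regenerated against the current weights every step, the model is trained on a moving target; defences evaluated only against fixed, pre-generated adversarial examples tend to overstate their robustness.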

Related terms: AI Safety

Also defined in: Textbook of AI