Tianyu Gu, Brendan Dolan-Gavitt, & Siddharth Garg (2017)
arXiv:1708.06733.
URL: https://arxiv.org/abs/1708.06733
Abstract. The canonical demonstration of backdoor attacks in deep learning. Shows that an attacker who can poison a small fraction of the training data (e.g., MNIST digits or street-sign images) by stamping a small trigger pattern (such as a yellow square) onto the images and relabelling them with a target class can produce a network that behaves normally on clean inputs and reliably misclassifies any input containing the trigger. The trigger is arbitrary but inconspicuous, and the backdoor survives subsequent fine-tuning. The paper crystallised the supply-chain risk: any model trained on data of unverified provenance is a candidate attack vector.
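To make the poisoning step concrete, here is a minimal sketch of BadNets-style trigger injection, assuming images are stored as HxWx3 uint8 NumPy arrays; the trigger size, its corner placement, and the `poison_fraction` default are illustrative choices, not the paper's exact parameters.

```python
import numpy as np

def add_trigger(image: np.ndarray, size: int = 4) -> np.ndarray:
    """Stamp a small yellow square in the bottom-right corner (illustrative trigger)."""
    patched = image.copy()
    patched[-size:, -size:] = (255, 255, 0)  # yellow in RGB
    return patched

def poison_dataset(images: np.ndarray,
                   labels: np.ndarray,
                   target_label: int,
                   poison_fraction: float = 0.01,
                   seed: int = 0):
    """Return a copy of (images, labels) with a small random subset
    trigger-stamped and relabelled to the attacker's target class."""
    rng = np.random.default_rng(seed)
    n_poison = int(len(images) * poison_fraction)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    poisoned_images = images.copy()
    poisoned_labels = labels.copy()
    for i in idx:
        poisoned_images[i] = add_trigger(images[i])
        poisoned_labels[i] = target_label
    return poisoned_images, poisoned_labels
```

A model trained normally on the returned dataset learns the clean task plus the spurious trigger-to-target association, which is the behaviour the paper demonstrates.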
Tags: adversarial safety poisoning