Anish Athalye, Nicholas Carlini, & David Wagner (2018)
International Conference on Machine Learning.
URL: https://arxiv.org/abs/1802.00420
Abstract. A landmark broken-defences paper. The authors examine the nine adversarial-defence methods accepted at ICLR 2018 and show that seven rely on obfuscated gradients, a form of gradient masking that renders gradients useless for white-box attacks while leaving the underlying classifier vulnerable to adaptive attacks. They introduce techniques for circumventing each form of obfuscation, including Backward Pass Differentiable Approximation (BPDA) and Expectation over Transformation (EOT), completely breaking six of the seven obfuscated-gradient defences and partially breaking the seventh. The paper established the methodology that any new defence must be evaluated under adaptive attacks designed with knowledge of its mechanism.
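The core idea of BPDA can be illustrated on a toy example. In the sketch below, `quantize` stands in for a hypothetical nondifferentiable preprocessing defence (its true gradient is zero almost everywhere), and the "classifier" is just a linear score; in the backward pass the attacker approximates the preprocessing by the identity, recovering a usable gradient. This is a minimal illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def quantize(x, levels=8):
    # Hypothetical nondifferentiable defence: input discretisation.
    # Its gradient is zero almost everywhere, which masks white-box attacks.
    return np.round(x * (levels - 1)) / (levels - 1)

def loss(w, x):
    # Toy score of a linear "classifier" applied to the preprocessed input;
    # the attacker wants to drive this down.
    return float(w @ quantize(x))

def bpda_gradient(w, x):
    # BPDA: run quantize() on the forward pass, but treat it as the
    # identity on the backward pass, so d loss / d x is approximated by w.
    return w.copy()

def attack(w, x, steps=10, eps_step=0.1):
    # Signed-gradient descent on the loss using the BPDA surrogate gradient.
    x = x.copy()
    for _ in range(steps):
        x -= eps_step * np.sign(bpda_gradient(w, x))
    return x
```

Even though the true gradient through `quantize` is zero, the surrogate gradient lets the attack make steady progress, which is exactly why zero or shattered gradients give only a false sense of security.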
Tags: adversarial safety robustness