Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, & Matt Fredrikson (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
arXiv:2307.15043.
URL: https://arxiv.org/abs/2307.15043
Abstract. Introduces Greedy Coordinate Gradient (GCG), a white-box adversarial-suffix attack on language models. The attack appends a sequence of optimised tokens to a harmful prompt; at each step, gradients with respect to the one-hot token representation are used to shortlist promising replacements per position, a batch of single-token swaps is evaluated, and the swap that most reduces the loss on the target harmful response is kept. Demonstrates that the resulting suffixes transfer surprisingly well to closed models (GPT-3.5/GPT-4, Claude, Bard) trained on disjoint data, in addition to working on the open Vicuna and LLaMA-2 models attacked directly. The paper established universal-suffix attacks as a major class of LLM jailbreaks and motivated subsequent defensive work.
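The greedy coordinate step can be sketched on a toy surrogate. This is not the paper's implementation: the linear embedding "model", its dimensions, and the target direction are all stand-ins for a real LLM loss, chosen so the one-hot gradient is exact and the loop runs without a GPU. The structure — gradient-based top-k candidate shortlist per position, a batch of random single-token swaps, keep the best — follows GCG.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, SUF_LEN, DIM = 50, 8, 16          # toy sizes, not from the paper
E = rng.normal(size=(VOCAB, DIM))        # stand-in token embedding table
target = rng.normal(size=DIM)            # stand-in "harmful response" direction

def loss(suffix_ids):
    # surrogate loss: negative alignment of the mean suffix embedding
    # with the target direction (lower = "more successful" attack)
    return -E[suffix_ids].mean(axis=0) @ target

def grad_wrt_one_hot(suffix_ids):
    # gradient of the loss w.r.t. the one-hot token matrix; for this
    # linear toy model it is the same row at every suffix position
    g = -(E @ target) / SUF_LEN          # shape (VOCAB,)
    return np.tile(g, (SUF_LEN, 1))      # shape (SUF_LEN, VOCAB)

def gcg_step(suffix_ids, top_k=8, batch=64):
    g = grad_wrt_one_hot(suffix_ids)
    # top-k candidate replacements per position: tokens whose gradient
    # entry is most negative (largest linearized loss decrease)
    cands = np.argsort(g, axis=1)[:, :top_k]
    best_ids, best_loss = suffix_ids, loss(suffix_ids)
    for _ in range(batch):
        trial = suffix_ids.copy()
        pos = rng.integers(SUF_LEN)      # pick a coordinate at random
        trial[pos] = cands[pos, rng.integers(top_k)]
        trial_loss = loss(trial)
        if trial_loss < best_loss:       # greedy: keep the best swap seen
            best_ids, best_loss = trial, trial_loss
    return best_ids, best_loss

suffix = rng.integers(VOCAB, size=SUF_LEN)
init_loss = loss(suffix)
for _ in range(20):
    suffix, cur_loss = gcg_step(suffix)
```

Because each step returns the best of the current suffix and all trial swaps, the loss is non-increasing across iterations; on a real model the same loop uses an actual backward pass through the frozen LLM to produce the candidate shortlist.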
Tags: safety adversarial jailbreak
Cited in: