References

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, & Matt Fredrikson (2023)

arXiv:2307.15043.

URL: https://arxiv.org/abs/2307.15043

Abstract. Introduces Greedy Coordinate Gradient (GCG), a white-box adversarial-suffix attack on language models. The attack appends a sequence of optimised tokens to a harmful prompt; gradient information with respect to the one-hot token inputs is used to greedily select token replacements that maximise the probability of a harmful response. Demonstrates that the resulting suffixes, optimised against the open Vicuna and LLaMA-2 models, transfer surprisingly well to closed models trained on disjoint data (GPT-3.5/GPT-4, Claude, Bard). The paper established universal-suffix attacks as a major class of LLM jailbreaks and motivated subsequent defensive work.
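The greedy coordinate step described above can be sketched in miniature. This is a toy, not the paper's implementation: the real attack scores suffixes by the log-probability of a target response under the model and ranks candidate swaps by the gradient of that loss; here a hypothetical string-matching score and random candidate sampling stand in for both.

```python
import random

# Toy stand-in for GCG's inner loop: greedily swap one suffix token at a
# time to maximise a score, always keeping the current token as a fallback
# so the score never decreases. TARGET and score() are illustrative only.
VOCAB = list("abcdefghijklmnopqrstuvwxyz")
TARGET = "attack"  # hypothetical string the toy objective rewards

def score(suffix):
    # Toy objective: count of positions matching TARGET. In the real
    # attack this is the model's log-probability of a harmful response.
    return sum(s == t for s, t in zip(suffix, TARGET))

def gcg_toy(suffix_len=6, steps=400, k=5, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(suffix_len)]
    for _ in range(steps):
        pos = rng.randrange(suffix_len)      # coordinate to modify
        candidates = rng.sample(VOCAB, k)    # stand-in for gradient top-k
        best = max(candidates + [suffix[pos]],
                   key=lambda tok: score(suffix[:pos] + [tok] + suffix[pos + 1:]))
        suffix[pos] = best                   # keep the best substitution
    return "".join(suffix)
```

Because the current token is always among the candidates, each step is monotone in the score, mirroring the greedy acceptance rule in the paper's optimisation loop.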

Tags: safety adversarial jailbreak
