Alexander Wei, Nika Haghtalab, & Jacob Steinhardt (2023)
Advances in Neural Information Processing Systems 36.
URL: https://arxiv.org/abs/2307.02483
Abstract. A taxonomy paper on language-model jailbreaks. The authors categorise published attacks by mechanism , competing objectives (the model is asked to roleplay a character that does not have safety constraints), mismatched generalisation (the safety training did not cover the attack distribution), persona injection, payload smuggling and others. Demonstrates each category against GPT-4 and Claude. The taxonomy has been widely adopted in subsequent jailbreak literature and is the standard reference for the structure of the jailbreak landscape.
Tags: safety adversarial jailbreak
Cited in: