Alexander Wei, Nika Haghtalab, & Jacob Steinhardt (2023), References, Textbook of AI

Alexander Wei, Nika Haghtalab, & Jacob Steinhardt (2023)

Advances in Neural Information Processing Systems 36.

URL: https://arxiv.org/abs/2307.02483

Abstract. A taxonomy paper on language-model jailbreaks. The authors categorise published attacks by mechanism , competing objectives (the model is asked to roleplay a character that does not have safety constraints), mismatched generalisation (the safety training did not cover the attack distribution), persona injection, payload smuggling and others. Demonstrates each category against GPT-4 and Claude. The taxonomy has been widely adopted in subsequent jailbreak literature and is the standard reference for the structure of the jailbreak landscape.

Tags: safety adversarial jailbreak

Cited in:

Chapter 16: Ethics & Safety

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).

Jailbroken: How Does LLM Safety Training Fail?