Mark Russinovich, Ahmed Salem, & Ronen Eldan (2024), References, Textbook of AI


arXiv:2404.01833.

URL: https://arxiv.org/abs/2404.01833

Abstract. Introduces Crescendo, a multi-turn jailbreak. Rather than trying to elicit a harmful response in a single turn, which is typically refused, Crescendo opens with an entirely benign question on the surrounding topic and then escalates incrementally over a few turns, so that each turn appears to be a small extension of the previous one. The model, anchored to its own earlier cooperative responses, eventually produces content it would have refused had it been requested directly in the first turn. The authors report that Crescendo is highly effective across frontier closed and open models, which indicates that safety training aimed at single-turn requests does not reliably transfer to multi-turn dialogue.

Tags: safety adversarial jailbreak

Cited in:

Chapter 16: Ethics & Safety


Title: Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack