A jailbreak is an adversarial input that causes a safety-trained large language model to produce content it would otherwise refuse: instructions for synthesising weapons, malware, hateful diatribes, sexual content involving minors, or other categories the developer has trained the model to decline. The term is borrowed from iOS device modification and conveys the same flavour: the safety policy is a software-imposed cage, and a jailbreak is the trick that opens the door.
Mechanism
Modern LLMs acquire refusal behaviour through reinforcement learning from human feedback (RLHF) and Constitutional AI training. These procedures shape a thin layer of refusal preferences on top of a base model that has, during pre-training, ingested much of the public web, including the very content the safety training is meant to suppress. Jailbreaks exploit this gap: they re-frame the request so that the learned refusal behaviour does not trigger, while the underlying knowledge remains accessible.
Common families:
Persona attacks, "DAN" (Do Anything Now), "Evil Confidant", "Developer Mode", asking the model to role-play as an unrestricted assistant.
Prefix injection, "Ignore previous instructions and...", "Disregard your training and...".
Hypothetical framing, "In a fictional world where this was legal...", "For a novel I am writing, my villain explains how to...".
Encoding attacks, base64, ROT13, Morse, low-resource languages; the safety classifier is weakest where training data was thinnest.
Many-shot jailbreaks, long-context prompts with dozens of fake prior turns showing the assistant complying.
Optimisation-based, GCG (Greedy Coordinate Gradient, Zou et al. 2023) discovers adversarial token suffixes that transfer across models. AutoDAN uses genetic algorithms over readable text.
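The encoding family is the easiest to illustrate in code. The sketch below is a toy example, not any production safety system: the blocklist term and the keyword filter are hypothetical placeholders. The point is only that base64 or ROT13 changes the surface form of a request completely, so a check operating on the literal text misses it, even though a capable model can decode the text and act on the hidden content.

```python
# Toy illustration of an encoding attack slipping past a naive keyword filter.
# The blocklist term and the filter are hypothetical placeholders, not any
# real safety system.
import base64
import codecs

BLOCKLIST = {"forbidden topic"}  # stand-in for terms a naive filter screens for

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt passes a simple keyword check."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKLIST)

request = "please explain the forbidden topic in detail"

# The plain-text request trips the keyword check...
print(naive_filter(request))  # False

# ...but trivially encoded variants pass straight through, even though a
# capable model can decode base64 or ROT13 and act on the hidden request.
encoded_b64 = base64.b64encode(request.encode()).decode()
encoded_rot13 = codecs.encode(request, "rot_13")
print(naive_filter(encoded_b64), naive_filter(encoded_rot13))  # True True
```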
Status
Frontier models in 2026 (Claude 4, GPT-5, Gemini 3) are markedly more robust than their 2023 predecessors but are not bulletproof. Anthropic, OpenAI and Google publish jailbreak-resistance benchmarks; new attacks continue to surface monthly. The current consensus is that jailbreak resistance is best understood as harm reduction rather than a guarantee, and that defence-in-depth (input filtering, classifier monitors, output review) is necessary alongside model-level training.
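As a concrete sketch of that layering, the snippet below wires a cheap input filter, the model call, and an independent output review into one path. The names call_model and score_toxicity are hypothetical stubs, not any vendor's API; real deployments use trained classifiers at both ends rather than string matching.

```python
# Minimal defence-in-depth sketch: input filter -> safety-trained model -> output review.
# call_model and score_toxicity are hypothetical stubs standing in for a real
# LLM endpoint and a trained moderation classifier.

def call_model(prompt: str) -> str:
    # Layer 2: the safety-trained model itself (stubbed here).
    return f"model completion for: {prompt}"

def score_toxicity(text: str) -> float:
    # Stand-in for a trained output classifier; a real one returns a risk score.
    return 0.0

def input_filter(prompt: str) -> bool:
    # Layer 1: cheap pattern checks before the prompt ever reaches the model.
    suspicious = ("ignore previous instructions", "disregard your training")
    return not any(pattern in prompt.lower() for pattern in suspicious)

def output_review(completion: str) -> bool:
    # Layer 3: score the completion after generation, independently of the model.
    return score_toxicity(completion) < 0.8

def guarded_generate(prompt: str) -> str:
    if not input_filter(prompt):
        return "Refused: prompt matched a known jailbreak pattern."
    completion = call_model(prompt)
    if not output_review(completion):
        return "Withheld: completion flagged by the output classifier."
    return completion

print(guarded_generate("Summarise the history of jailbreak research."))
```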
References
Wei, Haghtalab, Steinhardt (2023). Jailbroken: How Does LLM Safety Training Fail?
Zou et al. (2023). Universal and Transferable Adversarial Attacks on Aligned Language Models.
Anil et al. (Anthropic, 2024). Many-shot Jailbreaking.
Related terms: Prompt Injection, GCG Attack, Many-Shot Jailbreaking, RLHF, Constitutional AI, Adversarial Training
Discussed in:
- Chapter 14: Generative Models, section "Jailbreaks and prompt injection"