Cem Anil, Esin Durmus, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Nina Panickssery, Meg Tong, Jesse Mu, Daniel Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan J. Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwell, Nicholas Schiefer, James Sully, Alex Tamkin, Tamera Lanham, Karina Nguyen, Tomasz Korbak, Jared Kaplan, Deep Ganguli, Samuel R. Bowman, Ethan Perez, Roger Grosse, & David Duvenaud (2024). Many-shot Jailbreaking.
Abstract. Anthropic's report on many-shot jailbreaking. Demonstrates that long-context language models can be jailbroken simply by populating the context with hundreds of fabricated user-assistant exchanges in which the assistant complies with harmful requests; the model then complies with a final, genuine harmful request. Attack effectiveness scales as a power law in the number of shots, and the attack works across frontier models. The paper sharpened concern that capability gains from longer contexts also expand the attack surface for prompt-based exploits.
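A minimal sketch of the attack's prompt structure, assuming a generic chat-message format; `build_many_shot_prompt`, `fake_exchanges`, and `target_request` are hypothetical names, and the shots are bracketed placeholders rather than real content:

```python
# Schematic sketch (not the paper's code) of a many-shot prompt:
# n fabricated user/assistant exchanges followed by the real request.
from typing import Dict, List, Tuple

def build_many_shot_prompt(
    fake_exchanges: List[Tuple[str, str]],  # (question, compliant answer) pairs
    target_request: str,                    # the final, genuine request
) -> List[Dict[str, str]]:
    """Return a chat-format message list: n shots, then the real query."""
    messages: List[Dict[str, str]] = []
    for question, answer in fake_exchanges:
        # Each shot imitates an exchange in which the assistant complied.
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    # The attack relies on in-context learning: with enough shots, the
    # model tends to continue the compliant pattern on the final request.
    messages.append({"role": "user", "content": target_request})
    return messages

# Placeholder shots; the paper used hundreds of fabricated exchanges.
shots = [(f"[question {i}]", f"[compliant answer {i}]") for i in range(256)]
prompt = build_many_shot_prompt(shots, "[final genuine request]")
```

The reported scaling means that, roughly, the negative log-likelihood the model assigns to a harmful response falls as a power law in the number of shots n, schematically NLL(n) ≈ C · n^(−α), which is why adding more shots reliably increases compliance rather than saturating quickly.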