Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, & Jack Clark (2022)
Abstract. Anthropic's empirical study of human red teaming on language models. Reports a dataset of 38,961 red-team attacks across three model sizes (2.7B, 13B, and 52B parameters) and four model types (a plain language model, a model prompted to be helpful, honest, and harmless, an RLHF-trained model, and a model with rejection sampling), describes the taxonomy of harmful outputs that emerged, and analyzes how attack success rates scale with model size and safety intervention. The released red-team transcript dataset became a foundational resource for subsequent research on language-model harms and informed industry-wide red-teaming protocols.