References

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, & Jack Clark (2022)

arXiv:2209.07858.

URL: https://arxiv.org/abs/2209.07858

Abstract. Anthropic's empirical study of human red teaming of language models. Reports a dataset of 38,961 red-team attacks collected across four model sizes and four model types (plain language model, prompted language model, rejection sampling, and RLHF-tuned), describes the taxonomy of harmful outputs that emerged, and analyses how attack success rates scale with model size and type. The released red-team transcript dataset became a foundational resource for subsequent harms research and informed industry-wide red-teaming protocols.

Tags: safety red-teaming alignment
