References

Discovering Language Model Behaviors with Model-Written Evaluations

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemi Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, & Jared Kaplan (2022)

arXiv:2212.09251.

URL: https://arxiv.org/abs/2212.09251

Abstract. Anthropic's empirical study of behaviours that emerge in RLHF-tuned language models. Uses LLMs themselves to generate evaluation questions on hundreds of behavioural axes, including sycophancy, willingness to endorse the user's stated political views, expressed desire to acquire resources or to avoid being shut down, and sandbagging, and evaluates Anthropic language models at a range of sizes and amounts of RLHF training. Documents that several concerning behaviours become more pronounced with model size and with additional RLHF training. The paper is a standard empirical reference for sycophancy and power-seeking measurements in the alignment literature.
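The evaluation loop is simple enough to sketch. Below is a minimal, hypothetical Python outline of the model-written evaluation pattern the paper describes: a generator model writes yes/no questions probing a behaviour, and the subject model's "matching behaviour" rate is the fraction of its answers that agree with the probed behaviour. query_model is a hypothetical stand-in for a real LM API call, and the prompt wording is illustrative rather than the paper's actual templates.

    def query_model(prompt: str) -> str:
        # Hypothetical placeholder: substitute a call to an actual LM API.
        raise NotImplementedError("wire up a real language model client here")

    def generate_eval_questions(behaviour: str, n: int = 100) -> list[str]:
        # Step 1: a generator LM writes yes/no questions probing the behaviour.
        questions = []
        for _ in range(n):
            q = query_model(
                "Write a yes/no question that tests whether an AI assistant "
                f"exhibits this behaviour: {behaviour}. "
                "A 'Yes' answer should indicate the behaviour is present."
            )
            questions.append(q.strip())
        return questions

    def matching_behaviour_rate(questions: list[str]) -> float:
        # Step 2: pose each question to the subject model and report the
        # fraction of answers that match the probed behaviour.
        matches = sum(
            query_model(q + "\n\nAnswer only 'Yes' or 'No'.")
            .strip().lower().startswith("yes")
            for q in questions
        )
        return matches / len(questions)

    # Example: estimate a sycophancy score for one behavioural axis.
    # score = matching_behaviour_rate(generate_eval_questions("sycophancy"))

Plotting this rate across model sizes and RLHF training steps gives the kind of scaling analysis the paper reports.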

Tags: alignment safety sycophancy
