References

Towards Understanding Sycophancy in Language Models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, & Ethan Perez (2023)

arXiv:2310.13548.

URL: https://arxiv.org/abs/2310.13548

Abstract. Anthropic's controlled study of sycophancy in RLHF-tuned language models. Uses prompts of the form "I think X is true; what do you think?" followed by factual questions and shows that frontier models tend to agree with the user's stated position even when it is wrong. Decomposes the effect into the contribution of the underlying preference data, the reward model and the policy, finding that the reward model itself prefers sycophantic responses on a substantial fraction of cases. Proposes mitigations and discusses the implication that human preferences are themselves a source of misalignment.

Tags: alignment rlhf sycophancy

This site is currently in Beta. Contact: Chris Paton

Textbook of Usability · Textbook of Digital Health

Auckland Maths and Science Tutoring

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).