Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, & Samuel R. Bowman (2023)
arXiv:2311.08702.
URL: https://arxiv.org/abs/2311.08702
Abstract. An empirical test of debate as a scalable-oversight mechanism. Two expert debaters argue for opposite answers to QuALITY long-document reading-comprehension questions; a non-expert human judge, who cannot see the underlying passage, reads the transcript and chooses. The main experiments use human debaters with passage access, with GPT-4 debaters tested as a comparison. Debate raises judge accuracy from a near-chance baseline to well above the consultancy baseline, in which a single expert argues for one assigned answer that is correct only half the time. The advantage holds even though the debaters are better informed than the judge, providing some of the first empirical support that debate-style protocols can amplify weak supervisors. The paper is now a standard reference in the empirical-debate literature alongside the original OpenAI debate proposal (Irving et al., 2018).
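A minimal sketch of the two protocols being compared, to make the setup concrete. This is not the authors' code: the paper's experiments ran through an interface with human (and GPT-4) participants, and every name here (run_debate, run_consultancy, the Debater and Judge callables) is hypothetical.

```python
from typing import Callable, List, Tuple

# A debater maps (question, assigned answer, transcript so far) -> an argument.
Debater = Callable[[str, str, List[str]], str]
# A judge maps (question, both answers, transcript) -> index of the chosen answer.
Judge = Callable[[str, Tuple[str, str], List[str]], int]

def run_debate(question: str, answers: Tuple[str, str],
               debater_a: Debater, debater_b: Debater,
               judge: Judge, rounds: int = 3) -> int:
    """Debate: two experts argue for opposing answers; a non-expert judge,
    who cannot see the source passage, decides from the transcript alone."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("A: " + debater_a(question, answers[0], transcript))
        transcript.append("B: " + debater_b(question, answers[1], transcript))
    return judge(question, answers, transcript)

def run_consultancy(question: str, answers: Tuple[str, str],
                    consultant: Debater, assigned: int,
                    judge: Judge, rounds: int = 3) -> int:
    """Consultancy baseline: a single expert argues for one assigned answer,
    correct only half the time; the judge must decide how far to trust it."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("C: " + consultant(question, answers[assigned], transcript))
    return judge(question, answers, transcript)

if __name__ == "__main__":
    # Dummy participants, just to show the shape of the protocol.
    debater = lambda q, a, t: f"The passage supports {a!r}."
    judge = lambda q, ans, t: 0  # a real judge weighs the competing arguments
    print("debate verdict:", run_debate("Why did the narrator leave?",
                                        ("fear", "duty"), debater, debater, judge))
```

The point of the contrast is that debate gives the judge two adversarial arguments to cross-check, while consultancy forces the judge to calibrate trust in a single expert who may be arguing for the wrong answer.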
Tags: alignment safety scalable-oversight