Julian Michael, Salsabila Mahdi, David Rein, Jackson Petty, Julien Dirani, Vishakh Padmakumar, & Samuel R. Bowman (2023)
arXiv:2311.08702.
URL: https://arxiv.org/abs/2311.08702
Abstract. An empirical test of debate as a scalable-oversight mechanism. Two expert debaters argue for opposite answers to QuALITY long-document reading-comprehension questions; a non-expert human judge, who cannot see the underlying passage, reads the transcript and chooses. The main experiments use human debaters with passage access, with GPT-4 debaters tested as a comparison. Debate raises judge accuracy from a near-chance baseline to well above the consultancy baseline, in which a single expert argues for one assigned answer that is correct only half the time. The advantage holds even though the debaters are better informed than the judge, providing some of the first empirical support that debate-style protocols can amplify weak supervisors. The paper is now a standard reference in the empirical-debate literature alongside the original OpenAI debate proposal (Irving et al., 2018).
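A minimal sketch of the two protocols being compared, to make the setup concrete. This is not the authors' code: the paper's experiments ran through an interface with human (and GPT-4) participants, and every name here (run_debate, run_consultancy, the Debater and Judge callables) is hypothetical.

```python
from typing import Callable, List, Tuple

# A debater maps (question, assigned answer, transcript so far) -> an argument.
Debater = Callable[[str, str, List[str]], str]
# A judge maps (question, both answers, transcript) -> index of the chosen answer.
Judge = Callable[[str, Tuple[str, str], List[str]], int]

def run_debate(question: str, answers: Tuple[str, str],
               debater_a: Debater, debater_b: Debater,
               judge: Judge, rounds: int = 3) -> int:
    """Debate: two experts argue for opposing answers; a non-expert judge,
    who cannot see the source passage, decides from the transcript alone."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("A: " + debater_a(question, answers[0], transcript))
        transcript.append("B: " + debater_b(question, answers[1], transcript))
    return judge(question, answers, transcript)

def run_consultancy(question: str, answers: Tuple[str, str],
                    consultant: Debater, assigned: int,
                    judge: Judge, rounds: int = 3) -> int:
    """Consultancy baseline: a single expert argues for one assigned answer,
    correct only half the time; the judge must decide how far to trust it."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append("C: " + consultant(question, answers[assigned], transcript))
    return judge(question, answers, transcript)

if __name__ == "__main__":
    # Dummy participants, just to show the shape of the protocol.
    debater = lambda q, a, t: f"The passage supports {a!r}."
    judge = lambda q, ans, t: 0  # a real judge weighs the competing arguments
    print("debate verdict:", run_debate("Why did the narrator leave?",
                                        ("fear", "duty"), debater, debater, judge))
```

The point of the contrast is that debate gives the judge two adversarial arguments to cross-check, while consultancy forces the judge to calibrate trust in a single expert who may be arguing for the wrong answer.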
Tags: alignment safety scalable-oversight