Geoffrey Irving, Paul Christiano, & Dario Amodei (2018). AI Safety via Debate.
arXiv:1805.00899.
URL: https://arxiv.org/abs/1805.00899
Abstract. Proposes AI safety via debate as a scalable-oversight protocol. Two AI agents are given the same question and assigned opposing positions; they argue back and forth, and a (relatively weak) human judge picks the winner. Under suitable assumptions about the judge's epistemic competence on individual exchanges, the equilibrium of the debate game is for both agents to argue for the true answer, at least on questions where verifying a claim is easier than generating one. The proposal inspired a substantial empirical follow-up literature on debate-style protocols and remains one of the standard candidate approaches to the problem of supervising superhuman agents.
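The protocol's core loop can be sketched as follows; this is an illustrative toy under assumed interfaces (the paper specifies the game, not any implementation, and the `Agent`/`Judge` callables here are hypothetical stand-ins):

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces for the debate game (not from the paper's code):
# an agent maps (question, transcript so far) to its next argument;
# a judge maps (question, full transcript) to the index of the winner (0 or 1).
Agent = Callable[[str, List[str]], str]
Judge = Callable[[str, List[str]], int]

def run_debate(question: str, agent_a: Agent, agent_b: Agent,
               judge: Judge, rounds: int = 3) -> Tuple[int, List[str]]:
    """Alternate arguments between two opposing agents, then let the
    (weak) judge pick a winner based on the whole transcript."""
    transcript: List[str] = []
    for _ in range(rounds):
        transcript.append(agent_a(question, transcript))  # A argues its assigned position
        transcript.append(agent_b(question, transcript))  # B argues the opposing position
    return judge(question, transcript), transcript

# Toy usage: scripted agents, and a judge that rewards the agent whose
# claim it can directly verify — the "verification is easier than
# generation" assumption in miniature.
honest = lambda q, t: "claim: 2+2=4"
dishonest = lambda q, t: "claim: 2+2=5"
winner, transcript = run_debate(
    "What is 2+2?", honest, dishonest,
    judge=lambda q, t: 0 if "2+2=4" in t[-2] else 1,
    rounds=1,
)
```

The point of the sketch is only the shape of the game: the judge never answers the question itself, it only adjudicates the exchange, which is what makes the protocol a candidate for oversight of agents stronger than the judge.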
Tags: alignment safety scalable-oversight