Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, & Jeffrey Wu (2023)
arXiv:2312.09390.
URL: https://arxiv.org/abs/2312.09390
Abstract. OpenAI superalignment-team paper. Frames the superhuman alignment problem analogically: fine-tune a strong student model (a GPT-4-level base model) on labels generated by a weaker supervisor (a GPT-2-level model), and ask whether the strong student can outperform its weak supervisor by generalising the underlying concept rather than merely imitating the weak labels. Reports partial success: simple methods recover roughly half the performance gap, suggesting weak-to-strong generalisation is a real phenomenon but that no single technique suffices. The paper inaugurated a research programme on supervising models more capable than their supervisors.
Tags: alignment safety scalable-oversight
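The paper's headline metric, performance gap recovered (PGR), normalises the student's gain over its weak supervisor by the gap to a ceiling model trained on ground truth. A minimal sketch of that metric (the function name and illustrative accuracies are my own, not from the paper):

```python
def performance_gap_recovered(weak_acc: float,
                              strong_ceiling_acc: float,
                              weak_to_strong_acc: float) -> float:
    """Fraction of the weak-to-strong performance gap the student recovers.

    PGR = 1 means the student matches a strong model trained on ground
    truth; PGR = 0 means it merely matches its weak supervisor.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak accuracy")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical accuracies, chosen to illustrate "roughly half the
# gap recovered"; not figures reported in the paper:
print(performance_gap_recovered(0.60, 0.90, 0.75))  # → 0.5
```

Reporting PGR rather than raw accuracy lets results be compared across tasks whose weak and ceiling accuracies differ.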