Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, & Jeffrey Wu (2023)
arXiv:2312.09390.
URL: https://arxiv.org/abs/2312.09390
Abstract. OpenAI superalignment-team paper. Frames the superhuman alignment problem analogically: fine-tune a strong student model (a GPT-4-level base model) on labels generated by a weaker supervisor (a GPT-2-level model), and ask whether the strong student can outperform its weak supervisor by generalising the underlying concept rather than merely imitating the weak labels. Reports partial success: simple methods recover roughly half the performance gap, suggesting weak-to-strong generalisation is a real phenomenon but that no single technique suffices. The paper inaugurated a research programme on supervising models more capable than their supervisors.
Tags: alignment safety scalable-oversight
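The paper's headline metric, performance gap recovered (PGR), normalises the student's gain over its weak supervisor by the gap to a ceiling model trained on ground truth. A minimal sketch of that metric (the function name and illustrative accuracies are my own, not from the paper):

```python
def performance_gap_recovered(weak_acc: float,
                              strong_ceiling_acc: float,
                              weak_to_strong_acc: float) -> float:
    """Fraction of the weak-to-strong performance gap the student recovers.

    PGR = 1 means the student matches a strong model trained on ground
    truth; PGR = 0 means it merely matches its weak supervisor.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak accuracy")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical accuracies, chosen to illustrate "roughly half the
# gap recovered"; not figures reported in the paper:
print(performance_gap_recovered(0.60, 0.90, 0.75))  # → 0.5
```

Reporting PGR rather than raw accuracy lets results be compared across tasks whose weak and ceiling accuracies differ.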