References

A General Theoretical Paradigm to Understand Learning from Human Preferences

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, & Rémi Munos (2023)

arXiv:2310.12036.

URL: https://arxiv.org/abs/2310.12036

Abstract. DeepMind's theoretical analysis of preference learning. Generalises RLHF and DPO under a single objective ($\Psi$PO), parameterised by a mapping $\Psi$ applied to preference probabilities, and identifies a failure mode in DPO: when preferences are deterministic, the optimal policy drives the implicit reward gap to infinity, the KL regularisation loses its effect, and the model overfits the preference data. Proposes IPO (Identity Preference Optimisation), which takes $\Psi$ to be the identity rather than DPO's logit mapping, yielding a squared-error loss that regresses the log-likelihood-ratio gap toward a finite target; this keeps the implicit reward bounded and avoids the saturation issue. IPO became one of the standard alternatives in the post-DPO preference-optimisation literature.
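As a minimal sketch of the contrast (the function name, argument names, and the use of summed sequence log-probabilities as inputs are illustrative assumptions, not from the paper), both losses act on the same implicit reward gap $h$: DPO applies a logistic loss to $\beta h$, while IPO regresses $h$ toward the finite target $1/(2\tau)$ with a squared error.

```python
import torch
import torch.nn.functional as F

def preference_losses(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, tau=0.1):
    """Per-batch DPO and IPO losses from summed sequence log-probabilities.

    pi_logp_*  : log pi(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference policy
    (w = preferred response, l = rejected response; all shape [batch])
    """
    # Implicit reward gap:
    # h = log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)]
    h = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)

    # DPO: logistic loss. With deterministic preferences the optimum
    # pushes h -> infinity, so the KL regularisation loses its grip.
    dpo = -F.logsigmoid(beta * h).mean()

    # IPO: squared loss regressing h toward 1/(2*tau). The target is
    # finite, so the implicit reward gap stays bounded.
    ipo = ((h - 1.0 / (2.0 * tau)) ** 2).mean()
    return dpo, ipo
```

Because the squared-error target is finite, the IPO gradient vanishes once the gap reaches $1/(2\tau)$, whereas DPO's logistic loss keeps rewarding ever-larger gaps on deterministic pairs.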

Tags: alignment rlhf preference-learning

