References

A General Theoretical Paradigm to Understand Learning from Human Preferences

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, & Rémi Munos (2023)

arXiv:2310.12036.

URL: https://arxiv.org/abs/2310.12036

Abstract. DeepMind's theoretical analysis of preference learning. Generalises RLHF and DPO under a single objective ($\Psi$PO), parameterised by a mapping $\Psi$ applied to preference probabilities, and identifies a failure mode in DPO: when preferences are deterministic, the optimal policy drives the implicit reward gap to infinity, the KL regularisation loses its effect, and the model overfits the preference data. Proposes IPO (Identity Preference Optimisation), which takes $\Psi$ to be the identity rather than DPO's logit mapping, yielding a squared-error loss that regresses the log-likelihood-ratio gap toward a finite target; this keeps the implicit reward bounded and avoids the saturation issue. IPO became one of the standard alternatives in the post-DPO preference-optimisation literature.
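As a minimal sketch of the contrast (the function name, argument names, and the use of summed sequence log-probabilities as inputs are illustrative assumptions, not from the paper), both losses act on the same implicit reward gap $h$: DPO applies a logistic loss to $\beta h$, while IPO regresses $h$ toward the finite target $1/(2\tau)$ with a squared error.

```python
import torch
import torch.nn.functional as F

def preference_losses(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l,
                      beta=0.1, tau=0.1):
    """Per-batch DPO and IPO losses from summed sequence log-probabilities.

    pi_logp_*  : log pi(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference policy
    (w = preferred response, l = rejected response; all shape [batch])
    """
    # Implicit reward gap:
    # h = log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)]
    h = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)

    # DPO: logistic loss. With deterministic preferences the optimum
    # pushes h -> infinity, so the KL regularisation loses its grip.
    dpo = -F.logsigmoid(beta * h).mean()

    # IPO: squared loss regressing h toward 1/(2*tau). The target is
    # finite, so the implicit reward gap stays bounded.
    ipo = ((h - 1.0 / (2.0 * tau)) ** 2).mean()
    return dpo, ipo
```

Because the squared-error target is finite, the IPO gradient vanishes once the gap reaches $1/(2\tau)$, whereas DPO's logistic loss keeps rewarding ever-larger gaps on deterministic pairs.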

Tags: alignment rlhf preference-learning

