References

SimPO: Simple Preference Optimization with a Reference-Free Reward

Yu Meng, Mengzhou Xia, & Danqi Chen (2024)

arXiv:2405.14734.

URL: https://arxiv.org/abs/2405.14734

Abstract. Introduces Simple Preference Optimization (SimPO), a DPO-style objective that requires no reference policy. SimPO replaces DPO's implicit reward $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_\text{ref}(y \mid x)}$ with the length-normalised average log-likelihood under the target policy alone, $\frac{\beta}{|y|} \log \pi_\theta(y \mid x)$, removing the second forward pass and freeing the GPU memory occupied by the frozen reference model. It also adds a target reward margin $\gamma$ to the Bradley-Terry loss. SimPO matches or exceeds DPO on AlpacaEval 2, Arena-Hard, and MT-Bench at substantially lower training cost, and is part of the post-DPO trend toward simpler preference-learning objectives.
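
For reference, a minimal PyTorch sketch of the SimPO objective as summarised above; the function name, tensor shapes, and the default values for $\beta$ and $\gamma$ are illustrative choices, not taken from the authors' released code.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,
               rejected_logps: torch.Tensor,
               chosen_len: torch.Tensor,
               rejected_len: torch.Tensor,
               beta: float = 2.0,      # reward scale (illustrative value)
               gamma: float = 0.5) -> torch.Tensor:  # target margin (illustrative)
    """SimPO: Bradley-Terry loss over length-normalised, reference-free rewards.

    chosen_logps / rejected_logps: summed token log-probs of the preferred and
    dispreferred responses under the current policy pi_theta -- a single
    forward pass, with no frozen reference model.
    chosen_len / rejected_len: response lengths |y| used for normalisation.
    """
    # Reference-free reward: average per-token log-likelihood, scaled by beta.
    r_chosen = beta * chosen_logps / chosen_len
    r_rejected = beta * rejected_logps / rejected_len
    # The margin gamma requires the chosen reward to beat the rejected one
    # by at least gamma before the loss saturates.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

# Toy batch: summed log-probs and lengths for two preference pairs.
loss = simpo_loss(torch.tensor([-40.0, -55.0]), torch.tensor([-70.0, -60.0]),
                  torch.tensor([20.0, 25.0]), torch.tensor([22.0, 24.0]))
```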

Tags: alignment rlhf preference-learning
