References

ORPO: Monolithic Preference Optimization without Reference Model

Jiwoo Hong, Noah Lee, & James Thorne (2024)

arXiv:2403.07691

URL: https://arxiv.org/abs/2403.07691

Abstract. Introduces Odds Ratio Preference Optimization (ORPO), a preference-tuning method that combines supervised fine-tuning and preference optimisation in a single training stage, without a reference model. The loss adds an odds-ratio term to the standard SFT cross-entropy: the model is rewarded for assigning higher odds to chosen responses than to rejected ones. ORPO removes the memory overhead of keeping a frozen reference model and the operational complexity of running SFT and DPO sequentially, while matching DPO performance on standard alignment benchmarks.
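Concretely, the paper's objective is L_ORPO = L_SFT + λ·L_OR, where L_OR = −log σ(log(odds_θ(y_w|x) / odds_θ(y_l|x))) and odds_θ(y|x) = P_θ(y|x) / (1 − P_θ(y|x)), with y_w the chosen and y_l the rejected response. A minimal PyTorch sketch of that loss follows; the function and tensor names are illustrative, and it assumes the log-likelihood inputs are already length-normalised per-token averages (as in the paper), not from this entry:

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              sft_nll: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """Sketch of the ORPO objective: SFT cross-entropy plus an odds-ratio term.

    chosen_logps / rejected_logps: length-normalised log-likelihoods of the
    chosen and rejected responses under the *current* policy (no frozen
    reference model is involved).
    sft_nll: the usual cross-entropy loss on the chosen response.
    lam: weight on the odds-ratio term (lambda in the paper).
    """
    # odds(y | x) = P(y | x) / (1 - P(y | x)), computed in log space.
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # L_OR = -log sigmoid(log odds(chosen) - log odds(rejected)):
    # minimised when the chosen response has higher odds than the rejected one.
    l_or = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    return (sft_nll + lam * l_or).mean()
```

Because both log-odds are taken under the same policy, the single forward pass per response replaces the SFT-then-DPO pipeline; the paper ablates λ over small values (e.g. 0.1 to 1.0), trading off how strongly the odds-ratio penalty shapes the SFT objective.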

Tags: alignment rlhf preference-learning

Cited in:
