Jiwoo Hong, Noah Lee, & James Thorne (2024)
arXiv:2403.07691.
URL: https://arxiv.org/abs/2403.07691
Abstract. Introduces Odds Ratio Preference Optimization (ORPO), a preference-tuning method that combines supervised fine-tuning and preference optimization in a single training stage with no reference model. The loss adds a λ-weighted odds-ratio term to the standard SFT cross-entropy: the model is rewarded for assigning higher odds to chosen responses than to rejected ones. ORPO removes the memory overhead of keeping a frozen reference model and the operational complexity of running SFT and DPO sequentially, while matching DPO performance on standard alignment benchmarks.
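A minimal sketch of the objective as summarised above, assuming the length-normalised (mean per-token) log-probabilities of the chosen and rejected responses are already computed; the function name orpo_loss and the default weight lam are illustrative, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """Sketch of the ORPO objective: SFT negative log-likelihood on the
    chosen response plus a lam-weighted odds-ratio term.

    chosen_logps / rejected_logps: length-normalised log-probabilities of
    the chosen and rejected responses, shape (batch,). lam is illustrative.
    """
    # log odds(y|x) = log p(y|x) - log(1 - p(y|x)), kept in log space for stability
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

    # odds-ratio term: push the chosen response to higher odds than the rejected one
    odds_ratio_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

    # standard SFT cross-entropy on the chosen response
    sft_loss = -chosen_logps

    return (sft_loss + lam * odds_ratio_loss).mean()

# Illustrative values for the mean log-probabilities of two preference pairs
chosen = torch.tensor([-0.4, -0.7])
rejected = torch.tensor([-1.2, -1.5])
print(orpo_loss(chosen, rejected))
```

Because the odds-ratio term only compares the policy's own likelihoods for the two responses, no frozen reference model is needed, which is the source of the memory and pipeline savings noted above.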
Tags: alignment rlhf preference-learning
Cited in: