Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, & Douwe Kiela (2024)
KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306.
URL: https://arxiv.org/abs/2402.01306
Abstract. Introduces Kahneman-Tversky Optimization (KTO), a preference-fine-tuning method that does not require paired preference data. Builds on the value function from prospect theory: humans evaluate outcomes asymmetrically about a reference point, weighting losses more heavily than gains. KTO trains on a single "desirable" or "undesirable" label per response and uses an asymmetric loss derived from that value function. It matches or exceeds DPO on standard benchmarks while removing the paired-preference requirement, simplifying data collection.
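A minimal sketch of the per-example KTO loss, under simplifying assumptions: the reference point `z0` (a KL-divergence estimate computed over a batch in the paper) is treated here as a fixed scalar, and the hyperparameter names `beta`, `lambda_d`, `lambda_u` follow the paper's notation but the function itself is illustrative, not the authors' implementation.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def kto_loss(logp_policy: float, logp_ref: float, desirable: bool,
             z0: float = 0.0, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Sketch of the KTO loss for one (prompt, response) example.

    logp_policy / logp_ref: log-prob of the response under the policy
    and the frozen reference model; their difference is the implied reward.
    z0 stands in for the batch-level KL reference point (assumption:
    fixed at 0 here for illustration).
    """
    r = logp_policy - logp_ref  # implied reward (log-ratio)
    if desirable:
        # gains: value saturates as the reward exceeds the reference point
        value = lambda_d * sigmoid(beta * (r - z0))
        lam = lambda_d
    else:
        # losses: mirrored, so pushing reward below z0 is what reduces loss
        value = lambda_u * sigmoid(beta * (z0 - r))
        lam = lambda_u
    return lam - value  # minimized when the value term is maximized
```

Setting `lambda_u > lambda_d` mimics loss aversion: an undesirable response kept probable costs more than a desirable one made probable gains, which is the asymmetry the entry describes.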
Tags: alignment rlhf preference-learning
Cited in: