References

KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, & Douwe Kiela (2024)

arXiv:2402.01306.

URL: https://arxiv.org/abs/2402.01306

Abstract. Introduces Kahneman-Tversky Optimization (KTO), a preference fine-tuning method that does not require paired preference data. Builds on prospect theory's value function: humans evaluate outcomes asymmetrically about a reference point, with losses weighted more heavily than gains. KTO trains on simple "good" or "bad" labels per response and uses an asymmetric loss derived from prospect theory. KTO matches or exceeds DPO on standard benchmarks while removing the labelled-pairs requirement, simplifying data collection.
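The asymmetric per-example loss the abstract describes can be sketched as follows. This is a simplified reading of the paper's formulation, not the reference implementation: `log_ratio` stands in for the implied reward (log-probability of the response under the policy minus the reference model), and `z0` stands in for the reference point, which the paper estimates as a batch-level KL term; the defaults here are hypothetical.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def kto_loss(log_ratio: float, desirable: bool,
             z0: float = 0.0, beta: float = 0.1,
             lam_d: float = 1.0, lam_u: float = 1.0) -> float:
    """Per-example KTO-style loss on a single binary-labelled response.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x), the implied reward.
    desirable: True for a "good" label, False for a "bad" label.
    z0:        reference point (batch-level KL estimate in the paper).
    lam_d/lam_u: weights for desirable vs undesirable examples; setting
                 lam_u > lam_d weights losses more heavily than gains,
                 mirroring the prospect-theoretic asymmetry.
    """
    if desirable:
        # Gain side: loss shrinks as the policy raises the response's
        # likelihood above the reference point.
        return lam_d * (1.0 - sigmoid(beta * (log_ratio - z0)))
    # Loss side: mirrored around the reference point, so raising a "bad"
    # response's likelihood increases the loss.
    return lam_u * (1.0 - sigmoid(beta * (z0 - log_ratio)))
```

Note that each example contributes on its own: no paired "chosen vs rejected" comparison is needed, which is the data-collection simplification the abstract highlights.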

Tags: alignment rlhf preference-learning

Cited in:
