Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, & Douwe Kiela (2024)
KTO: Model Alignment as Prospect Theoretic Optimization. arXiv:2402.01306.
URL: https://arxiv.org/abs/2402.01306
Abstract. Introduces Kahneman-Tversky Optimization (KTO), a preference-fine-tuning method that does not require paired preference data. Builds on the value function from prospect theory: humans evaluate outcomes asymmetrically about a reference point, weighting losses more heavily than gains. KTO trains on a single "desirable" or "undesirable" label per response and uses an asymmetric loss derived from that value function. It matches or exceeds DPO on standard benchmarks while removing the paired-preference requirement, simplifying data collection.
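A minimal sketch of the per-example KTO loss, under simplifying assumptions: the reference point `z0` (a KL-divergence estimate computed over a batch in the paper) is treated here as a fixed scalar, and the hyperparameter names `beta`, `lambda_d`, `lambda_u` follow the paper's notation but the function itself is illustrative, not the authors' implementation.

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def kto_loss(logp_policy: float, logp_ref: float, desirable: bool,
             z0: float = 0.0, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Sketch of the KTO loss for one (prompt, response) example.

    logp_policy / logp_ref: log-prob of the response under the policy
    and the frozen reference model; their difference is the implied reward.
    z0 stands in for the batch-level KL reference point (assumption:
    fixed at 0 here for illustration).
    """
    r = logp_policy - logp_ref  # implied reward (log-ratio)
    if desirable:
        # gains: value saturates as the reward exceeds the reference point
        value = lambda_d * sigmoid(beta * (r - z0))
        lam = lambda_d
    else:
        # losses: mirrored, so pushing reward below z0 is what reduces loss
        value = lambda_u * sigmoid(beta * (z0 - r))
        lam = lambda_u
    return lam - value  # minimized when the value term is maximized
```

Setting `lambda_u > lambda_d` mimics loss aversion: an undesirable response kept probable costs more than a desirable one made probable gains, which is the asymmetry the entry describes.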
Tags: alignment rlhf preference-learning
Cited in: