Glossary

RLHF

Also known as: Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is the most influential technique for aligning large language models with human values. A language model trained only on next-token prediction is not helpful, harmless, or honest by default; it is simply a statistical model of text, capable of generating continuations in any style, including toxic or misleading ones. RLHF adjusts the model to produce responses that humans actually prefer.

The pipeline has three stages. First, a pre-trained model that has undergone supervised fine-tuning (SFT) generates several responses to each prompt, and human annotators rank them according to criteria such as helpfulness, accuracy, and safety. Second, a reward model, typically a transformer comparable in size to the language model itself, is trained on this preference data to predict which response a human would prefer. Third, the language model is optimised to maximise the reward model's score using Proximal Policy Optimisation (PPO), with a KL-divergence penalty that keeps the policy close to the SFT distribution, limiting reward hacking and degenerate outputs.
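The two objectives at the heart of stages two and three can be sketched in a few lines. Below is a minimal, illustrative Python sketch (function names and the `beta` value are assumptions for illustration, not from a specific library): the reward model is trained with a Bradley-Terry pairwise loss, and the PPO stage combines the reward model's score with a per-token KL penalty against the SFT reference model.

```python
import math

def reward_model_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    Minimised when the reward model scores the human-preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def rlhf_reward(rm_score: float, logprob_policy: float,
                logprob_ref: float, beta: float = 0.1) -> float:
    """Reward signal used in the PPO stage: the reward model's score minus a
    KL penalty that keeps the policy close to the SFT reference model.
    (logprob_policy - logprob_ref) is a simple per-token KL estimate."""
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate
```

Note the trade-off `beta` controls: too small and the policy drifts into reward-hacked gibberish the reward model happens to score highly; too large and the model barely changes from the SFT baseline.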

RLHF was the key ingredient in making systems like ChatGPT and Claude usable by the general public—the difference between a base model and an aligned model is dramatic. Direct Preference Optimisation (DPO) offers a simpler alternative that eliminates the explicit reward model and the instabilities of RL, training directly on preference data with a classification-like loss. Constitutional AI (CAI) reduces reliance on human labelling by having the model critique and revise its own outputs according to explicit principles. Despite its successes, RLHF faces challenges: reward hacking, specification gaming, the difficulty of representing diverse values, and the open problem of scalable oversight.

Related terms: AI Alignment, Large Language Model

Discussed in:

Also defined in: Textbook of AI