HH-RLHF (Helpful and Harmless Reinforcement Learning from Human Feedback, Bai, Jones, Ndousse et al., arXiv:2204.05862, April 2022) is Anthropic's public preference-comparison dataset, released alongside the paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. It is the foundational open RLHF training corpus and the model for almost every subsequent preference-comparison dataset.
Construction
Anthropic recruited crowdworkers through Surge AI and Upwork and gave them two annotation interfaces:
- Helpfulness: open-ended dialogues in which the worker asked an AI assistant for help with everyday tasks (writing, advice, summarisation, coding) and chose which of two model-generated responses they preferred.
- Harmlessness (red-teaming): adversarial dialogues in which workers attempted to elicit harmful behaviour (illegal advice, self-harm, deception) and rated which of two responses better refused or deflected the request.
Each annotation yields a pairwise comparison (chosen, rejected) over completions sampled from various Anthropic model checkpoints at temperature 1.0. The final release contains roughly 170,000 comparisons: ~118K helpfulness and ~42K harmlessness, plus a separate set of red-team attack transcripts (single dialogues rather than comparisons) distributed in the same repository.
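A minimal sketch of how the comparisons look in practice, assuming the standard Hugging Face datasets API (the chosen/rejected field names follow the dataset card, as does the per-subset directory name used below):

```python
from datasets import load_dataset

# Full preference data: train/test splits of (chosen, rejected) transcript pairs.
hh = load_dataset("Anthropic/hh-rlhf")

example = hh["train"][0]
# Each field is a complete dialogue ("\n\nHuman: ...\n\nAssistant: ..."),
# identical up to the final assistant turn; "chosen" is the preferred ending.
print(example["chosen"])
print(example["rejected"])

# Individual subsets (helpfulness vs. red-team harmlessness comparisons) can be
# loaded via per-subset directories; "harmless-base" is the name on the dataset card.
harmless = load_dataset("Anthropic/hh-rlhf", data_dir="harmless-base")
```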
Licensing
Released on Hugging Face under the MIT licence at https://huggingface.co/datasets/Anthropic/hh-rlhf. The dataset is freely usable for research or commercial purposes.
Models trained on HH-RLHF
HH-RLHF was used to train Anthropic's early Claude predecessors (the helpful-and-harmless assistants that preceded Claude 1.0). The open community has adopted it as a default RLHF training set: OpenAssistant, trlX demos, DPO baselines (Rafailov et al., 2023 evaluated DPO on HH-RLHF dialogue), IPO, KTO and the Hugging Face TRL library examples all train against it. It remains the default open preference dataset for academic alignment research.
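Because DPO is now one of the most common ways the dataset is consumed, a minimal sketch of the DPO objective over an HH-RLHF preference pair may help. This is an illustrative PyTorch snippet under the usual formulation, not the implementation from Rafailov et al.; the log-probabilities are assumed to be computed elsewhere by scoring each full transcript under the policy and the frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is the summed log-probability of the chosen / rejected
    completion under the policy or the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximise the margin between the implicit rewards of chosen and rejected.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy illustration with made-up log-probabilities for two comparison pairs.
loss = dpo_loss(torch.tensor([-12.3, -9.8]), torch.tensor([-15.1, -11.2]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.5, -11.0]))
print(loss.item())
```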
Significance and limitations
HH-RLHF was the first large open preference dataset for general assistant dialogue, and the Bai et al. paper helped standardise the template for the now-familiar RLHF pipeline: pretrain → SFT on demonstrations → train a reward model on preference comparisons → RL fine-tune the policy against the reward model with a KL penalty. Almost every subsequent open RLHF system follows this recipe.
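A compact sketch of the two learning objectives in that pipeline, under the standard Bradley-Terry formulation; the tensors below are placeholder values standing in for model outputs, not anything shipped with the dataset.

```python
import torch
import torch.nn.functional as F

# 1. Reward model: scalar scores for chosen and rejected completions,
#    trained with the pairwise Bradley-Terry loss.
r_chosen = torch.tensor([1.7, 0.4])      # reward-model scores on chosen completions
r_rejected = torch.tensor([0.9, -0.2])   # reward-model scores on rejected completions
rm_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# 2. RL fine-tuning: the policy is optimised (e.g. with PPO) against the
#    reward-model score minus a KL penalty that keeps it near the SFT policy.
beta = 0.02
logp_policy = torch.tensor([-35.2, -28.1])  # log-prob of a sampled completion under the policy
logp_sft = torch.tensor([-36.0, -27.5])     # log-prob under the frozen SFT model
rl_reward = r_chosen - beta * (logp_policy - logp_sft)

print(rm_loss.item(), rl_reward)
```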
The dataset has known limitations. Rater demographics are heavily English-speaking and US-centric. Helpfulness preferences drift over time as workers become familiar with model failure modes. Harmlessness preferences capture circa-2022 notions of harm, which have shifted substantially under newer policy frameworks. Reward models trained on HH-RLHF show predictable reward-hacking failure modes: verbose, hedging completions consistently win over concise, direct answers. Finally, the harmlessness data is small (~42K comparisons) and each comparison is labelled by a single rater, so label noise is substantial.
Despite these limitations, HH-RLHF remains the most important reference RLHF dataset, and the Bai et al. paper is one of the most-cited works in the alignment literature.
Related terms: RLHF, Constitutional AI, OpenAssistant Conversations (OASST), Language Model
Discussed in:
- Chapter 14: Generative Models, Alignment and RLHF
- Chapter 16: Ethics & Safety, Ethics, Safety and Alignment