Glossary

Constitutional AI Dataset

The Constitutional AI dataset is the synthetic critique-and-revise corpus produced by Anthropic's Constitutional AI method (Bai, Kadavath, Kundu et al., arXiv:2212.08073, December 2022). It is the training substrate for Claude's harmlessness alignment and the foundational example of RLAIF (Reinforcement Learning from AI Feedback).

The Constitutional AI procedure

Constitutional AI replaces human harmlessness labels with a two-stage pipeline:

  1. SL-CAI (Supervised Learning from Constitutional AI): Given an initial harmful response from a helpful-only model on a red-team prompt, prompt the same model with a randomly chosen constitutional principle, asking it first to critique the response and then to revise it to comply with the principle. The (prompt, revised-response) pairs form the SL-CAI dataset, which is used for supervised fine-tuning.

  2. RL-CAI (RL from AI Feedback): Use the SL-CAI model to generate paired completions for new prompts; have a separate AI model classify which completion better satisfies a randomly sampled constitutional principle; train a reward model on these AI-generated preferences; then RL fine-tune the policy against the reward model.
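The SL-CAI data-generation step can be sketched as follows. This is a minimal illustration, not Anthropic's actual pipeline: the `critique_revise` function, the prompt wording, and the two-item principle list are all assumptions, and `toy_model` is a stand-in for a real LLM call.

```python
import random

# Two illustrative principles; the real constitution is far longer.
PRINCIPLES = [
    "Please choose the response that is the least harmful.",
    "Please choose the response that is the most honest and truthful.",
]

def critique_revise(model, prompt, response, principles, rng=random):
    """One SL-CAI data-generation step: sample a principle, ask the model
    to critique `response` against it, then ask for a compliant revision.
    `model(text) -> text` stands in for a real LLM call."""
    principle = rng.choice(principles)
    critique = model(
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
        "Critique the response against the principle."
    )
    revision = model(
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
        f"Critique: {critique}\nRewrite the response to comply with the principle."
    )
    return (prompt, revision)  # one (prompt, revised-response) SL-CAI pair

# Toy stand-in model, for demonstration only.
def toy_model(text):
    return "[revised answer]" if "Rewrite" in text else "[critique]"

pair = critique_revise(toy_model, "red-team prompt", "harmful draft", PRINCIPLES)
print(pair)  # ('red-team prompt', '[revised answer]')
```

In the real procedure the critique-revise loop can be iterated several times before the final revision is kept for supervised fine-tuning.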

The constitution itself is a list of roughly 75 short natural-language principles drawn from sources including the UN Declaration of Human Rights, Apple's Terms of Service, DeepMind's Sparrow rules, and input from Anthropic's own staff.
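The RL-CAI feedback step samples one of these principles per comparison and asks a judge model to pick the better completion. A minimal sketch under the same assumptions as before (`ai_preference_label`, the prompt wording, and the A/B answer format are illustrative, not Anthropic's actual implementation):

```python
import random

def ai_preference_label(judge, prompt, resp_a, resp_b, principles, rng=random):
    """RL-CAI feedback step: a judge model is asked which of two completions
    better satisfies a randomly sampled constitutional principle; the verdict
    yields one (chosen, rejected) pair for reward-model training.
    `judge(text) -> "A" or "B"` stands in for a real LLM call."""
    principle = rng.choice(principles)
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {resp_a}\n(B) {resp_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    chosen, rejected = (resp_a, resp_b) if verdict.strip() == "A" else (resp_b, resp_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

# Toy judge that always prefers option A, for demonstration only.
example = ai_preference_label(
    lambda text: "A", "some prompt", "safe reply", "unsafe reply",
    ["Please choose the response that is the least harmful."],
)
print(example["chosen"])  # safe reply
```

The resulting (chosen, rejected) pairs play the same role that human preference labels play in ordinary RLHF: they train the reward model that the policy is then optimized against.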

Scale

Anthropic does not publish the SL-CAI or RL-CAI corpora in raw form, but the Bai et al. paper reports approximately 180 K SL-CAI critique-revise pairs and roughly 250 K RL-CAI preference comparisons for the published Claude predecessor experiments. Production-scale figures for Claude 1, 2 and 3 are not disclosed.

Licensing and availability

The Constitutional AI method is fully described in the published paper. The dataset itself is proprietary to Anthropic and has not been released. The community has approximated it with several open re-implementations: Constitutional AI from First Principles (Hugging Face TRL examples), SeaLLMs constitutional fine-tuning data, Aya Constitutional (Cohere For AI), and Stanford Direct CAI datasets.

Models trained on Constitutional AI data

The Constitutional AI dataset trained Claude 1.0 (March 2023) and underpins the alignment of Claude 2.0, Claude 2.1, and the Claude 3 model family (Haiku, Sonnet, Opus, March 2024), as well as subsequent Claude generations, including Claude 4 (announced 2025). Note that DeepMind's Sparrow (2022) predates Constitutional AI; its rules were a source for the constitution rather than a downstream application. The method has also informed alignment research efforts at Cohere and AI2.

Significance and critique

Constitutional AI demonstrated that explicit textual rules, applied through self-critique, can substitute for the expensive human harmlessness annotation pipeline behind HH-RLHF. The approach has critics: the principles are themselves subjective and culturally bounded; the self-supervision loop may amplify rather than correct model biases; and the claimed legibility is arguably illusory, since what matters operationally is the policy's behaviour, not whether one can recite a constitution. Nonetheless, Constitutional AI has been the most influential post-RLHF alignment proposal of the 2022-2024 period.

Related terms: Anthropic HH-RLHF, Constitutional AI, RLHF, Language Model
