Constitutional AI (CAI) is Anthropic's alternative to standard RLHF (Bai et al. 2022). The core idea: replace human harmfulness labels with AI-generated critiques, revisions, and preference judgments grounded in a written constitution of natural-language principles. Combined with standard SFT and helpfulness RLHF, this produces aligned models without exhausting human raters on harmfulness work.
Constitution example principles (paraphrased from the original):
- "Choose the response that is most helpful, honest and harmless."
- "Choose the response a wise, ethical, polite and friendly person would more likely say."
- "Choose the response least likely to be threatening or aggressive."
- "Choose the response that is least toxic and racist or sexist."
Stage 1: Constitutional supervised fine-tuning (SL-CAI).
Starting from a helpful-only RLHF model:
- Sample a harmful prompt from a dataset.
- Generate an initial response (often harmful, since the helpful-only model has not been trained to refuse).
- Critique: prompt the model with a constitutional principle and ask it to identify ways the response violates the principle.
- Revise: prompt the model to rewrite the response addressing the critique.
- Optionally iterate critique-revision multiple times.
- Fine-tune on (harmful prompt → final revised response) pairs.
Result: an SL-CAI model that produces less harmful responses out of the box, trained without human harmfulness labels.
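A minimal sketch of this data-generation loop in Python, assuming a hypothetical `generate(prompt)` completion function standing in for the helpful-only model (the prompt templates and function names are illustrative, not the paper's exact formats):

```python
import random

# Hypothetical completion call; stands in for sampling from the
# helpful-only RLHF model. Not a real API.
def generate(prompt: str) -> str:
    raise NotImplementedError

def sl_cai_example(harmful_prompt: str, principles: list[str],
                   n_rounds: int = 2) -> tuple[str, str]:
    """Run the critique-revision loop once; return (prompt, final revision)."""
    response = generate(harmful_prompt)  # often harmful: no refusal training yet
    for _ in range(n_rounds):
        principle = random.choice(principles)  # sample a fresh principle each round
        critique = generate(
            f"Prompt: {harmful_prompt}\nResponse: {response}\n"
            f"Identify ways the response violates this principle: {principle}\n"
            f"Critique:"
        )
        response = generate(
            f"Prompt: {harmful_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            f"Rewrite the response to address the critique.\nRevision:"
        )
    # The (prompt -> final revision) pair becomes SL-CAI fine-tuning data.
    return harmful_prompt, response
```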
Stage 2: Reinforcement learning from AI feedback (RLAIF).
Replaces human-preference data collection with AI-generated preferences:
- For each harmful prompt, generate a pair of responses from the SL-CAI model.
- Prompt the model with a constitutional principle, framed as a multiple-choice question, and ask which response better satisfies it. The output is a probability distribution over (A, B); the normalized probabilities serve as soft preference labels, with the log-odds measuring preference strength (sketched after this list).
- Train a preference / reward model on these AI-generated preferences:
$$\mathcal{L}_R = -\mathbb{E}\!\left[\log \sigma(r(x, y_w) - r(x, y_l))\right]$$
where $y_w$ is the AI-preferred response and $y_l$ the rejected one.
- RL fine-tune the policy with PPO against the AI-generated reward model:
$$\max_\theta \mathbb{E}\!\left[r_\phi(x, y) - \beta D_\mathrm{KL}(\pi_\theta(\cdot | x) \| \pi_\mathrm{ref}(\cdot | x))\right]$$
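A sketch of the AI feedback step, assuming a Hugging Face-style causal LM whose next-token distribution over the answer letters "A" and "B" yields the soft preference label (the question template is paraphrased, not the paper's exact wording):

```python
import torch
import torch.nn.functional as F

def ai_preference(model, tokenizer, prompt, resp_a, resp_b, principle):
    """Return P(A is better) under the feedback model, used as a soft label."""
    question = (
        f"Consider this principle: {principle}\n"
        f"Prompt: {prompt}\n(A) {resp_a}\n(B) {resp_b}\n"
        f"Which response better satisfies the principle? Answer: ("
    )
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    id_a = tokenizer.convert_tokens_to_ids("A")
    id_b = tokenizer.convert_tokens_to_ids("B")
    # Renormalize over just the two answer tokens; the gap
    # logits[id_a] - logits[id_b] is the log-odds preference strength.
    probs = F.softmax(torch.stack([logits[id_a], logits[id_b]]), dim=0)
    return probs[0].item()
```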
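And a sketch of the two objectives above in PyTorch: the Bradley-Terry preference loss on AI-labeled pairs, and the KL-penalized reward that PPO maximizes (beta and the tensor shapes are placeholder assumptions):

```python
import torch
import torch.nn.functional as F

def preference_loss(r_w: torch.Tensor, r_l: torch.Tensor) -> torch.Tensor:
    """L_R = -E[log sigma(r(x, y_w) - r(x, y_l))] over a batch of scores."""
    return -F.logsigmoid(r_w - r_l).mean()

def kl_penalized_reward(reward: torch.Tensor,
                        logprobs_policy: torch.Tensor,
                        logprobs_ref: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus beta times a KL estimate, where the KL is
    approximated from the sampled tokens' log-prob gap (a common
    single-sample estimator) and summed over the sequence."""
    kl = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return reward - beta * kl
```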
Why this works:
- Models are good at evaluating responses against explicit natural-language principles (often better than at producing aligned responses unprompted).
- The constitution is transparent and editable: researchers can change the document and retrain.
- Reduces dependence on human raters for harmfulness work, which is psychologically taxing and harder to scale than helpfulness rating.
Modern CAI variants:
Collective Constitutional AI (Anthropic 2023): the constitution is drafted with public input rather than internally. ~1000 representative US adults voted on principles in a deliberative-democracy exercise; the resulting constitution was used to train a Claude variant.
Targeted CAI: separate constitutions for separate failure modes (helpfulness, honesty, harmlessness, ethics, refusal calibration). Train sequentially or jointly.
CAI for capabilities: extend the methodology to capability traits, not just safety. Apply constitutional principles for, say, formatting style, level of detail, or domain-specific behaviours.
Limitations:
- Reward hacking remains possible: AI feedback inherits the biases of the model generating it, and models may exploit blind spots in their own evaluation.
- Principle conflicts: when principles conflict (helpful vs. harmless on a borderline request), the resolution depends on prompt phrasing. Tuning the constitution is iterative.
- Doesn't solve outer alignment: the constitution is a human artefact and inherits the limitations of any explicit specification.
CAI is the foundation of the Claude family's training pipeline and has been adopted in modified forms by other labs. It is the leading example of using AI-generated feedback to scale alignment beyond the human-rating bottleneck.
Related terms: Constitutional AI, RLHF, Dario Amodei, Anthropic
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety