15.12 Constitutional AI
Anthropic's Constitutional AI (CAI; Bai et al., 2022) addresses a scaling problem with RLHF: it needs human labels for every preference comparison. As models become more capable, those labels become harder, slower, and more expensive to collect, because labellers must evaluate complex multi-step reasoning, code, or specialised content.
CAI replaces a large fraction of human labels with model-generated labels guided by a constitution, a small set of natural-language principles. The pipeline has two halves.
Critique-and-revise (SL stage)
For a prompt $x$ that might produce a problematic response, the model:
- produces an initial response $y$;
- critiques $y$ against a randomly sampled principle from the constitution, e.g. "Identify ways in which this response is harmful, racist, sexist or otherwise socially biased";
- revises $y$ in light of the critique, producing $y'$;
- is then trained with SFT on $(x, y')$.
This is supervised, but the supervision comes from the model itself, applying explicit principles to its own output. The result is a model that has internalised the principles into its first-pass behaviour, and it serves as the starting point for the RL stage.
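In code, one pass of the loop is straightforward. The sketch below is a minimal Python rendering under stated assumptions: `generate(prompt)` is a hypothetical helper that samples a completion from the current model, the two-principle constitution is a toy stand-in for the real document, and the prompt templates paraphrase rather than reproduce those in Bai et al. (2022).

```python
import random

# A toy two-principle constitution; the real document has a few
# dozen natural-language principles (see "The constitution" below).
CONSTITUTION = [
    "Identify ways in which this response is harmful, racist, "
    "sexist or otherwise socially biased.",
    "Identify ways in which this response encourages illegal, "
    "unethical or dangerous activity.",
]

def critique_and_revise(x, generate):
    """One pass of the SL stage for a single prompt x.

    `generate` is a hypothetical callable that samples a completion
    from the current model given a text prompt. Returns the (x, y')
    pair used for SFT.
    """
    # 1. Initial response y.
    y = generate(x)

    # 2. Critique y against a randomly sampled principle.
    principle = random.choice(CONSTITUTION)
    critique = generate(
        f"Prompt: {x}\nResponse: {y}\n"
        f"Critique request: {principle}\nCritique:"
    )

    # 3. Revise y in light of the critique, producing y'.
    y_revised = generate(
        f"Prompt: {x}\nResponse: {y}\nCritique: {critique}\n"
        "Revision request: rewrite the response to address the "
        "critique.\nRevision:"
    )

    # 4. Only (x, y') enters the SFT dataset; the critique is
    #    scaffolding and is discarded.
    return x, y_revised
```

Note that only the $(x, y')$ pair survives into the SFT data; the critique is scaffolding. Bai et al. also chain several critique-revision rounds per prompt, critiquing each revision afresh.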
RLAIF
The reinforcement-learning stage of CAI uses reinforcement learning from AI feedback (RLAIF). A model is shown two candidate responses to a prompt, asked to choose the one that better satisfies a randomly selected principle, and the resulting AI preferences are used to train a reward model (just like RLHF, but with AI labels). PPO then optimises against the reward model.
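As a sketch, under the same assumptions as above (a hypothetical `generate` callable; illustrative prompt wording rather than Anthropic's), the labelling step looks like the following. One simplification to flag: the code takes a hard A/B choice, whereas the paper uses the feedback model's normalised probability over the two options as a soft label.

```python
import random

def ai_preference_label(x, y_a, y_b, principle, generate):
    """Ask the feedback model which of two candidate responses
    better satisfies a principle; returns (chosen, rejected).
    """
    prompt = (
        f"Consider the following conversation:\n{x}\n\n"
        f"{principle}\n"
        f"Options:\n(A) {y_a}\n(B) {y_b}\n"
        "The answer is: ("
    )
    choice = generate(prompt)  # expected to begin with "A" or "B"
    if choice.strip().startswith("A"):
        return y_a, y_b
    return y_b, y_a

def build_preference_dataset(prompts, generate, constitution):
    """Sample two responses per prompt and label each pair against
    a randomly drawn principle.
    """
    data = []
    for x in prompts:
        y_a, y_b = generate(x), generate(x)
        principle = random.choice(constitution)
        chosen, rejected = ai_preference_label(
            x, y_a, y_b, principle, generate
        )
        data.append((x, chosen, rejected))
    return data
```

From the resulting (prompt, chosen, rejected) triples onward, the pipeline is the standard RLHF recipe: fit a reward model to the preferences and run PPO against it.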
The AI feedback is dramatically cheaper than human feedback and, on dimensions covered by the constitution, comparably accurate. CAI also makes the principles explicit and inspectable: the constitution is a document that humans can read and revise.
The constitution
Anthropic's published constitution mixes principles from sources including the UN Declaration of Human Rights, Apple's terms of service, lab safety norms, and a number of specific principles tailored to the model's intended use. The combination is eclectic by design: no single source is comprehensive, and the principles are stated as natural-language imperatives rather than formal rules.
Related approaches and subsequent variants include:
- Constitutional AI with deliberation (Bai et al., 2024), in which the model is given longer to reason about the principle before applying it;
- Collective Constitutional AI (Anthropic, 2023), which experimented with constitutions written by groups of citizens via Polis;
- Sparrow rules (DeepMind, 2022), a similar but distinct approach using explicit rules during RLHF.
By 2026 some form of AI feedback is standard in major labs' alignment pipelines. The pure-RLHF pipeline with all-human labels is largely a historical artefact, used only for the final polish stage.