Constitutional AI (CAI), introduced by Bai et al. at Anthropic in 2022, is a method for aligning large language models that uses AI-generated feedback rather than human feedback for the harmlessness component of training. A "constitution" of natural-language principles, covering desired behaviours such as honesty, harmlessness, and helpfulness, is used both to critique model outputs and to revise them; the model is then trained on these self-generated improvements.
The pipeline has two stages (a minimal sketch of both appears below):

1. Constitutional SFT: starting from an SFT model, generate responses to harmful prompts, prompt the model to critique each response against the constitution, prompt it to revise the response accordingly, and fine-tune on the revised responses.
2. RLAIF (Reinforcement Learning from AI Feedback): train a preference model on AI-generated comparisons, in which the model judges pairs of responses against constitutional principles, then optimise the policy with RL against that preference model.
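To make the two stages concrete, here is a minimal Python sketch. The `generate` function, the prompt templates, and the principle texts are all illustrative assumptions standing in for a real model call and Anthropic's actual prompts; only the loop structure follows the pipeline described above.

```python
import random

# Illustrative constitutional principles (paraphrased; not Anthropic's exact text).
PRINCIPLES = [
    "Identify specific ways in which the response is harmful, unethical, or dishonest.",
    "Identify ways the response could be more helpful without becoming harmful.",
]


def generate(prompt: str) -> str:
    """Placeholder for a completion call to the SFT model (swap in a real API)."""
    return f"<model output for: {prompt[:40]}...>"


def critique_and_revise(harmful_prompt: str, n_rounds: int = 1) -> str:
    """Stage 1: run the critique -> revision loop; return the final revision."""
    response = generate(harmful_prompt)
    for _ in range(n_rounds):
        principle = random.choice(PRINCIPLES)  # sample one principle per round
        critique = generate(
            f"Prompt: {harmful_prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = generate(
            f"Prompt: {harmful_prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\n"
            "Revision:"
        )
    return response  # (prompt, revision) pairs become stage-1 fine-tuning data


def ai_preference_label(prompt: str, response_a: str, response_b: str) -> str:
    """Stage 2: ask the model which response better satisfies a principle,
    producing an AI-generated comparison for preference-model training."""
    principle = random.choice(PRINCIPLES)
    verdict = generate(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        "Which response better follows the principle? Answer A or B:"
    )
    return "A" if "A" in verdict.upper() else "B"
```

The key design point the sketch illustrates is that the same model supplies the responses, the critiques, and the preference judgements; only the constitution is human-authored.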
The approach reduces dependence on expensive human raters for harmfulness judgements, allows alignment principles to be modified without re-collecting human-feedback data, and provides transparent natural-language documentation of the alignment objective. Constitutional AI is foundational to the Claude training pipeline at Anthropic.
Critics have noted that AI-generated feedback inherits the biases of the underlying model and may amplify them through the self-supervision loop. The technique has nonetheless been widely adopted and inspired the broader RLAIF literature.
Related terms: RLHF, Dario Amodei, AI Alignment
Discussed in:
- Chapter 16: Ethics & Safety, AI Safety