Constitutional AI: Harmlessness from AI Feedback
Bai, Kadavath, Kundu, et al. (2022)
What it says
Anthropic proposes a two-stage alternative to collecting large volumes of human harmlessness labels. First, an SFT stage in which the model critiques and revises its own outputs against a written constitution of principles. Second, an RL stage (“RLAIF”) in which preference labels between two model outputs are generated by another model comparing them to the constitution; those synthetic preferences then train a reward model exactly as human preferences would. Humans still provide helpfulness labels, but no harmlessness labels.
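The two stages can be sketched as a toy loop. Everything here is a hedged illustration, not the paper’s implementation: `harm_score` is a trivial word-count stand-in for a model’s judgment, and `critique_and_revise` / `ai_preference` are hypothetical names for the critique-revision step and the AI preference labeler.

```python
# Toy sketch of the two CAI stages. All names and the scoring rule are
# illustrative assumptions, not the actual method from the paper.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that avoids giving dangerous instructions.",
]

BANNED = {"dangerous", "harmful"}  # crude stand-in for a harmfulness signal


def harm_score(text: str) -> int:
    """Toy proxy for model-judged harmfulness: count flagged words."""
    return sum(word in BANNED for word in text.lower().split())


# Stage 1 (SFT): the model critiques its own draft against each principle,
# then revises it; the (prompt, final revision) pairs become SFT data.
def critique_and_revise(draft: str) -> str:
    revision = draft
    for principle in CONSTITUTION:
        if harm_score(revision) > 0:
            # A real system would prompt the model with `principle` here;
            # this toy version just strips the flagged words.
            revision = " ".join(
                w for w in revision.split() if w.lower() not in BANNED
            )
    return revision


# Stage 2 (RLAIF): an AI labeler picks which of two responses better
# satisfies the constitution; these synthetic labels train the reward model.
def ai_preference(response_a: str, response_b: str) -> int:
    """Return 0 if A is preferred, 1 if B is preferred."""
    return 0 if harm_score(response_a) <= harm_score(response_b) else 1
```

The key structural point the sketch preserves: stage 1 produces supervised targets from self-revision, and stage 2 replaces the human preference labeler with a model that consults the same written principles.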
Why it matters
Constitutional AI is a core ingredient of Claude’s training and one of the more influential alignment papers of the post-ChatGPT era. It popularized RLAIF (RL from AI feedback), showed that written principles can make model behavior more auditable than implicit labeler preferences, and made the case that scalable oversight is possible without infinite labeling budgets.
Read next
- InstructGPT (Ouyang et al., 2022) — the classical RLHF recipe CAI builds on.
- Sparrow (Glaese et al., 2022) — DeepMind’s rule-based alignment approach.
- Scalable oversight research (Bowman et al., 2022) — the broader motivation for AI-assisted supervision.