arXiv 2022

Constitutional AI: Harmlessness from AI Feedback

Bai, Kadavath, Kundu, et al.

TL;DR
Replace most human harmlessness labels with model-generated self-critiques guided by a written set of principles (a "constitution"). This scales alignment data collection and makes the governing rules more transparent.

What it says

Anthropic proposes a two-stage alternative to collecting lots of human harmlessness labels. First, an SFT stage where the model critiques and revises its own outputs against a written constitution of principles. Second, an RL stage (“RLAIF”) where preference labels between two model outputs are generated by another model comparing them to the constitution, and those synthetic preferences train a reward model the same way human preferences would. Humans still provide helpfulness labels but not harmlessness labels.
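The two stages above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: `generate` is a hypothetical stand-in for a real language-model call, and the two-principle `CONSTITUTION` is a placeholder for the paper's full list.

```python
import random

# Placeholder principles; the paper uses a longer written constitution.
CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that avoids assisting with dangerous activities.",
]

def generate(prompt: str) -> str:
    """Hypothetical LM call; a real system would query a model here."""
    return f"[model output for: {prompt[:40]}...]"

def critique_and_revise(prompt: str, n_rounds: int = 2) -> dict:
    """Stage 1 (SFT data): the model critiques and revises its own draft
    against a randomly sampled constitutional principle each round."""
    response = generate(prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Critique this response against the principle "
            f"'{principle}':\n{response}"
        )
        response = generate(
            f"Revise the response to address the critique:\n"
            f"{critique}\nOriginal:\n{response}"
        )
    # The (prompt, final revision) pair becomes SFT training data.
    return {"prompt": prompt, "revision": response}

def ai_preference_label(prompt: str, resp_a: str, resp_b: str) -> str:
    """Stage 2 (RLAIF data): a model picks which of two responses better
    satisfies a principle; the synthetic label trains a reward model."""
    principle = random.choice(CONSTITUTION)
    verdict = generate(
        f"Per the principle '{principle}', which response to "
        f"'{prompt}' is better: (A) {resp_a} or (B) {resp_b}? Answer A or B."
    )
    return "A" if "A" in verdict else "B"
```

In the real pipeline, Stage 1 pairs are used for supervised fine-tuning, and Stage 2 labels replace human harmlessness comparisons when training the preference/reward model; human labels are still collected for helpfulness.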

Why it matters

Constitutional AI is a core ingredient of Claude’s training and one of the more influential alignment papers of the post-ChatGPT era. It popularized RLAIF (RL from AI feedback), showed that written principles can make model behavior more auditable than implicit labeler preferences, and made the case that scalable oversight is possible without infinite labeling budgets.

Related reading

  • InstructGPT (Ouyang et al., 2022) — the classic RLHF recipe that CAI builds on.
  • Sparrow (Glaese et al., 2022) — DeepMind’s rule-based alignment approach.
  • Scalable oversight (Bowman et al., 2022) — the broader motivation for AI-assisted supervision.