Constitutional AI: Harmlessness from AI Feedback
Bai, Kadavath, Kundu, et al. (2022)
What it says
Anthropic proposes a two-stage alternative to collecting large volumes of human harmlessness labels. First, an SFT stage in which the model critiques and revises its own outputs against a written constitution of principles. Second, an RL stage (“RLAIF”) in which preference labels between two model outputs are generated by another model comparing them to the constitution; those synthetic preferences then train a reward model exactly as human preferences would. Humans still provide helpfulness labels, but no harmlessness labels.
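The two stages can be sketched as a toy loop. Everything here is a hedged illustration, not the paper’s implementation: `harm_score` is a trivial word-count stand-in for a model’s judgment, and `critique_and_revise` / `ai_preference` are hypothetical names for the critique-revision step and the AI preference labeler.

```python
# Toy sketch of the two CAI stages. All names and the scoring rule are
# illustrative assumptions, not the actual method from the paper.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that avoids giving dangerous instructions.",
]

BANNED = {"dangerous", "harmful"}  # crude stand-in for a harmfulness signal


def harm_score(text: str) -> int:
    """Toy proxy for model-judged harmfulness: count flagged words."""
    return sum(word in BANNED for word in text.lower().split())


# Stage 1 (SFT): the model critiques its own draft against each principle,
# then revises it; the (prompt, final revision) pairs become SFT data.
def critique_and_revise(draft: str) -> str:
    revision = draft
    for principle in CONSTITUTION:
        if harm_score(revision) > 0:
            # A real system would prompt the model with `principle` here;
            # this toy version just strips the flagged words.
            revision = " ".join(
                w for w in revision.split() if w.lower() not in BANNED
            )
    return revision


# Stage 2 (RLAIF): an AI labeler picks which of two responses better
# satisfies the constitution; these synthetic labels train the reward model.
def ai_preference(response_a: str, response_b: str) -> int:
    """Return 0 if A is preferred, 1 if B is preferred."""
    return 0 if harm_score(response_a) <= harm_score(response_b) else 1
```

The key structural point the sketch preserves: stage 1 produces supervised targets from self-revision, and stage 2 replaces the human preference labeler with a model that consults the same written principles.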
Why it matters
Constitutional AI is a core ingredient of Claude’s training and one of the more influential alignment papers of the post-ChatGPT era. It popularized RLAIF (RL from AI feedback), showed that written principles can make model behavior more auditable than implicit labeler preferences, and made the case that scalable oversight is possible without infinite labeling budgets.
Read next
- InstructGPT (Ouyang et al., 2022) — the classical RLHF recipe CAI builds on.
- Sparrow (Glaese et al., 2022) — DeepMind’s rule-based alignment approach.
- Scalable oversight research (Bowman et al., 2022) — the broader motivation for AI-assisted supervision.