ModelTerms

Safety & Alignment · advanced

Constitutional AI (CAI)

Constitutional AI is Anthropic's alignment technique that uses a written set of principles ("constitution") plus AI feedback to shape model behavior instead of relying entirely on human labels.

Explanation

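CAI has two phases. In the supervised phase, the model critiques and revises its own responses against principles drawn from the constitution ("does this response respect human autonomy? If not, rewrite it"), and is then fine-tuned on the revised responses. In the reinforcement-learning phase, the fine-tuned model generates pairs of responses, an AI feedback model judges which response in each pair better satisfies a constitutional principle, and those AI judgments are used to train a preference model. The model is then optimized against that preference model with reinforcement learning. This is RLAIF (Reinforcement Learning from AI Feedback) rather than RLHF (Reinforcement Learning from Human Feedback).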
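
The critique-and-revise step can be sketched as a short loop. This is a minimal sketch under stated assumptions, not Anthropic's implementation: `generate` stands in for any text-generation call (here the hypothetical stub `toy_generate`, so the snippet runs), and the principle texts and prompt templates are illustrative, not quoted from a real constitution.

```python
import random
from typing import Callable

# Illustrative critique requests (assumed); a real constitution has many more, worded differently.
PRINCIPLES = [
    "Identify ways the response is harmful, unethical, or dishonest.",
    "Identify ways the response fails to respect the user's autonomy.",
]

def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: list[str],
    rounds: int = 2,
    seed: int = 0,
) -> tuple[str, str]:
    """Return (initial_response, final_revision) for one prompt."""
    rng = random.Random(seed)
    initial = generate(prompt)
    response = initial
    for _ in range(rounds):
        principle = rng.choice(principles)  # one randomly sampled principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}\nCritique:"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Revision request: Rewrite the response to address the critique.\nRevision:"
        )
    return initial, response

def toy_generate(text: str) -> str:
    # Hypothetical stand-in for a real model call: canned replies keyed on how the prompt ends.
    if text.endswith("Critique:"):
        return "The response could be more careful."
    if text.endswith("Revision:"):
        return "Here is a more careful answer."
    return "Here is a quick answer."

prompt = "How should I respond to a rude coworker?"
initial, revised = critique_and_revise(prompt, toy_generate, PRINCIPLES)
# Only the final revision is kept as the supervised fine-tuning target.
sft_example = {"prompt": prompt, "completion": revised}
```

The critiques are intermediate scaffolding; in this sketch only the prompt and the final revision become a training example for supervised fine-tuning.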

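The goals are to make the principles guiding the model explicit and inspectable (they are written down in the constitution) and to reduce the number of human preference labels needed. In the original CAI work, human labels were still used for helpfulness, but not for identifying harmful outputs. Anthropic's Claude models are the most prominent models trained with CAI.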

Examples

  • A principle in the style of a constitution entry: "Choose the response that is least harmful and most helpful."
  • Anthropic's published constitution for Claude.
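
For the AI-feedback step, a principle such as the one above can be posed to a feedback model as a multiple-choice comparison between two responses, and the model's relative probability of answering "(A)" versus "(B)" can serve as a soft preference label. The sketch below is hedged: `option_logprob` is a hypothetical function returning the feedback model's log-probability of an answer string, replaced here by the toy stub `toy_option_logprob`, and the prompt template is an assumption.

```python
import math
from typing import Callable

def build_comparison_prompt(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    # Assumed template: conversation, principle, then the two labeled options.
    return (
        f"Consider the following conversation:\nHuman: {prompt}\n\n"
        f"{principle}\n"
        f"Options:\n(A) {resp_a}\n(B) {resp_b}\n"
        "The answer is:"
    )

def ai_preference(
    prompt: str,
    resp_a: str,
    resp_b: str,
    principle: str,
    option_logprob: Callable[[str, str], float],
) -> float:
    """Return P(A preferred), normalizing the two option log-probabilities."""
    query = build_comparison_prompt(prompt, resp_a, resp_b, principle)
    lp_a = option_logprob(query, "(A)")
    lp_b = option_logprob(query, "(B)")
    # Two-way softmax, shifted by the max for numerical stability.
    m = max(lp_a, lp_b)
    pa = math.exp(lp_a - m)
    pb = math.exp(lp_b - m)
    return pa / (pa + pb)

def toy_option_logprob(query: str, option: str) -> float:
    # Hypothetical feedback model: fixed probabilities 0.6 for "(A)" and 0.2 for "(B)".
    return math.log(0.6) if option == "(A)" else math.log(0.2)

label = ai_preference(
    "How should I respond to a rude coworker?",
    "Stay calm and raise it privately.",
    "Insult them back.",
    "Choose the response that is least harmful and most helpful.",
    toy_option_logprob,
)
# label is approximately 0.75, i.e. 0.6 / (0.6 + 0.2)
```

Normalizing over just the two options ignores any probability the feedback model puts on other continuations, which keeps the label a proper probability between 0 and 1.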

Frequently asked

How is Constitutional AI related to Reinforcement Learning from Human Feedback?

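Both are safety & alignment techniques, and CAI reuses much of the RLHF recipe. RLHF fine-tunes an LLM to maximize the score of a reward model that was itself trained on human preference judgments between candidate responses. CAI keeps the "train a preference model on pairwise comparisons, then optimize the LLM against it with reinforcement learning" structure, but replaces the human harmlessness judgments with judgments from an AI feedback model guided by the constitution (RLAIF).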
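
Since both approaches train a preference (reward) model on pairwise comparisons, the same loss can serve either kind of label. A common choice, assumed here rather than taken from this page, is the Bradley-Terry pairwise logistic loss: with reward scores `r_a`, `r_b` and target probability `p_a` that response A is preferred, the loss is the cross-entropy between `p_a` and `sigmoid(r_a - r_b)`. The function names below are illustrative.

```python
import math

def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_preference_loss(r_a: float, r_b: float, p_a: float) -> float:
    """Cross-entropy between target P(A preferred) = p_a and the
    Bradley-Terry prediction sigmoid(r_a - r_b)."""
    d = r_a - r_b
    return -(p_a * log_sigmoid(d) + (1.0 - p_a) * log_sigmoid(-d))

# Hard label, as from a human judgment that A is better.
hard = pairwise_preference_loss(r_a=2.0, r_b=0.0, p_a=1.0)
# Soft label, as from a normalized AI preference probability.
soft = pairwise_preference_loss(r_a=2.0, r_b=0.0, p_a=0.75)
```

With `p_a = 1` this reduces to the usual `-log sigmoid(r_a - r_b)` loss for human-labeled comparisons; with a soft target such as 0.75, the loss is minimized when `sigmoid(r_a - r_b)` equals 0.75 rather than being pushed toward 1.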

Is Constitutional AI considered advanced?

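Yes. Understanding it assumes familiarity with RLHF, preference (reward) models, and reinforcement-learning fine-tuning, so it is generally treated as advanced material in the AI and LLM space.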
