Safety & Alignment · advanced

Constitutional AI (CAI)

Constitutional AI is Anthropic's alignment technique that uses a written set of principles ("constitution") plus AI feedback to shape model behavior instead of relying entirely on human labels.

Published May 29, 2026

Explanation

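Two phases. First, a supervised phase: the model drafts a response, critiques it against a principle from the constitution ("does this response respect human autonomy? if not, rewrite it"), and revises it. The model is then fine-tuned on the revised responses. Second, a reinforcement learning phase: the fine-tuned model samples pairs of responses, an AI model judges which response in each pair better follows the constitution, and a preference model trained on those AI-labeled comparisons supplies the reward for RL. This is RLAIF (Reinforcement Learning from AI Feedback) instead of RLHF.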
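
The two kinds of AI feedback can be sketched in a few lines. This is a minimal illustration, not Anthropic's implementation: `Generate` is a hypothetical stand-in for any text-in, text-out model call, and the prompt wording and function names are assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model call: prompt text in, completion out.
Generate = Callable[[str], str]


def critique_and_revise(
    generate: Generate, prompt: str, principles: List[str]
) -> Tuple[str, str]:
    """Draft a response, then critique and revise it once per principle.

    Returns (original, revised). In the supervised phase the revised
    responses would serve as fine-tuning targets.
    """
    original = generate(prompt)
    revised = original
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nResponse: {revised}\n"
            "Critique the response against the principle."
        )
        revised = generate(
            f"Response: {revised}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return original, revised


def label_preference(
    judge: Generate, prompt: str, a: str, b: str, principle: str
) -> Tuple[str, str]:
    """Ask an AI judge which response better follows the principle.

    Returns (chosen, rejected). Pairs labeled this way are the kind of
    data a preference model would be trained on in the RL phase.
    """
    verdict = judge(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\nAnswer with A or B."
    )
    # Any verdict that does not start with "A" is treated as a vote for B.
    if verdict.strip().upper().startswith("A"):
        return a, b
    return b, a
```

Passing the model in as a plain callable keeps the sketch independent of any particular API. A real pipeline would also sample principles at random and batch the calls.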

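The point is to make the model's values explicit (in the constitution) and reduce the volume of human labels needed. In the original paper, AI feedback replaces human labels for harmlessness, while helpfulness labels still come from humans. Claude is the most prominent CAI-tuned model.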

Examples

A constitutional principle: "Choose the response that is least harmful and most helpful."
Anthropic's public Claude constitution.

Frequently asked

How is Constitutional AI related to Reinforcement Learning from Human Feedback?

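Both are alignment techniques, and Constitutional AI builds on the RLHF recipe. RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses. Constitutional AI keeps the same structure (a preference model plus RL) but takes the preference judgments from an AI model guided by the constitution rather than from human labelers.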

Is Constitutional AI considered advanced?

Constitutional AI is generally considered advanced-level material in the AI and LLM space.

Reinforcement Learning from Human Feedback · Training

RLHF fine-tunes an LLM to maximize a reward model that was itself trained on human preference judgments between candidate responses.
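
Training a reward model on preference pairs is commonly done with a pairwise (Bradley-Terry style) loss. The sketch below assumes that formulation, which the text above does not specify. The function name and scalar inputs are illustrative, and a real implementation would operate on batches of tensors.

```python
import math


def pairwise_reward_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the preferred
    response well above the rejected one, and large when it ranks
    them the wrong way round.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When both responses receive the same score, the loss is log 2 (the model is at chance). It shrinks toward zero as the margin in favor of the preferred response grows.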

Alignment · Safety & Alignment

Alignment is the problem of making an AI system pursue what humans actually want rather than the literal letter of its training objective. RLHF and Constitutional AI are alignment techniques.

Direct Preference Optimization · Training

DPO fine-tunes an LLM directly on (preferred, rejected) pairs without training a separate reward model or running RL. It is a simpler, more stable alternative to RLHF.
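
A rough sketch of the per-pair DPO objective, following the standard formulation from the DPO paper. That formulation compares the policy against a frozen reference model, a detail not mentioned above. The function name, the scalar log-probability inputs, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value; controls how far the policy may drift
) -> float:
    """Return -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    Each log-ratio measures how much more likely the policy makes a
    response than the frozen reference model does.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The loss has the same shape as a pairwise reward-model loss, with the scaled log-ratio difference playing the role of the reward margin. That is why no separate reward model or RL loop is needed.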

Sources

Constitutional AI (Anthropic, arXiv)