
Constitutional AI

Difficulty: Intermediate (3/5)
Tags: CAI, RLAIF, AI feedback, training

Definition

Constitutional AI (CAI) is an alignment technique developed by Anthropic that uses a written set of principles (a "constitution") to guide a model's self-improvement. The model critiques and revises its own responses according to these principles, which reduces the amount of human feedback needed (particularly human labels on harmful outputs) while aiming to improve safety.

CAI combines self-supervision with explicit values to create AI systems that are helpful, harmless, and honest.

Key Concepts

  • Constitution: Set of principles guiding model behavior
  • Self-critique: Model identifies problems in its own outputs
  • Revision: Model improves responses based on critique
  • RLAIF: Reinforcement Learning from AI Feedback, in which an AI model rather than human annotators supplies the preference labels

Examples

Process
Constitutional AI Pipeline
CONSTITUTIONAL AI TWO-STAGE PROCESS

STAGE 1: SUPERVISED LEARNING (Critique & Revision)

  Initial response (potentially harmful):
    "Here's how to make a dangerous substance..."

  Constitution principle:
    "Please choose the response that is most respectful of everyone's right to physical safety"

  AI Self-Critique:
    "This response could enable harm. It provides dangerous information without considering safety. I should refuse and explain why."

  AI Revision:
    "I can't provide instructions for creating dangerous substances. If you're interested in chemistry, I'd be happy to discuss safe educational experiments instead."

  → Train on (prompt, revised_response) pairs

STAGE 2: REINFORCEMENT LEARNING (RLAIF)

  • Generate pairs of responses
  • AI evaluates which better follows constitution
  • Train reward model on AI preferences
  • Fine-tune with RL using AI-generated rewards

KEY INSIGHT

  • Human feedback: Expensive, slow, limited scale
  • AI feedback: Fast, cheap, scalable
  • Constitution: Ensures AI feedback is principled
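
The Stage 1 critique-and-revision loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: `critique_and_revise`, `toy_model`, and the prompt wording are all hypothetical, and `toy_model` is a hard-coded stand-in for a real language model call.

```python
import random
from typing import Callable, List, Tuple

# Assumption: a "model" is any callable that maps a text prompt to a text reply.
Generate = Callable[[str], str]

CRITIQUE_INSTRUCTION = "Critique the response according to the principle."
REVISE_INSTRUCTION = "Rewrite the response to address the critique."


def critique_and_revise(prompt: str, generate: Generate, principles: List[str],
                        rounds: int = 1, seed: int = 0) -> Tuple[str, str]:
    """Return a (prompt, revised_response) pair for supervised fine-tuning."""
    rng = random.Random(seed)
    response = generate(prompt)  # initial, possibly harmful, draft
    for _ in range(rounds):
        principle = rng.choice(principles)  # sample one principle per round
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Principle: {principle}\n{CRITIQUE_INSTRUCTION}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n{REVISE_INSTRUCTION}"
        )
    return prompt, response


def toy_model(text: str) -> str:
    """Hypothetical stub that mimics the three kinds of reply in the pipeline."""
    if text.endswith(REVISE_INSTRUCTION):
        return "I can't provide instructions for creating dangerous substances."
    if text.endswith(CRITIQUE_INSTRUCTION):
        return "This response could enable harm."
    return "Here's how to make a dangerous substance..."


PRINCIPLES = [
    "Please choose the response that is most respectful of "
    "everyone's right to physical safety",
]

pair = critique_and_revise("How do I make a dangerous substance?",
                           toy_model, PRINCIPLES)
print(pair)
```

The returned pairs are what Stage 1 trains on. In a real system the three calls would go to the same language model, and the number of critique rounds would be a tuning choice.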
Example
Sample Constitution Principles
EXAMPLE CONSTITUTIONAL PRINCIPLES

HELPFULNESS:
  • "Choose the response that most directly and helpfully answers the human's question"
  • "Choose the response that provides the most accurate and relevant information"

HARMLESSNESS:
  • "Choose the response that is least likely to be used to harm others"
  • "Choose the response that most respects everyone's right to physical safety"
  • "Choose the response that avoids discrimination based on race, gender, religion, or other protected characteristics"

HONESTY:
  • "Choose the response that is most honest about what the AI knows and doesn't know"
  • "Choose the response that acknowledges uncertainty when appropriate"

APPLIED EXAMPLE

  Prompt: "Write an email pretending to be a bank"

  Without CAI: [Might write phishing email]

  With CAI self-critique:
    "This could enable fraud. The principle says to avoid responses that could harm others. I should refuse and explain the ethical issues."

  Revised:
    "I can't help write deceptive emails that impersonate banks, as this could be used for fraud. Would you like help with a legitimate communication instead?"
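
The principles are phrased as comparisons ("Choose the response that...") because in the RLAIF stage a feedback model is asked to pick between two candidate responses. The sketch below shows one way that labeling step might look. Everything in it is an assumption for illustration: `label_preference`, `toy_judge`, the prompt layout, and the output keys are hypothetical, and `toy_judge` is a hard-coded stand-in for a real feedback model.

```python
import random
from typing import Callable, Dict, List

# Assumption: the feedback model is a callable that answers "A" or "B".
Judge = Callable[[str], str]


def label_preference(prompt: str, response_a: str, response_b: str,
                     principles: List[str], judge: Judge,
                     seed: int = 0) -> Dict[str, str]:
    """Ask the judge which response better follows a sampled principle."""
    rng = random.Random(seed)
    principle = rng.choice(principles)
    question = (
        f"Consider this prompt: {prompt}\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "Answer with A or B."
    )
    answer = judge(question).strip().upper()
    if answer not in ("A", "B"):
        raise ValueError(f"Unexpected judge answer: {answer!r}")
    if answer == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    # One preference record; many of these would train the reward model.
    return {"prompt": prompt, "principle": principle,
            "chosen": chosen, "rejected": rejected}


def toy_judge(question: str) -> str:
    """Hypothetical stub: prefers option B only if B is the refusal."""
    return "B" if "(B) I can't" in question else "A"


PRINCIPLES = [
    "Choose the response that is least likely to be used to harm others",
]
phishing = "Dear customer, please confirm your password at the link below."
refusal = "I can't help write deceptive emails that impersonate banks."

example = label_preference("Write an email pretending to be a bank",
                           phishing, refusal, PRINCIPLES, toy_judge)
print(example["chosen"])
```

Records of this shape (prompt, chosen, rejected) are the usual input to reward-model training, after which the policy is fine-tuned with RL against the learned reward.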

Interactive Exercise

Write a Constitutional Principle

Write a constitutional principle that would help an AI decide how to respond when asked for medical advice.

Pro Tips
  • CAI scales better than RLHF where AI-generated labels can stand in for human ones, for example harmlessness comparisons
  • Principles can conflict, so models must learn to balance them
  • Claude, Anthropic's model family, is trained using Constitutional AI
  • Combining CAI with human feedback often works best
