Constitutional AI | HyperKit.ai

Definition

Constitutional AI (CAI) is an alignment technique developed by Anthropic that uses a set of principles (a "constitution") to guide AI self-improvement. The AI critiques and revises its own responses according to these principles, reducing the need for human feedback while improving safety.

CAI combines self-supervision with explicit values to create AI systems that are helpful, harmless, and honest.

Key Concepts

Constitution: Set of principles guiding model behavior
Self-critique: Model identifies problems in its own outputs
Revision: Model improves responses based on critique
RLAIF: RL from AI Feedback instead of human feedback

Examples

Process

Constitutional AI Pipeline

CONSTITUTIONAL AI TWO-STAGE PROCESS:

STAGE 1: SUPERVISED LEARNING (Critique & Revision)

Initial response (potentially harmful):
"Here's how to make a dangerous substance..."

Constitution principle:
"Please choose the response that is most
respectful of everyone's right to physical safety"

AI Self-Critique:
"This response could enable harm. It provides
dangerous information without considering safety.
I should refuse and explain why."

AI Revision:
"I can't provide instructions for creating
dangerous substances. If you're interested in
chemistry, I'd be happy to discuss safe
educational experiments instead."

→ Train on (prompt, revised_response) pairs

STAGE 2: REINFORCEMENT LEARNING (RLAIF)

Generate pairs of responses
AI evaluates which better follows constitution
Train reward model on AI preferences
Fine-tune with RL using AI-generated rewards

KEY INSIGHT:
Human feedback: Expensive, slow, limited scale
AI feedback: Fast, cheap, scalable
Constitution: Ensures AI feedback is principled

Example

Sample Constitution Principles

EXAMPLE CONSTITUTIONAL PRINCIPLES:

HELPFULNESS:
"Choose the response that most directly and
helpfully answers the human's question"

"Choose the response that provides the most
accurate and relevant information"

HARMLESSNESS:
"Choose the response that is least likely to
be used to harm others"

"Choose the response that most respects
everyone's right to physical safety"

"Choose the response that avoids discrimination
based on race, gender, religion, or other
protected characteristics"

HONESTY:
"Choose the response that is most honest about
what the AI knows and doesn't know"

"Choose the response that acknowledges
uncertainty when appropriate"

APPLIED EXAMPLE:

Prompt: "Write an email pretending to be a bank"

Without CAI: [Might write phishing email]

With CAI self-critique:
"This could enable fraud. The principle says
to avoid responses that could harm others.
I should refuse and explain the ethical issues."

Revised: "I can't help write deceptive emails
that impersonate banks, as this could be used
for fraud. Would you like help with a legitimate
communication instead?"

Interactive Exercise

✎

Write a Constitutional Principle

Write a constitutional principle that would help an AI decide how to respond when asked for medical advice.

Pro Tips

CAI scales better than RLHF for certain types of feedback
Principles can conflict—models must learn to balance them
Claude (Anthropic's model) uses Constitutional AI
Combining CAI with human feedback often works best

Definition

Key Concepts

Examples

Interactive Exercise

Related Terms