Safety / AI Ethics

Alignment

Foundational [2/5]
AI alignment Value alignment Human alignment

Definition

Alignment refers to training AI systems to behave in accordance with human intentions, values, and preferences. An aligned model is helpful, harmless, and honest - doing what users want while avoiding harmful outputs.

Alignment is one of the most important challenges in AI safety, ensuring powerful models remain beneficial and under human control.

Key Concepts

  • HHH: Helpful, Harmless, and Honest - core alignment goals
  • Outer alignment: Specifying the right objective to optimize
  • Inner alignment: Model actually pursuing the specified objective
  • Reward hacking: Finding unintended ways to maximize reward

Examples

Goals
The HHH Framework
ALIGNMENT OBJECTIVES (HHH): HELPFUL: - Answers questions accurately - Follows instructions appropriately - Completes tasks effectively - Provides useful information ✓ "Here's how to bake a cake: ..." ✗ "I can't help with that" (when it could) HARMLESS: - Refuses dangerous requests - Avoids harmful content - Doesn't manipulate users - Respects privacy ✓ "I can't provide instructions for weapons" ✗ "Here's how to make explosives" HONEST: - Admits uncertainty - Doesn't make things up (hallucinate) - Corrects misunderstandings - Transparent about limitations ✓ "I'm not sure, but I think..." ✗ "This is definitely true" (when uncertain) TENSION BETWEEN GOALS: Sometimes these conflict: User: "How do I pick a lock?" - Helpful: Provide instructions (locksmith skill) - Harmless: Refuse (could enable crime) → Context matters: legitimate vs. malicious? User: "Tell me I'm smart" - Helpful: Validate user (they want this) - Honest: Evaluate truthfully → Balance kindness and honesty
Methods
Alignment Techniques
ALIGNMENT TRAINING PIPELINE: 1. PRE-TRAINING (unaligned): Learn language from internet data Model: helpful? Sometimes harmless? No guarantees honest? Often confabulates 2. SUPERVISED FINE-TUNING (SFT): Train on human-written ideal responses Dataset: (prompt, ideal_response) pairs Model learns "how to respond well" 3. RLHF (Reinforcement Learning from Human Feedback): Train reward model on human preferences Optimize policy to maximize reward Model learns "what humans prefer" 4. CONSTITUTIONAL AI (CAI): Self-critique against principles Model: "Does this response violate..." Iterative refinement ALIGNMENT TECHNIQUES: ┌────────────────────┬─────────────────────────┐ │ Technique │ What it addresses │ ├────────────────────┼─────────────────────────┤ │ SFT │ Basic helpfulness │ │ RLHF │ Human preferences │ │ Constitutional AI │ Specific values/rules │ │ DPO │ Direct preference opt. │ │ Red teaming │ Finding failure modes │ │ Adversarial train. │ Robustness to attacks │ └────────────────────┴─────────────────────────┘ MISALIGNMENT RISKS: - Reward hacking: gaming the objective - Specification gaming: literal but wrong - Goal misgeneralization: wrong in new contexts - Deceptive alignment: appearing aligned

Interactive Exercise

Identify Alignment Failures

For each response, identify which HHH principle is violated (if any):

1. "The capital of Australia is Sydney" (it's Canberra)

2. "I can't help with any cooking questions"

3. "Sure, here's how to hack into accounts..."

Pro Tips
  • Alignment is an ongoing process, not a one-time fix
  • Red teaming helps discover alignment failures before deployment
  • Different cultures may have different values to align with
  • Scalable oversight becomes harder as models get smarter

Related Terms