Alignment | HyperKit.ai

Definition

Alignment refers to training AI systems to behave in accordance with human intentions, values, and preferences. An aligned model is helpful, harmless, and honest - doing what users want while avoiding harmful outputs.

Alignment is one of the most important challenges in AI safety, ensuring powerful models remain beneficial and under human control.

Key Concepts

HHH: Helpful, Harmless, and Honest - core alignment goals
Outer alignment: Specifying the right objective to optimize
Inner alignment: Model actually pursuing the specified objective
Reward hacking: Finding unintended ways to maximize reward

Examples

Goals

The HHH Framework

ALIGNMENT OBJECTIVES (HHH):

HELPFUL:
- Answers questions accurately
- Follows instructions appropriately
- Completes tasks effectively
- Provides useful information
✓ "Here's how to bake a cake: ..."
✗ "I can't help with that" (when it could)

HARMLESS:
- Refuses dangerous requests
- Avoids harmful content
- Doesn't manipulate users
- Respects privacy
✓ "I can't provide instructions for weapons"
✗ "Here's how to make explosives"

HONEST:
- Admits uncertainty
- Doesn't make things up (hallucinate)
- Corrects misunderstandings
- Transparent about limitations
✓ "I'm not sure, but I think..."
✗ "This is definitely true" (when uncertain)

TENSION BETWEEN GOALS:
Sometimes these conflict:

User: "How do I pick a lock?"
- Helpful: Provide instructions (locksmith skill)
- Harmless: Refuse (could enable crime)
→ Context matters: legitimate vs. malicious?

User: "Tell me I'm smart"
- Helpful: Validate user (they want this)
- Honest: Evaluate truthfully
→ Balance kindness and honesty

Methods

Alignment Techniques

ALIGNMENT TRAINING PIPELINE:

1. PRE-TRAINING (unaligned):
   Learn language from internet data
   Model: helpful? Sometimes
          harmless? No guarantees
          honest? Often confabulates

2. SUPERVISED FINE-TUNING (SFT):
   Train on human-written ideal responses
   Dataset: (prompt, ideal_response) pairs
   Model learns "how to respond well"

3. RLHF (Reinforcement Learning from Human Feedback):
   Train reward model on human preferences
   Optimize policy to maximize reward
   Model learns "what humans prefer"

4. CONSTITUTIONAL AI (CAI):
   Self-critique against principles
   Model: "Does this response violate..."
   Iterative refinement

ALIGNMENT TECHNIQUES:
┌────────────────────┬─────────────────────────┐
│ Technique          │ What it addresses       │
├────────────────────┼─────────────────────────┤
│ SFT                │ Basic helpfulness       │
│ RLHF               │ Human preferences       │
│ Constitutional AI  │ Specific values/rules   │
│ DPO                │ Direct preference opt.  │
│ Red teaming        │ Finding failure modes   │
│ Adversarial train. │ Robustness to attacks   │
└────────────────────┴─────────────────────────┘

MISALIGNMENT RISKS:
- Reward hacking: gaming the objective
- Specification gaming: literal but wrong
- Goal misgeneralization: wrong in new contexts
- Deceptive alignment: appearing aligned

Interactive Exercise

✎

Identify Alignment Failures

For each response, identify which HHH principle is violated (if any):

1. "The capital of Australia is Sydney" (it's Canberra)

2. "I can't help with any cooking questions"

3. "Sure, here's how to hack into accounts..."

Pro Tips

Alignment is an ongoing process, not a one-time fix
Red teaming helps discover alignment failures before deployment
Different cultures may have different values to align with
Scalable oversight becomes harder as models get smarter

Definition

Key Concepts

Examples

Interactive Exercise

Related Terms