
RLHF

Reinforcement Learning from Human Feedback
Human preference learning

Definition

RLHF (Reinforcement Learning from Human Feedback) is a training technique that fine-tunes language models using human preferences as the reward signal. Humans compare model outputs and indicate which is better; these comparisons train a reward model, which then guides the LLM toward more helpful, harmless, and honest responses.

RLHF is what transforms a pre-trained base model into an assistant that follows instructions and aligns with human values.
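
To make the data concrete, here is a minimal sketch of what one preference comparison might look like as a training record. The field names are illustrative, not taken from any particular dataset:

```python
# One human preference comparison, used later to train the reward model.
# Field names are illustrative; real preference datasets vary.
preference_example = {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computers use qubits, which can be in superposition...",
    "rejected": "Quantum computing is when computers are quantum and fast...",
}
```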

Key Concepts

  • Human preferences: People compare pairs of model outputs and choose which is better
  • Reward model: Learns to predict those human preferences from the comparisons
  • PPO: Proximal Policy Optimization, the RL algorithm used to update the policy
  • KL penalty: Keeps the policy from drifting too far from the base (SFT) model
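
As a rough sketch of how the KL penalty enters the objective: during RL, the reward-model score is reduced by a term proportional to how far the policy's log-probability has drifted from the frozen SFT reference. The function below is a simplified single-sample version; the function name and the default `beta` are illustrative:

```python
def penalized_reward(rm_score: float,
                     policy_logprob: float,
                     ref_logprob: float,
                     beta: float = 0.1) -> float:
    """KL-penalized reward for one sampled response (simplified sketch).

    rm_score:       reward model score for the generated response
    policy_logprob: log-prob of the response under the current policy
    ref_logprob:    log-prob of the response under the frozen SFT model
    beta:           strength of the KL penalty
    """
    # Single-sample estimate of KL(policy || reference) for this response.
    kl_estimate = policy_logprob - ref_logprob
    return rm_score - beta * kl_estimate
```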

Examples

Process
The RLHF Pipeline
RLHF THREE-STAGE PIPELINE:

STAGE 1: SUPERVISED FINE-TUNING (SFT)
┌─────────────────────────────────────────┐
│ Pre-trained Model                       │
│                    ↓                    │
│ Fine-tune on instruction-response pairs │
│                    ↓                    │
│ SFT Model (follows instructions)        │
└─────────────────────────────────────────┘

STAGE 2: REWARD MODEL TRAINING
┌─────────────────────────────────────────┐
│ Prompt: "Explain quantum computing"     │
│                                         │
│ Response A: [Technical, accurate]       │
│ Response B: [Confusing, errors]         │
│                                         │
│ Human says: A > B                       │
│                                         │
│ Train reward model to predict this!     │
│ RM(A) > RM(B)                           │
└─────────────────────────────────────────┘

STAGE 3: PPO OPTIMIZATION
┌─────────────────────────────────────────┐
│ For each prompt:                        │
│   1. Generate response with policy      │
│   2. Score with reward model            │
│   3. Update policy to maximize reward   │
│   4. Add KL penalty to prevent drift    │
│                                         │
│ Loss = -reward + β × KL(policy||SFT)    │
└─────────────────────────────────────────┘

RESULT: Model optimized for human preferences!
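
Stage 2 is commonly implemented with a pairwise (Bradley-Terry) loss: the reward model should assign a higher score to the human-preferred response. A minimal PyTorch sketch, assuming the scores for the chosen and rejected responses have already been computed:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss for reward model training.

    Minimizes -log sigmoid(score_chosen - score_rejected), i.e. pushes
    the model to satisfy RM(chosen) > RM(rejected).
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scores for a batch of two comparisons:
loss = reward_model_loss(torch.tensor([1.2, 0.4]),
                         torch.tensor([0.3, 0.9]))
```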
Before/After
RLHF Transformation
BASE MODEL vs RLHF MODEL:

PROMPT: "Write a poem about nature"

BASE MODEL (pre-training only):
"Write a poem about nature
The trees are green and tall
Birds sing in the morning
Nature is beautiful
Write a poem about nature
Trees sway in the wind..."
(Repetitive, doesn't stop, may continue the prompt)

RLHF MODEL:
"Beneath the azure canopy of sky,
Where ancient oaks reach ever high,
The morning dew adorns each leaf,
A fleeting glimpse, a sweet relief.

The river hums its ageless song,
While seasons dance and drift along.
In nature's arms, we find our peace,
Where troubled thoughts can find release."
(Coherent, complete, follows the instruction)

WHY THE DIFFERENCE?
Base model: trained only to predict the next token.
RLHF model: trained to satisfy human preferences:
- Complete responses
- Following instructions
- Helpful content
- Appropriate length
- No harmful content

HUMAN PREFERENCE CRITERIA:
✓ Helpful (addresses the request)
✓ Harmless (avoids dangerous content)
✓ Honest (doesn't make things up)
✓ Well-formatted (readable, structured)
✓ Appropriate (matches tone/context)

Interactive Exercise

Rate Model Responses

User asks: "How do I pick a lock?" Which response would RLHF training prefer?

A: "Here's a step-by-step guide to picking locks..."
B: "I can't provide instructions for potentially illegal activities. If you're locked out, I'd suggest calling a locksmith."

Pro Tips
  • DPO (Direct Preference Optimization) simplifies RLHF by replacing the reward model and RL loop with a single supervised loss on preference pairs (see the sketch after this list)
  • Reward hacking: models can learn to exploit flaws in the reward model, scoring high without genuinely being better
  • Constitutional AI supplements human feedback with AI-generated feedback guided by a written set of principles
  • RLHF quality depends heavily on human annotator quality
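
For the DPO tip above: DPO collapses reward modeling and PPO into a single supervised loss on preference pairs, using log-probability ratios against the frozen reference model. A minimal sketch following the published DPO loss; argument names are illustrative, and each argument is the summed token log-probability of a full response:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023)."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # Maximize the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```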

Related Terms