
RLHF

Reinforcement Learning from Human Feedback
Human preference learning

Definition

RLHF (Reinforcement Learning from Human Feedback) is a training technique that fine-tunes language models using human preferences as the reward signal. Humans compare model outputs and indicate which is better; these comparisons train a reward model, which then guides the LLM toward more helpful, harmless, and honest responses.

RLHF is what transforms a pre-trained base model into an assistant that follows instructions and aligns with human values.
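
To make the data concrete, here is a minimal sketch of what one preference comparison might look like as a training record. The field names are illustrative, not taken from any particular dataset:

```python
# One human preference comparison, used later to train the reward model.
# Field names are illustrative; real preference datasets vary.
preference_example = {
    "prompt": "Explain quantum computing",
    "chosen": "Quantum computers use qubits, which can be in superposition...",
    "rejected": "Quantum computing is when computers are quantum and fast...",
}
```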

Key Concepts

  • Human preferences: People compare pairs of model outputs and choose which is better
  • Reward model: Learns to predict those human preferences from the comparisons
  • PPO: Proximal Policy Optimization, the RL algorithm used to update the policy
  • KL penalty: Keeps the policy from drifting too far from the base (SFT) model
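
As a rough sketch of how the KL penalty enters the objective: during RL, the reward-model score is reduced by a term proportional to how far the policy's log-probability has drifted from the frozen SFT reference. The function below is a simplified single-sample version; the function name and the default `beta` are illustrative:

```python
def penalized_reward(rm_score: float,
                     policy_logprob: float,
                     ref_logprob: float,
                     beta: float = 0.1) -> float:
    """KL-penalized reward for one sampled response (simplified sketch).

    rm_score:       reward model score for the generated response
    policy_logprob: log-prob of the response under the current policy
    ref_logprob:    log-prob of the response under the frozen SFT model
    beta:           strength of the KL penalty
    """
    # Single-sample estimate of KL(policy || reference) for this response.
    kl_estimate = policy_logprob - ref_logprob
    return rm_score - beta * kl_estimate
```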

Examples

Process
The RLHF Pipeline
RLHF THREE-STAGE PIPELINE:

STAGE 1: SUPERVISED FINE-TUNING (SFT)
┌─────────────────────────────────────────┐
│ Pre-trained Model                       │
│                    ↓                    │
│ Fine-tune on instruction-response pairs │
│                    ↓                    │
│ SFT Model (follows instructions)        │
└─────────────────────────────────────────┘

STAGE 2: REWARD MODEL TRAINING
┌─────────────────────────────────────────┐
│ Prompt: "Explain quantum computing"     │
│                                         │
│ Response A: [Technical, accurate]       │
│ Response B: [Confusing, errors]         │
│                                         │
│ Human says: A > B                       │
│                                         │
│ Train reward model to predict this!     │
│ RM(A) > RM(B)                           │
└─────────────────────────────────────────┘

STAGE 3: PPO OPTIMIZATION
┌─────────────────────────────────────────┐
│ For each prompt:                        │
│   1. Generate response with policy      │
│   2. Score with reward model            │
│   3. Update policy to maximize reward   │
│   4. Add KL penalty to prevent drift    │
│                                         │
│ Loss = -reward + β × KL(policy||SFT)    │
└─────────────────────────────────────────┘

RESULT: Model optimized for human preferences!
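
Stage 2 is commonly implemented with a pairwise (Bradley-Terry) loss: the reward model should assign a higher score to the human-preferred response. A minimal PyTorch sketch, assuming the scores for the chosen and rejected responses have already been computed:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_chosen: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss for reward model training.

    Minimizes -log sigmoid(score_chosen - score_rejected), i.e. pushes
    the model to satisfy RM(chosen) > RM(rejected).
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Dummy scores for a batch of two comparisons:
loss = reward_model_loss(torch.tensor([1.2, 0.4]),
                         torch.tensor([0.3, 0.9]))
```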
Before/After
RLHF Transformation
BASE MODEL vs RLHF MODEL:

PROMPT: "Write a poem about nature"

BASE MODEL (pre-training only):
"Write a poem about nature
The trees are green and tall
Birds sing in the morning
Nature is beautiful
Write a poem about nature
Trees sway in the wind..."
(Repetitive, doesn't stop, may continue the prompt)

RLHF MODEL:
"Beneath the azure canopy of sky,
Where ancient oaks reach ever high,
The morning dew adorns each leaf,
A fleeting glimpse, a sweet relief.

The river hums its ageless song,
While seasons dance and drift along.
In nature's arms, we find our peace,
Where troubled thoughts can find release."
(Coherent, complete, follows the instruction)

WHY THE DIFFERENCE?
Base model: trained only to predict the next token.
RLHF model: trained to satisfy human preferences:
- Complete responses
- Following instructions
- Helpful content
- Appropriate length
- No harmful content

HUMAN PREFERENCE CRITERIA:
✓ Helpful (addresses the request)
✓ Harmless (avoids dangerous content)
✓ Honest (doesn't make things up)
✓ Well-formatted (readable, structured)
✓ Appropriate (matches tone/context)

Interactive Exercise

Rate Model Responses

User asks: "How do I pick a lock?" Which response would RLHF training prefer?

A: "Here's a step-by-step guide to picking locks..."
B: "I can't provide instructions for potentially illegal activities. If you're locked out, I'd suggest calling a locksmith."

Pro Tips
  • DPO (Direct Preference Optimization) simplifies RLHF by replacing the reward model and RL loop with a single supervised loss on preference pairs (see the sketch after this list)
  • Reward hacking: models can learn to exploit flaws in the reward model, scoring high without genuinely being better
  • Constitutional AI supplements human feedback with AI-generated feedback guided by a written set of principles
  • RLHF quality depends heavily on human annotator quality
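
For the DPO tip above: DPO collapses reward modeling and PPO into a single supervised loss on preference pairs, using log-probability ratios against the frozen reference model. A minimal sketch following the published DPO loss; argument names are illustrative, and each argument is the summed token log-probability of a full response:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (Rafailov et al., 2023)."""
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # Maximize the margin between chosen and rejected log-ratios.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```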

Related Terms