Model Evaluation / Metrics

Perplexity

Advanced [4/5]
PPL: Model perplexity

Definition

Perplexity measures how "surprised" a language model is by a sequence of text. Lower perplexity means the model predicts the text more confidently and accurately. It's the exponential of the average negative log-likelihood per token.

Intuitively, perplexity represents the effective number of equally likely choices the model is considering at each step.
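This branching-factor intuition can be checked directly: the perplexity of a single predictive distribution is the exponential of its entropy, and a uniform distribution over N choices gives exactly N. A minimal sketch in NumPy (the function name is ours, for illustration):

```python
import numpy as np

def perplexity_of_distribution(probs):
    """Perplexity of one predictive distribution: exp of its entropy."""
    probs = np.asarray(probs)
    entropy = -np.sum(probs * np.log(probs))
    return np.exp(entropy)

# Uniform over 5 tokens → perplexity ≈ 5: "choosing among 5 equal options"
print(perplexity_of_distribution([0.2] * 5))          # ≈ 5.0

# A peaked distribution over the same 5 tokens is far less "perplexed"
print(perplexity_of_distribution([0.9, 0.025, 0.025, 0.025, 0.025]))  # ≈ 1.59
```

Note that the effective number of choices reflects how concentrated the probabilities are, not how many tokens have nonzero probability.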

Key Concepts

  • Lower is better: Lower perplexity = better predictions
  • Intrinsic metric: Measures model quality without downstream tasks
  • Branching factor: PPL of N ≈ choosing between N equally likely options
  • Dataset dependent: Only comparable on same test data

Examples

Intuition
Understanding Perplexity
PERPLEXITY INTUITION:

Sentence: "The cat sat on the ___"

Model A (PPL = 5): effectively considering ~5 likely options
    "mat" (40%), "floor" (25%), "couch" (15%), "bed" (10%), "rug" (10%)
    → Confident, focused predictions

Model B (PPL = 50): effectively considering ~50 likely options
    Many tokens have similar small probabilities
    → Uncertain, scattered predictions

REAL-WORLD BENCHMARKS:

┌──────────────────┬────────────────┐
│ Model            │ PPL (WikiText) │
├──────────────────┼────────────────┤
│ GPT-2 (small)    │ ~30            │
│ GPT-2 (large)    │ ~18            │
│ GPT-3            │ ~10            │
│ GPT-4 class      │ ~5-8           │
└──────────────────┴────────────────┘

Lower = the model is less "perplexed" by the text
Calculation
Computing Perplexity
FORMULA:

    PPL = exp(-1/N × Σᵢ log P(tokenᵢ | context))

EXAMPLE:

Text: "The cat sat"
Tokens: ["The", "cat", "sat"]

Model probabilities:
    P("The" | start)     = 0.1
    P("cat" | "The")     = 0.05
    P("sat" | "The cat") = 0.2

Log probabilities (natural log):
    log(0.1)  = -2.30
    log(0.05) = -3.00
    log(0.2)  = -1.61

Average log prob: (-2.30 - 3.00 - 1.61) / 3 = -2.30

Perplexity: exp(2.30) = 10.0

# Python implementation
import numpy as np

def perplexity(log_probs):
    """Calculate perplexity from per-token log probabilities."""
    avg_neg_log_prob = -np.mean(log_probs)
    return np.exp(avg_neg_log_prob)

log_probs = [-2.30, -3.00, -1.61]
ppl = perplexity(log_probs)  # ≈ 10.0
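In practice, per-token log probabilities come from a model's output logits via a log-softmax. A minimal sketch in plain NumPy; here `logits` and `targets` are hypothetical stand-ins for a real model's next-token scores and the token ids that actually occurred:

```python
import numpy as np

def perplexity_from_logits(logits, targets):
    """Perplexity from next-token logits.

    logits:  (N, V) array, one row of vocabulary scores per position
    targets: (N,) array of the token ids that actually occurred
    """
    # Log-softmax, subtracting the max for numerical stability
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick out the log probability of each observed token
    token_log_probs = log_probs[np.arange(len(targets)), targets]
    return np.exp(-token_log_probs.mean())

# Toy example: 3 positions, vocabulary of 4 tokens.
# All-zero logits mean a uniform distribution, so PPL equals the vocab size.
print(perplexity_from_logits(np.zeros((3, 4)), np.array([0, 1, 2])))  # ≈ 4.0
```

Real evaluation harnesses additionally handle padding masks and long documents (sliding-window evaluation), which this sketch omits.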

Interactive Exercise

Interpret Perplexity

Model A has perplexity 8 on a test set. Model B has perplexity 32. Which model is better and by roughly how much?

Pro Tips
  • Perplexity is only meaningful when comparing models on the same test data
  • Low perplexity doesn't guarantee good generation quality
  • Tokenization affects perplexity—can't compare across tokenizers
  • Use perplexity for model selection, not final evaluation
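The tokenization caveat above can be made concrete: two tokenizers can assign a sentence the same total probability yet report different perplexities, because PPL averages over a different number of tokens. A toy illustration with made-up numbers:

```python
import numpy as np

# Suppose both tokenizers give the sentence the same total log probability...
total_log_prob = -12.0

# ...but tokenizer A splits it into 6 tokens and tokenizer B into 4.
ppl_a = np.exp(-total_log_prob / 6)  # exp(2.0) ≈ 7.39
ppl_b = np.exp(-total_log_prob / 4)  # exp(3.0) ≈ 20.09

# Same text, same total likelihood, very different per-token perplexity
print(ppl_a, ppl_b)
```

This is why published perplexity numbers are only comparable when both the evaluation data and the tokenizer match.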

Related Terms