Model Evaluation / Metrics

Perplexity

Advanced [4/5]
PPL: Model perplexity

Definition

Perplexity measures how "surprised" a language model is by a sequence of text. Lower perplexity means the model predicts the text more confidently and accurately. It's the exponential of the average negative log-likelihood per token.

Intuitively, perplexity represents the effective number of equally likely choices the model is considering at each step.
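This branching-factor intuition can be checked directly: the perplexity of a single predictive distribution is the exponential of its entropy, and a uniform distribution over N choices gives exactly N. A minimal sketch in NumPy (the function name is ours, for illustration):

```python
import numpy as np

def perplexity_of_distribution(probs):
    """Perplexity of one predictive distribution: exp of its entropy."""
    probs = np.asarray(probs)
    entropy = -np.sum(probs * np.log(probs))
    return np.exp(entropy)

# Uniform over 5 tokens → perplexity ≈ 5: "choosing among 5 equal options"
print(perplexity_of_distribution([0.2] * 5))          # ≈ 5.0

# A peaked distribution over the same 5 tokens is far less "perplexed"
print(perplexity_of_distribution([0.9, 0.025, 0.025, 0.025, 0.025]))  # ≈ 1.59
```

Note that the effective number of choices reflects how concentrated the probabilities are, not how many tokens have nonzero probability.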

Key Concepts

  • Lower is better: Lower perplexity = better predictions
  • Intrinsic metric: Measures model quality without downstream tasks
  • Branching factor: PPL of N ≈ choosing between N equally likely options
  • Dataset dependent: Only comparable on same test data

Examples

Intuition
Understanding Perplexity
PERPLEXITY INTUITION:

Sentence: "The cat sat on the ___"

Model A (PPL = 5): effectively considering ~5 likely options
    "mat" (40%), "floor" (25%), "couch" (15%), "bed" (10%), "rug" (10%)
    → Confident, focused predictions

Model B (PPL = 50): effectively considering ~50 likely options
    Many tokens have similar small probabilities
    → Uncertain, scattered predictions

REAL-WORLD BENCHMARKS:

┌──────────────────┬────────────────┐
│ Model            │ PPL (WikiText) │
├──────────────────┼────────────────┤
│ GPT-2 (small)    │ ~30            │
│ GPT-2 (large)    │ ~18            │
│ GPT-3            │ ~10            │
│ GPT-4 class      │ ~5-8           │
└──────────────────┴────────────────┘

Lower = the model is less "perplexed" by the text
Calculation
Computing Perplexity
FORMULA:

    PPL = exp(-1/N × Σᵢ log P(tokenᵢ | context))

EXAMPLE:

Text: "The cat sat"
Tokens: ["The", "cat", "sat"]

Model probabilities:
    P("The" | start)     = 0.1
    P("cat" | "The")     = 0.05
    P("sat" | "The cat") = 0.2

Log probabilities (natural log):
    log(0.1)  = -2.30
    log(0.05) = -3.00
    log(0.2)  = -1.61

Average log prob: (-2.30 - 3.00 - 1.61) / 3 = -2.30

Perplexity: exp(2.30) = 10.0

# Python implementation
import numpy as np

def perplexity(log_probs):
    """Calculate perplexity from per-token log probabilities."""
    avg_neg_log_prob = -np.mean(log_probs)
    return np.exp(avg_neg_log_prob)

log_probs = [-2.30, -3.00, -1.61]
ppl = perplexity(log_probs)  # ≈ 10.0
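In practice, per-token log probabilities come from a model's output logits via a log-softmax. A minimal sketch in plain NumPy; here `logits` and `targets` are hypothetical stand-ins for a real model's next-token scores and the token ids that actually occurred:

```python
import numpy as np

def perplexity_from_logits(logits, targets):
    """Perplexity from next-token logits.

    logits:  (N, V) array, one row of vocabulary scores per position
    targets: (N,) array of the token ids that actually occurred
    """
    # Log-softmax, subtracting the max for numerical stability
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick out the log probability of each observed token
    token_log_probs = log_probs[np.arange(len(targets)), targets]
    return np.exp(-token_log_probs.mean())

# Toy example: 3 positions, vocabulary of 4 tokens.
# All-zero logits mean a uniform distribution, so PPL equals the vocab size.
print(perplexity_from_logits(np.zeros((3, 4)), np.array([0, 1, 2])))  # ≈ 4.0
```

Real evaluation harnesses additionally handle padding masks and long documents (sliding-window evaluation), which this sketch omits.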

Interactive Exercise

Interpret Perplexity

Model A has perplexity 8 on a test set. Model B has perplexity 32. Which model is better and by roughly how much?

Pro Tips
  • Perplexity is only meaningful when comparing models on the same test data
  • Low perplexity doesn't guarantee good generation quality
  • Tokenization affects perplexity—can't compare across tokenizers
  • Use perplexity for model selection, not final evaluation
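The tokenization caveat above can be made concrete: two tokenizers can assign a sentence the same total probability yet report different perplexities, because PPL averages over a different number of tokens. A toy illustration with made-up numbers:

```python
import numpy as np

# Suppose both tokenizers give the sentence the same total log probability...
total_log_prob = -12.0

# ...but tokenizer A splits it into 6 tokens and tokenizer B into 4.
ppl_a = np.exp(-total_log_prob / 6)  # exp(2.0) ≈ 7.39
ppl_b = np.exp(-total_log_prob / 4)  # exp(3.0) ≈ 20.09

# Same text, same total likelihood, very different per-token perplexity
print(ppl_a, ppl_b)
```

This is why published perplexity numbers are only comparable when both the evaluation data and the tokenizer match.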

Related Terms