BLEU Score

Intermediate [3/5]
The Bilingual Evaluation Understudy (BLEU) metric

Definition

BLEU (Bilingual Evaluation Understudy) is an automated metric for evaluating machine-generated text against one or more reference translations. It measures modified (clipped) n-gram precision: the fraction of n-grams in the candidate that also appear in a reference, with each reference n-gram credited at most as many times as it occurs there.

BLEU scores range from 0 to 1 (often reported on a 0-100 scale), where higher scores indicate greater overlap with the reference text.
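
The clipping ("modified precision") is the detail most often gotten wrong, so here is a minimal sketch of clipped unigram precision using only the Python standard library. The whitespace-split tokenization and the helper name clipped_unigram_precision are simplifying assumptions for this page, not library functions:

from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference,
    crediting each reference token at most as often as it appears there."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum(min(count, ref[token]) for token, count in cand.items())
    return matches / sum(cand.values())

print(clipped_unigram_precision("The cat is on the mat",
                                "The cat sat on the mat"))  # 5/6 ≈ 0.833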

Key Concepts

  • N-gram precision: Clipped precision computed over 1-grams through 4-grams
  • Brevity penalty: Penalizes candidates shorter than the reference
  • Multiple references: Can compare against several valid translations
  • Corpus-level: More reliable over a document or corpus than on single sentences

Examples

Calculation
BLEU Score Formula
BLEU SCORE CALCULATION:

BLEU = BP × exp(Σ wₙ × log(pₙ))

Where:
- BP = brevity penalty
- pₙ = modified precision for n-grams
- wₙ = weights (usually uniform: 1/4 each)

STEP-BY-STEP EXAMPLE:

Reference: "The cat sat on the mat"
Candidate: "The cat is on the mat"

1-GRAMS (unigrams):
Candidate: [the, cat, is, on, the, mat]
Reference: [the, cat, sat, on, the, mat]
Matches: the(2), cat(1), on(1), mat(1) = 5/6
p₁ = 5/6 ≈ 0.833

2-GRAMS (bigrams):
Candidate: [the cat, cat is, is on, on the, the mat]
Reference: [the cat, cat sat, sat on, on the, the mat]
Matches: the cat(1), on the(1), the mat(1) = 3/5
p₂ = 3/5 = 0.6

3-GRAMS (trigrams):
Candidate: [the cat is, cat is on, is on the, on the mat]
Reference: [the cat sat, cat sat on, sat on the, on the mat]
Matches: on the mat(1) = 1/4
p₃ = 1/4 = 0.25

4-GRAMS:
Candidate: [the cat is on, cat is on the, is on the mat]
Reference: [the cat sat on, cat sat on the, sat on the mat]
Matches: none
p₄ = 0/3 = 0

BREVITY PENALTY:
c = candidate length = 6
r = reference length = 6
BP = 1 if c > r, else exp(1 − r/c)
Here c = r, so BP = exp(0) = 1

FINAL SCORE:
Because p₄ = 0, the unsmoothed geometric mean (and hence BLEU) is exactly 0,
so implementations substitute a small smoothed value ε for the zero count:

BLEU = 1 × exp(0.25 × (log 0.833 + log 0.6 + log 0.25 + log ε))
     ≈ 0.3–0.4 (30–40 on the 0–100 scale), depending on the smoothing method
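
To verify the arithmetic, here is a short from-scratch sketch that reproduces the example above. The smoothing rule for the zero 4-gram count (a 1/(2 × total) floor, roughly in the spirit of sacrebleu's default) is an assumption, and the final number depends on that choice:

from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    # All contiguous n-token slices of the sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, ref, n):
    # Clipped matches: each reference n-gram is credited at most
    # as many times as it appears in the reference
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matches, sum(cand_counts.values())

cand = "the cat is on the mat".split()
ref = "the cat sat on the mat".split()

precisions = []
for n in range(1, 5):
    matches, total = modified_precision(cand, ref, n)
    # A zero count would zero out the whole geometric mean, so smooth it
    precisions.append(matches / total if matches else 1.0 / (2 * total))

bp = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / len(cand))
bleu = bp * exp(sum(0.25 * log(p) for p in precisions))
print([round(p, 3) for p in precisions])  # [0.833, 0.6, 0.25, 0.167]
print(f"BLEU ≈ {bleu:.2f}")               # ≈ 0.38 with this smoothing
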
Usage
BLEU in Practice
USING BLEU (Python):

from sacrebleu import corpus_bleu

# Single reference
reference = [["The cat sat on the mat"]]
candidate = ["The cat is on the mat"]
bleu = corpus_bleu(candidate, reference)
print(f"BLEU: {bleu.score:.2f}")  # roughly 30-40, depending on smoothing

# Multiple references (better!)
references = [
    ["The cat sat on the mat"],
    ["A cat is sitting on the mat"],
    ["The cat was on the mat"],
]
candidates = ["The cat is on the mat"]
bleu = corpus_bleu(candidates, references)
print(f"BLEU: {bleu.score:.2f}")  # higher than with a single reference

INTERPRETING BLEU SCORES:

┌─────────┬─────────────────────────┐
│ Score   │ Interpretation          │
├─────────┼─────────────────────────┤
│ < 10    │ Almost useless          │
│ 10-19   │ Hard to understand      │
│ 20-29   │ Gist is clear           │
│ 30-39   │ Understandable          │
│ 40-49   │ High quality            │
│ 50-59   │ Very high quality       │
│ > 60    │ Approaching human level │
└─────────┴─────────────────────────┘

BLEU LIMITATIONS:
- Doesn't capture meaning or semantics
- Penalizes valid paraphrases
- Sensitive to tokenization (demonstrated below)
- Unreliable at the sentence level
- A poor fit for open-ended generation

WHEN TO USE BLEU:
✓ Machine translation
✓ Comparing translation systems
✗ Creative writing
✗ Open-ended generation
✗ Summarization (use ROUGE instead)
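
The tokenization sensitivity flagged above is easy to observe directly. A small sketch using sacrebleu's BLEU class: tokenize is a documented sacrebleu parameter ("13a" is the word-level default, "char" is character-level), but the exact printed scores vary by version, so treat them as illustrative:

from sacrebleu.metrics import BLEU

hyps = ["The cat is on the mat."]
refs = [["The cat sat on the mat."]]

# Same sentence pair, two tokenizers: the scores disagree
for tok in ("13a", "char"):
    score = BLEU(tokenize=tok).corpus_score(hyps, refs)
    print(f"{tok}: {score.score:.2f}")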

Interactive Exercise

Calculate Unigram Precision

Reference: "I love machine learning"
Candidate: "I really love deep learning"

What is the unigram (1-gram) precision?
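
Once you have worked it out by hand, you can check your answer with the clipped_unigram_precision helper sketched in the Definition section above (a name defined on this page, not a library function):

print(clipped_unigram_precision("I really love deep learning",
                                "I love machine learning"))
# Matches: "I", "love", "learning" -> 3 of 5 candidate tokens = 0.6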

Pro Tips
  • Use SacreBLEU for reproducible scores with consistent tokenization
  • Always report corpus-level BLEU, not averaged sentence-level scores (see the sketch after this list)
  • Multiple references significantly improve BLEU reliability
  • BLEU correlates with human judgment but isn't perfect
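
A quick demonstration of the corpus-level tip: corpus BLEU pools n-gram counts across all sentences before combining them, which is not the same as averaging per-sentence scores. Both corpus_bleu and sentence_bleu are part of sacrebleu's public API; the example sentences are illustrative:

from sacrebleu import corpus_bleu, sentence_bleu

hyps = ["The cat is on the mat", "Dogs are great companions"]
refs = [["The cat sat on the mat", "Dogs are wonderful companions"]]

# Corpus-level: n-gram statistics pooled over all sentences
corpus = corpus_bleu(hyps, refs).score

# Averaging sentence-level scores gives a different (less reliable) number
avg = sum(sentence_bleu(h, [r]).score
          for h, r in zip(hyps, refs[0])) / len(hyps)

print(f"corpus BLEU: {corpus:.2f}  vs  mean sentence BLEU: {avg:.2f}")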
