BLEU Score

Intermediate [3/5]
The Bilingual Evaluation Understudy (BLEU) metric

Definition

BLEU (Bilingual Evaluation Understudy) is an automated metric for evaluating machine-generated text against one or more reference translations. It measures modified (clipped) n-gram precision: the fraction of n-grams in the candidate that also appear in a reference, with each reference n-gram credited at most as many times as it occurs there.

BLEU scores range from 0 to 1 (often reported on a 0-100 scale), where higher scores indicate greater overlap with the reference text.
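
The clipping ("modified precision") is the detail most often gotten wrong, so here is a minimal sketch of clipped unigram precision using only the Python standard library. The whitespace-split tokenization and the helper name clipped_unigram_precision are simplifying assumptions for this page, not library functions:

from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference,
    crediting each reference token at most as often as it appears there."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum(min(count, ref[token]) for token, count in cand.items())
    return matches / sum(cand.values())

print(clipped_unigram_precision("The cat is on the mat",
                                "The cat sat on the mat"))  # 5/6 ≈ 0.833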

Key Concepts

  • N-gram precision: Clipped precision computed over 1-grams through 4-grams
  • Brevity penalty: Penalizes candidates shorter than the reference
  • Multiple references: Can compare against several valid translations
  • Corpus-level: More reliable over a document or corpus than on single sentences

Examples

Calculation
BLEU Score Formula
BLEU SCORE CALCULATION:

BLEU = BP × exp(Σ wₙ × log(pₙ))

Where:
- BP = brevity penalty
- pₙ = modified precision for n-grams
- wₙ = weights (usually uniform: 1/4 each)

STEP-BY-STEP EXAMPLE:

Reference: "The cat sat on the mat"
Candidate: "The cat is on the mat"

1-GRAMS (unigrams):
Candidate: [the, cat, is, on, the, mat]
Reference: [the, cat, sat, on, the, mat]
Matches: the(2), cat(1), on(1), mat(1) = 5/6
p₁ = 5/6 ≈ 0.833

2-GRAMS (bigrams):
Candidate: [the cat, cat is, is on, on the, the mat]
Reference: [the cat, cat sat, sat on, on the, the mat]
Matches: the cat(1), on the(1), the mat(1) = 3/5
p₂ = 3/5 = 0.6

3-GRAMS (trigrams):
Candidate: [the cat is, cat is on, is on the, on the mat]
Reference: [the cat sat, cat sat on, sat on the, on the mat]
Matches: on the mat(1) = 1/4
p₃ = 1/4 = 0.25

4-GRAMS:
Candidate: [the cat is on, cat is on the, is on the mat]
Reference: [the cat sat on, cat sat on the, sat on the mat]
Matches: none
p₄ = 0/3 = 0

BREVITY PENALTY:
c = candidate length = 6
r = reference length = 6
BP = 1 if c > r, else exp(1 − r/c)
Here c = r, so BP = exp(0) = 1

FINAL SCORE:
Because p₄ = 0, the unsmoothed geometric mean (and hence BLEU) is exactly 0,
so implementations substitute a small smoothed value ε for the zero count:

BLEU = 1 × exp(0.25 × (log 0.833 + log 0.6 + log 0.25 + log ε))
     ≈ 0.3–0.4 (30–40 on the 0–100 scale), depending on the smoothing method
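
To verify the arithmetic, here is a short from-scratch sketch that reproduces the example above. The smoothing rule for the zero 4-gram count (a 1/(2 × total) floor, roughly in the spirit of sacrebleu's default) is an assumption, and the final number depends on that choice:

from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    # All contiguous n-token slices of the sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(cand, ref, n):
    # Clipped matches: each reference n-gram is credited at most
    # as many times as it appears in the reference
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matches, sum(cand_counts.values())

cand = "the cat is on the mat".split()
ref = "the cat sat on the mat".split()

precisions = []
for n in range(1, 5):
    matches, total = modified_precision(cand, ref, n)
    # A zero count would zero out the whole geometric mean, so smooth it
    precisions.append(matches / total if matches else 1.0 / (2 * total))

bp = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / len(cand))
bleu = bp * exp(sum(0.25 * log(p) for p in precisions))
print([round(p, 3) for p in precisions])  # [0.833, 0.6, 0.25, 0.167]
print(f"BLEU ≈ {bleu:.2f}")               # ≈ 0.38 with this smoothing
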
Usage
BLEU in Practice
USING BLEU (Python):

from sacrebleu import corpus_bleu

# Single reference
reference = [["The cat sat on the mat"]]
candidate = ["The cat is on the mat"]
bleu = corpus_bleu(candidate, reference)
print(f"BLEU: {bleu.score:.2f}")  # roughly 30-40, depending on smoothing

# Multiple references (better!)
references = [
    ["The cat sat on the mat"],
    ["A cat is sitting on the mat"],
    ["The cat was on the mat"],
]
candidates = ["The cat is on the mat"]
bleu = corpus_bleu(candidates, references)
print(f"BLEU: {bleu.score:.2f}")  # higher than with a single reference

INTERPRETING BLEU SCORES:

┌─────────┬─────────────────────────┐
│ Score   │ Interpretation          │
├─────────┼─────────────────────────┤
│ < 10    │ Almost useless          │
│ 10-19   │ Hard to understand      │
│ 20-29   │ Gist is clear           │
│ 30-39   │ Understandable          │
│ 40-49   │ High quality            │
│ 50-59   │ Very high quality       │
│ > 60    │ Approaching human level │
└─────────┴─────────────────────────┘

BLEU LIMITATIONS:
- Doesn't capture meaning or semantics
- Penalizes valid paraphrases
- Sensitive to tokenization (demonstrated below)
- Unreliable at the sentence level
- A poor fit for open-ended generation

WHEN TO USE BLEU:
✓ Machine translation
✓ Comparing translation systems
✗ Creative writing
✗ Open-ended generation
✗ Summarization (use ROUGE instead)
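
The tokenization sensitivity flagged above is easy to observe directly. A small sketch using sacrebleu's BLEU class: tokenize is a documented sacrebleu parameter ("13a" is the word-level default, "char" is character-level), but the exact printed scores vary by version, so treat them as illustrative:

from sacrebleu.metrics import BLEU

hyps = ["The cat is on the mat."]
refs = [["The cat sat on the mat."]]

# Same sentence pair, two tokenizers: the scores disagree
for tok in ("13a", "char"):
    score = BLEU(tokenize=tok).corpus_score(hyps, refs)
    print(f"{tok}: {score.score:.2f}")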

Interactive Exercise

Calculate Unigram Precision

Reference: "I love machine learning"
Candidate: "I really love deep learning"

What is the unigram (1-gram) precision?
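
Once you have worked it out by hand, you can check your answer with the clipped_unigram_precision helper sketched in the Definition section above (a name defined on this page, not a library function):

print(clipped_unigram_precision("I really love deep learning",
                                "I love machine learning"))
# Matches: "I", "love", "learning" -> 3 of 5 candidate tokens = 0.6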

Pro Tips
  • Use SacreBLEU for reproducible scores with consistent tokenization
  • Always report corpus-level BLEU, not averaged sentence-level scores (see the sketch after this list)
  • Multiple references significantly improve BLEU reliability
  • BLEU correlates with human judgment but isn't perfect
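
A quick demonstration of the corpus-level tip: corpus BLEU pools n-gram counts across all sentences before combining them, which is not the same as averaging per-sentence scores. Both corpus_bleu and sentence_bleu are part of sacrebleu's public API; the example sentences are illustrative:

from sacrebleu import corpus_bleu, sentence_bleu

hyps = ["The cat is on the mat", "Dogs are great companions"]
refs = [["The cat sat on the mat", "Dogs are wonderful companions"]]

# Corpus-level: n-gram statistics pooled over all sentences
corpus = corpus_bleu(hyps, refs).score

# Averaging sentence-level scores gives a different (less reliable) number
avg = sum(sentence_bleu(h, [r]).score
          for h, r in zip(hyps, refs[0])) / len(hyps)

print(f"corpus BLEU: {corpus:.2f}  vs  mean sentence BLEU: {avg:.2f}")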
