Calculation
BLEU Score Formula
BLEU SCORE CALCULATION:
BLEU = BP × exp(Σ wₙ × log(pₙ))
Where:
- BP = Brevity Penalty
- pₙ = modified precision for n-grams
- wₙ = n-gram weights (usually uniform: 1/4 each when using up to 4-grams)
STEP-BY-STEP EXAMPLE:
Reference: "The cat sat on the mat"
Candidate: "The cat is on the mat"
1-GRAMS (unigrams):
Candidate: [the, cat, is, on, the, mat]
Reference: [the, cat, sat, on, the, mat]
Matches: the(2), cat(1), on(1), mat(1) = 5/6
p₁ = 5/6 = 0.833
2-GRAMS (bigrams):
Candidate: [the cat, cat is, is on, on the, the mat]
Reference: [the cat, cat sat, sat on, on the, the mat]
Matches: the cat(1), on the(1), the mat(1) = 3/5
p₂ = 3/5 = 0.6
3-GRAMS:
Candidate: [the cat is, cat is on, is on the, on the mat]
Reference: [the cat sat, cat sat on, sat on the, on the mat]
Matches: on the mat(1) = 1/4
p₃ = 1/4 = 0.25
4-GRAMS:
Candidate: [the cat is on, cat is on the, is on the mat]
Reference: [the cat sat on, cat sat on the, sat on the mat]
Matches: 0/3
p₄ = 0/3 = 0
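The four clipped (modified) precisions above can be reproduced with a short sketch. Note the tokenization here is a plain lowercase split, which matches the hand example but is simpler than what real toolkits do:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most
    as many times as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(1, sum(cand_counts.values()))

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

for n in range(1, 5):
    print(f"p_{n} = {modified_precision(candidate, reference, n):.3f}")
# p_1 = 0.833, p_2 = 0.600, p_3 = 0.250, p_4 = 0.000
```

Clipping is what makes "the" count only twice in the unigram step even though a degenerate candidate could repeat it more often.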
BREVITY PENALTY:
c = candidate length = 6
r = reference length = 6
BP = exp(1 - r/c) if c ≤ r, else 1
BP = 1 (same length)
Because p₄ = 0 and log(0) is undefined, the unsmoothed score collapses:
BLEU = 1 × exp(0.25 × (log(0.833) + log(0.6)
       + log(0.25) + log(0)))
     = 0
In practice, toolkits smooth the zero count (e.g. replace p₄ with a
small ε), yielding a small positive score whose exact value depends
on the smoothing method.
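Putting the pieces together, here is a minimal sketch of the final combination. The optional `eps` floor is an illustrative stand-in for real smoothing, so the smoothed number it produces is specific to that choice of ε:

```python
import math

def bleu(precisions, c, r, eps=None):
    """Brevity penalty times the geometric mean of n-gram precisions.
    `eps` optionally floors zero precisions (a crude stand-in for the
    smoothing that real toolkits apply)."""
    bp = 1.0 if c > r else math.exp(1 - r / c)
    if eps is not None:
        precisions = [max(p, eps) for p in precisions]
    if any(p == 0 for p in precisions):
        return 0.0  # log(0) is undefined: unsmoothed BLEU collapses to 0
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)

ps = [5/6, 3/5, 1/4, 0.0]  # the precisions from the worked example
print(bleu(ps, c=6, r=6))            # 0.0 — the zero 4-gram precision dominates
print(bleu(ps, c=6, r=6, eps=0.01))  # ≈ 0.19 with this particular floor
```

This makes the two key behaviors visible: a single zero precision zeroes the whole score, and the score after smoothing is an artifact of the smoothing choice, not a single canonical number.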
Usage
BLEU in Practice
USING BLEU (Python):
from sacrebleu import corpus_bleu

# Single reference: sacrebleu takes one list of references
# per reference "stream", parallel to the candidates
references = [["The cat sat on the mat"]]
candidates = ["The cat is on the mat"]
bleu = corpus_bleu(candidates, references)
print(f"BLEU: {bleu.score:.2f}")  # 0-100 scale; sacrebleu
# smooths the zero 4-gram count, so the exact value depends
# on its tokenization and smoothing defaults

# Multiple references (better!)
references = [
    ["The cat sat on the mat"],
    ["A cat is sitting on the mat"],
    ["The cat was on the mat"],
]
candidates = ["The cat is on the mat"]
bleu = corpus_bleu(candidates, references)
print(f"BLEU: {bleu.score:.2f}")  # Higher: more chances
# for candidate n-grams to match some reference
INTERPRETING BLEU SCORES:
┌───────────────┬────────────────────────────┐
│ Score │ Interpretation │
├───────────────┼────────────────────────────┤
│ < 10 │ Almost useless │
│ 10-19 │ Hard to understand │
│ 20-29 │ Gist is clear │
│ 30-39 │ Understandable │
│ 40-49 │ High quality │
│ 50-59 │ Very high quality │
│ > 60 │ Approaching human level │
└───────────────┴────────────────────────────┘
BLEU LIMITATIONS:
- Doesn't capture meaning/semantics
- Penalizes valid paraphrases
- Sensitive to tokenization
- Sentence-level is unreliable
- Not great for open-ended generation
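The paraphrase penalty is visible even at the unigram level. A self-contained sketch (the sentences are illustrative):

```python
from collections import Counter

def unigram_precision(cand, ref):
    """Clipped unigram overlap — the first ingredient of BLEU."""
    c = Counter(cand.lower().split())
    r = Counter(ref.lower().split())
    return sum(min(n, r[w]) for w, n in c.items()) / sum(c.values())

ref = "the cat sat on the mat"
paraphrase = "a feline rested upon the rug"  # same meaning, different words
print(unigram_precision(paraphrase, ref))    # only "the" matches: 1/6
```

A perfectly valid paraphrase scores near zero, which is why BLEU is a poor fit for tasks where many surface forms are equally correct.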
WHEN TO USE BLEU:
✓ Machine translation
✓ Comparing translation systems
✗ Creative writing
✗ Open-ended generation
✗ Summarization (use ROUGE instead)