Evaluation / Metrics

ROUGE Score

Intermediate [3/5]
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric

Definition

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating text summarization and other generation tasks. Unlike BLEU, which focuses on precision, ROUGE emphasizes recall - how much of the reference content appears in the generated text.

ROUGE is the standard metric for summarization tasks and captures content coverage.

Key Concepts

  • ROUGE-N: N-gram overlap (ROUGE-1, ROUGE-2 most common)
  • ROUGE-L: Longest Common Subsequence
  • Recall-focused: Measures coverage of reference content
  • F1 score: the harmonic mean of precision and recall is usually what gets reported (a from-scratch sketch follows below)
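
The concepts above map directly onto a few lines of code. Below is a minimal from-scratch sketch of ROUGE-N (function names are illustrative, not from any library); it uses clipped n-gram counts, matching the standard ROUGE-N definition, while the official rouge-score package adds tokenization and optional stemming on top.

    # Minimal from-scratch sketch of ROUGE-N (illustrative names; the official
    # rouge-score package adds tokenization and optional stemming on top of this).
    from collections import Counter

    def ngram_counts(tokens, n):
        """Counter of n-grams (as tuples) in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n(reference, candidate, n=1):
        """Return (recall, precision, F1) for ROUGE-N using clipped n-gram counts."""
        ref = ngram_counts(reference.lower().split(), n)
        cand = ngram_counts(candidate.lower().split(), n)
        # Each candidate n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[ng]) for ng, count in cand.items())
        recall = overlap / sum(ref.values()) if ref else 0.0
        precision = overlap / sum(cand.values()) if cand else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return recall, precision, f1

    print(rouge_n("The cat sat on the mat", "A cat on the mat", n=1))  # ~ (0.67, 0.80, 0.73)
    print(rouge_n("The cat sat on the mat", "A cat on the mat", n=2))  # ~ (0.40, 0.50, 0.44)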

Examples

Variants
ROUGE Score Types
1. ROUGE-1 (unigram overlap)

   Reference: "The cat sat on the mat"
   Candidate: "A cat on the mat"

   Reference unigrams: the, cat, sat, on, the, mat (6 tokens)
   Candidate unigrams: a, cat, on, the, mat (5 tokens)
   Overlapping unigrams (clipped counts): cat, on, the, mat = 4

   Recall    = 4/6 ≈ 0.67 (67% of the reference is covered)
   Precision = 4/5 = 0.80
   F1        = 2 × 0.67 × 0.80 / (0.67 + 0.80) ≈ 0.73

2. ROUGE-2 (bigram overlap)

   Reference bigrams: the cat, cat sat, sat on, on the, the mat (5)
   Candidate bigrams: a cat, cat on, on the, the mat (4)
   Overlapping bigrams: on the, the mat = 2

   Recall    = 2/5 = 0.40
   Precision = 2/4 = 0.50
   F1        = 2 × 0.40 × 0.50 / (0.40 + 0.50) ≈ 0.44

3. ROUGE-L (longest common subsequence)

   Reference: "The cat sat on the mat"
   Candidate: "A cat on the mat"
   LCS: "cat on the mat" (length 4)

   Recall    = LCS_len / ref_len  = 4/6 ≈ 0.67
   Precision = LCS_len / cand_len = 4/5 = 0.80
   F1        ≈ 0.73

   Why LCS? It captures word order without requiring consecutive matches, so it is more flexible than fixed n-grams.

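As a companion to the worked ROUGE-L numbers above, here is a short dynamic-programming sketch (illustrative only; the rouge-score package also provides a summary-level 'rougeLsum' variant that splits the text into sentences):

    def lcs_length(a, b):
        """Length of the longest common subsequence of two token lists."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def rouge_l(reference, candidate):
        """Return (recall, precision, F1) for ROUGE-L based on the LCS."""
        ref, cand = reference.lower().split(), candidate.lower().split()
        lcs = lcs_length(ref, cand)      # "cat on the mat" -> 4 for the example above
        recall = lcs / len(ref)          # 4/6 ≈ 0.67
        precision = lcs / len(cand)      # 4/5 = 0.80
        f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
        return recall, precision, f1

    print(rouge_l("The cat sat on the mat", "A cat on the mat"))
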
Comparison
BLEU vs ROUGE
┌─────────────┬─────────────────────────┬──────────────────────┐
│ Aspect      │ BLEU                    │ ROUGE                │
├─────────────┼─────────────────────────┼──────────────────────┤
│ Focus       │ Precision               │ Recall               │
│ Primary use │ Translation             │ Summarization        │
│ Question    │ "Is output in ref?"     │ "Is ref in output?"  │
│ Length bias │ Penalizes short outputs │ Rewards long outputs │
└─────────────┴─────────────────────────┴──────────────────────┘

Example:
  Reference: "The quick brown fox jumps"
  Candidate: "brown fox"

  BLEU (precision-focused): precision = 2/2 = 100%, since every candidate word appears in the reference, but the brevity penalty pulls the score down.
  ROUGE-1 (recall-focused): recall = 2/5 = 40%, since only 2 of the 5 reference words are covered - the summary is missing key information.

When to use each:

  BLEU:
    ✓ Machine translation
    ✓ Where exact wording matters
    ✓ Outputs expected to be similar in length to the reference
  ROUGE:
    ✓ Summarization
    ✓ Question answering
    ✓ Where content coverage matters
    ✓ Variable-length outputs are fine

Practical usage:

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'],
        use_stemmer=True
    )
    scores = scorer.score(reference, candidate)
    print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}")
    print(f"ROUGE-2 F1: {scores['rouge2'].fmeasure:.3f}")
    print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")

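For corpus-level reporting, per-pair scores are usually aggregated rather than quoted one pair at a time. A hedged sketch using the same rouge-score package's bootstrap aggregator (the scoring module and BootstrapAggregator names are from the google-research rouge-score package; verify against the version you install):

    # Hedged sketch: aggregating ROUGE over a small corpus with bootstrap
    # confidence intervals (example sentences reused from this page).
    from rouge_score import rouge_scorer, scoring

    references = ["The cat sat on the mat", "The quick brown fox jumps"]
    candidates = ["A cat on the mat", "brown fox"]

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    aggregator = scoring.BootstrapAggregator()

    for ref, cand in zip(references, candidates):
        aggregator.add_scores(scorer.score(ref, cand))

    result = aggregator.aggregate()
    for key in ["rouge1", "rouge2", "rougeL"]:
        mid = result[key].mid  # point estimate; .low / .high give the confidence interval
        print(f"{key}: P={mid.precision:.3f} R={mid.recall:.3f} F1={mid.fmeasure:.3f}")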

Interactive Exercise

Calculate ROUGE-1

Reference: "Machine learning is transforming industries"
Candidate: "AI and machine learning transform business"

Calculate ROUGE-1 recall, precision, and F1.
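
After working the numbers out by hand, one way to check them is with the same rouge-score package used above; here is a small self-check sketch (run it with and without stemming and note how "transforming" vs "transform" changes the overlap):

    # Self-check sketch for the exercise: stemmed vs. unstemmed ROUGE-1.
    from rouge_score import rouge_scorer

    reference = "Machine learning is transforming industries"
    candidate = "AI and machine learning transform business"

    for stem in (False, True):
        scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=stem)
        s = scorer.score(reference, candidate)["rouge1"]
        print(f"use_stemmer={stem}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")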

Pro Tips
  • ROUGE-2 correlates better with human judgment than ROUGE-1
  • Use stemming (use_stemmer=True) for better matching
  • Report F1 score for balanced precision/recall
  • ROUGE doesn't capture semantic similarity - consider an embedding-based metric such as BERTScore as well (sketched below)
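
On that last tip, a hedged sketch of running BERTScore alongside ROUGE (this assumes the third-party bert-score package; its API and defaults can differ between versions, and the sentence pair is purely illustrative):

    # Hedged sketch: complementing ROUGE with a semantic metric.
    # Assumes `pip install bert-score`; the example sentences are illustrative.
    from bert_score import score

    references = ["The cat sat on the mat"]
    candidates = ["A feline rested on the rug"]  # low lexical overlap, similar meaning

    # Returns precision, recall, and F1 tensors, one value per candidate/reference pair.
    P, R, F1 = score(candidates, references, lang="en")
    print(f"BERTScore F1: {F1[0].item():.3f}")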

Related Terms