Evaluation / Metrics

ROUGE Score

Intermediate [3/5]
Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric

Definition

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating text summarization and other generation tasks. Unlike BLEU, which focuses on precision, ROUGE emphasizes recall - how much of the reference content appears in the generated text.

ROUGE is the standard metric for summarization tasks and captures content coverage.

Key Concepts

  • ROUGE-N: N-gram overlap (ROUGE-1, ROUGE-2 most common)
  • ROUGE-L: Longest Common Subsequence
  • Recall-focused: Measures coverage of reference content
  • F1 score: the harmonic mean of precision and recall is usually what gets reported (a from-scratch sketch follows below)
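
The concepts above map directly onto a few lines of code. Below is a minimal from-scratch sketch of ROUGE-N (function names are illustrative, not from any library); it uses clipped n-gram counts, matching the standard ROUGE-N definition, while the official rouge-score package adds tokenization and optional stemming on top.

    # Minimal from-scratch sketch of ROUGE-N (illustrative names; the official
    # rouge-score package adds tokenization and optional stemming on top of this).
    from collections import Counter

    def ngram_counts(tokens, n):
        """Counter of n-grams (as tuples) in a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n(reference, candidate, n=1):
        """Return (recall, precision, F1) for ROUGE-N using clipped n-gram counts."""
        ref = ngram_counts(reference.lower().split(), n)
        cand = ngram_counts(candidate.lower().split(), n)
        # Each candidate n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref[ng]) for ng, count in cand.items())
        recall = overlap / sum(ref.values()) if ref else 0.0
        precision = overlap / sum(cand.values()) if cand else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        return recall, precision, f1

    print(rouge_n("The cat sat on the mat", "A cat on the mat", n=1))  # ~ (0.67, 0.80, 0.73)
    print(rouge_n("The cat sat on the mat", "A cat on the mat", n=2))  # ~ (0.40, 0.50, 0.44)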

Examples

Variants
ROUGE Score Types
1. ROUGE-1 (unigram overlap)

   Reference: "The cat sat on the mat"
   Candidate: "A cat on the mat"

   Reference unigrams: the, cat, sat, on, the, mat (6 tokens)
   Candidate unigrams: a, cat, on, the, mat (5 tokens)
   Overlapping unigrams (clipped counts): cat, on, the, mat = 4

   Recall    = 4/6 ≈ 0.67 (67% of the reference is covered)
   Precision = 4/5 = 0.80
   F1        = 2 × 0.67 × 0.80 / (0.67 + 0.80) ≈ 0.73

2. ROUGE-2 (bigram overlap)

   Reference bigrams: the cat, cat sat, sat on, on the, the mat (5)
   Candidate bigrams: a cat, cat on, on the, the mat (4)
   Overlapping bigrams: on the, the mat = 2

   Recall    = 2/5 = 0.40
   Precision = 2/4 = 0.50
   F1        = 2 × 0.40 × 0.50 / (0.40 + 0.50) ≈ 0.44

3. ROUGE-L (longest common subsequence)

   Reference: "The cat sat on the mat"
   Candidate: "A cat on the mat"
   LCS: "cat on the mat" (length 4)

   Recall    = LCS_len / ref_len  = 4/6 ≈ 0.67
   Precision = LCS_len / cand_len = 4/5 = 0.80
   F1        ≈ 0.73

   Why LCS? It captures word order without requiring consecutive matches, so it is more flexible than fixed n-grams.

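As a companion to the worked ROUGE-L numbers above, here is a short dynamic-programming sketch (illustrative only; the rouge-score package also provides a summary-level 'rougeLsum' variant that splits the text into sentences):

    def lcs_length(a, b):
        """Length of the longest common subsequence of two token lists."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    def rouge_l(reference, candidate):
        """Return (recall, precision, F1) for ROUGE-L based on the LCS."""
        ref, cand = reference.lower().split(), candidate.lower().split()
        lcs = lcs_length(ref, cand)      # "cat on the mat" -> 4 for the example above
        recall = lcs / len(ref)          # 4/6 ≈ 0.67
        precision = lcs / len(cand)      # 4/5 = 0.80
        f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
        return recall, precision, f1

    print(rouge_l("The cat sat on the mat", "A cat on the mat"))
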
Comparison
BLEU vs ROUGE
┌─────────────┬─────────────────────────┬──────────────────────┐
│ Aspect      │ BLEU                    │ ROUGE                │
├─────────────┼─────────────────────────┼──────────────────────┤
│ Focus       │ Precision               │ Recall               │
│ Primary use │ Translation             │ Summarization        │
│ Question    │ "Is output in ref?"     │ "Is ref in output?"  │
│ Length bias │ Penalizes short outputs │ Rewards long outputs │
└─────────────┴─────────────────────────┴──────────────────────┘

Example:
  Reference: "The quick brown fox jumps"
  Candidate: "brown fox"

  BLEU (precision-focused): precision = 2/2 = 100%, since every candidate word appears in the reference, but the brevity penalty pulls the score down.
  ROUGE-1 (recall-focused): recall = 2/5 = 40%, since only 2 of the 5 reference words are covered - the summary is missing key information.

When to use each:

  BLEU:
    ✓ Machine translation
    ✓ Where exact wording matters
    ✓ Outputs expected to be similar in length to the reference
  ROUGE:
    ✓ Summarization
    ✓ Question answering
    ✓ Where content coverage matters
    ✓ Variable-length outputs are fine

Practical usage:

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(
        ['rouge1', 'rouge2', 'rougeL'],
        use_stemmer=True
    )
    scores = scorer.score(reference, candidate)
    print(f"ROUGE-1 F1: {scores['rouge1'].fmeasure:.3f}")
    print(f"ROUGE-2 F1: {scores['rouge2'].fmeasure:.3f}")
    print(f"ROUGE-L F1: {scores['rougeL'].fmeasure:.3f}")

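For corpus-level reporting, per-pair scores are usually aggregated rather than quoted one pair at a time. A hedged sketch using the same rouge-score package's bootstrap aggregator (the scoring module and BootstrapAggregator names are from the google-research rouge-score package; verify against the version you install):

    # Hedged sketch: aggregating ROUGE over a small corpus with bootstrap
    # confidence intervals (example sentences reused from this page).
    from rouge_score import rouge_scorer, scoring

    references = ["The cat sat on the mat", "The quick brown fox jumps"]
    candidates = ["A cat on the mat", "brown fox"]

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    aggregator = scoring.BootstrapAggregator()

    for ref, cand in zip(references, candidates):
        aggregator.add_scores(scorer.score(ref, cand))

    result = aggregator.aggregate()
    for key in ["rouge1", "rouge2", "rougeL"]:
        mid = result[key].mid  # point estimate; .low / .high give the confidence interval
        print(f"{key}: P={mid.precision:.3f} R={mid.recall:.3f} F1={mid.fmeasure:.3f}")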

Interactive Exercise

Calculate ROUGE-1

Reference: "Machine learning is transforming industries"
Candidate: "AI and machine learning transform business"

Calculate ROUGE-1 recall, precision, and F1.
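
After working the numbers out by hand, one way to check them is with the same rouge-score package used above; here is a small self-check sketch (run it with and without stemming and note how "transforming" vs "transform" changes the overlap):

    # Self-check sketch for the exercise: stemmed vs. unstemmed ROUGE-1.
    from rouge_score import rouge_scorer

    reference = "Machine learning is transforming industries"
    candidate = "AI and machine learning transform business"

    for stem in (False, True):
        scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=stem)
        s = scorer.score(reference, candidate)["rouge1"]
        print(f"use_stemmer={stem}: P={s.precision:.2f} R={s.recall:.2f} F1={s.fmeasure:.2f}")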

Pro Tips
  • ROUGE-2 correlates better with human judgment than ROUGE-1
  • Use stemming (use_stemmer=True) for better matching
  • Report F1 score for balanced precision/recall
  • ROUGE doesn't capture semantic similarity - consider an embedding-based metric such as BERTScore as well (sketched below)
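
On that last tip, a hedged sketch of running BERTScore alongside ROUGE (this assumes the third-party bert-score package; its API and defaults can differ between versions, and the sentence pair is purely illustrative):

    # Hedged sketch: complementing ROUGE with a semantic metric.
    # Assumes `pip install bert-score`; the example sentences are illustrative.
    from bert_score import score

    references = ["The cat sat on the mat"]
    candidates = ["A feline rested on the rug"]  # low lexical overlap, similar meaning

    # Returns precision, recall, and F1 tensors, one value per candidate/reference pair.
    P, R, F1 = score(candidates, references, lang="en")
    print(f"BERTScore F1: {F1[0].item():.3f}")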

Related Terms