Evaluation / Metrics

Benchmark

Foundational [2/5]
Evaluation benchmark · Test suite · Leaderboard

Definition

Benchmarks are standardized test suites used to evaluate and compare AI model performance on specific tasks. They provide consistent datasets, evaluation protocols, and metrics that enable fair comparisons across different models and approaches.

Benchmarks drive progress by giving researchers clear targets and enabling reproducible evaluation.
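
A benchmark run is conceptually a scoring loop over a held-out test set. The sketch below is a minimal, hypothetical harness, not any specific framework's API; the dataset, model interface, and exact-match scoring are all illustrative:

    # Minimal benchmark harness sketch; all names here are hypothetical.
    def evaluate(model, test_set):
        """Score a model on held-out examples with exact-match accuracy."""
        correct = 0
        for example in test_set:
            prediction = model(example["prompt"])   # model under test
            correct += int(prediction == example["answer"])
        return correct / len(test_set)

    # Illustrative two-item test set and a toy "model".
    test_set = [
        {"prompt": "2 + 2 =", "answer": "4"},
        {"prompt": "Capital of France?", "answer": "Paris"},
    ]
    toy_model = lambda p: "4" if "2 + 2" in p else "Paris"
    print(evaluate(toy_model, test_set))  # 1.0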

Key Concepts

  • Test set: Held-out data for evaluation (not used in training)
  • Metrics: Standardized measures such as accuracy, F1, and perplexity (see the sketch after this list)
  • Leaderboard: Ranking of model performance on a benchmark
  • Contamination: When benchmark data leaks into training
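
As a concrete look at standardized metrics, here is a hand-rolled sketch of accuracy and F1 for a binary task. The labels are made up; real evaluations usually call a library such as scikit-learn:

    # Accuracy and F1 from scratch for binary labels (illustrative data).
    def accuracy(y_true, y_pred):
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    def f1(y_true, y_pred, positive=1):
        tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
        fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
        fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        return 2 * precision * recall / denom if denom else 0.0

    y_true = [1, 0, 1, 1, 0]
    y_pred = [1, 0, 0, 1, 1]
    print(accuracy(y_true, y_pred), f1(y_true, y_pred))  # 0.6 0.666...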

Examples

Popular
Major LLM Benchmarks
General knowledge:
  • MMLU: 57 subjects, multiple choice; tests broad knowledge
  • HellaSwag: commonsense reasoning via sentence completion
  • TruthfulQA: tests for hallucination; rewards truthful responses

Reasoning:
  • GSM8K: grade-school math problems; multi-step arithmetic
  • MATH: competition mathematics; harder than GSM8K
  • ARC: AI2 Reasoning Challenge; science questions
  • BBH: BIG-Bench Hard; challenging reasoning tasks

Coding:
  • HumanEval: Python coding problems; function completion
  • MBPP: Mostly Basic Python Problems; simpler coding tasks
  • SWE-bench: real GitHub issues with full repository context

Aggregate scores:
  • Open LLM Leaderboard (Hugging Face)
  • Chatbot Arena (crowdsourced Elo ratings, run by LMSYS)
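
Coding benchmarks such as HumanEval score functional correctness: a sample counts as passing only if the generated code satisfies the task's unit tests. Here is a simplified, hypothetical sketch of that check; the completion and tests are made up, and real harnesses sandbox execution rather than calling exec directly:

    # HumanEval-style pass/fail check (simplified, no sandboxing).
    def passes(completion: str, test_code: str) -> bool:
        """Run generated code plus the benchmark's assertions."""
        namespace = {}
        try:
            exec(completion, namespace)   # define the candidate function
            exec(test_code, namespace)    # execute the unit tests
            return True
        except Exception:
            return False

    completion = "def add(a, b):\n    return a + b\n"
    test_code = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
    print(passes(completion, test_code))  # True
    # pass@1 over a benchmark = fraction of problems whose sample passes.
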
Considerations
Benchmark Limitations
BENCHMARK CHALLENGES:

1. Contamination
   Problem: the benchmark appears in the training data, and the model may have memorized the answers. Example: GPT-4 scores 90% on benchmark X, but benchmark X was on the internet, so the model may have seen the exact questions.
   Solutions:
   - Time-based splits (train only on data from before date X)
   - Held-out private test sets
   - Canary strings to detect leakage

2. Goodhart's law
   "When a measure becomes a target, it ceases to be a good measure." Models optimized for benchmarks may game specific patterns, fail to generalize to real tasks, and miss important capabilities.

3. Narrow scope
   MMLU tests knowledge, not creativity, real-world usefulness, safety, or instruction following.

4. Saturation
   The best models score 90%+ on many benchmarks, leaving little room to differentiate; harder benchmarks are needed.

5. Evaluation variance
   The same model can receive different scores depending on prompt format, temperature settings, few-shot examples, and sampling randomness.

BEST PRACTICES:
- Use multiple benchmarks
- Include real-world evaluations
- Report confidence intervals (a bootstrap sketch follows below)
- Test for contamination (an n-gram overlap sketch follows below)
- Complement automated benchmarks with human evaluation
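
One way to act on "test for contamination" is a crude n-gram overlap check between benchmark items and training documents. The 8-gram size below is illustrative, and real pipelines (including canary-string detection) are more sophisticated:

    # Crude contamination check: share of a benchmark item's 8-grams
    # that also appear in a training document (values are illustrative).
    def ngrams(text, n=8):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def overlap(benchmark_item, training_doc, n=8):
        item = ngrams(benchmark_item, n)
        return len(item & ngrams(training_doc, n)) / len(item) if item else 0.0

    question = "What is the capital of France and when was it founded exactly"
    doc = "scraped page: what is the capital of france and when was it founded exactly"
    print(overlap(question, doc))  # 1.0 here; high overlap suggests leakage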

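And "report confidence intervals" can be as simple as a bootstrap over per-item scores. This sketch assumes binary correctness scores and an illustrative 100-item benchmark:

    # Percentile-bootstrap confidence interval for benchmark accuracy.
    import random

    def bootstrap_ci(scores, iters=10_000, alpha=0.05, seed=0):
        rng = random.Random(seed)
        n = len(scores)
        means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(iters))
        return means[int(alpha / 2 * iters)], means[int((1 - alpha / 2) * iters) - 1]

    scores = [1] * 72 + [0] * 28       # 72% accuracy on 100 items (made up)
    print(bootstrap_ci(scores))        # roughly (0.63, 0.81)
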
Interactive Exercise

Benchmark Selection

You're evaluating a model for a customer service chatbot. Which benchmarks would be most relevant, and which capabilities wouldn't be measured?

Pro Tips
  • Chatbot Arena Elo often correlates better with user preference than static benchmarks (a minimal Elo update is sketched after this list)
  • Create custom benchmarks for your specific use case
  • Watch for benchmark contamination in model releases
  • Combine automated benchmarks with human evaluation
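
Following up on the Chatbot Arena tip, here is a minimal sketch of the Elo update used by arena-style leaderboards; the K-factor and starting ratings are illustrative:

    # One Elo rating update from a single pairwise comparison.
    def elo_update(r_a, r_b, a_wins, k=32):
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # expected score for A
        score_a = 1.0 if a_wins else 0.0
        delta = k * (score_a - expected_a)
        return r_a + delta, r_b - delta

    # Model A (1200) beats model B (1250): A gains exactly what B loses.
    print(elo_update(1200, 1250, a_wins=True))  # ≈ (1218.3, 1231.7)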

Related Terms