Evaluation / Metrics

Evaluation

Foundational [2/5]
Also known as: model evaluation, LLM evals, assessment

Definition

Evaluation is the process of measuring how well an LLM performs on specific tasks or criteria. This includes automated metrics, benchmark tests, and human assessments to understand model capabilities, limitations, and fitness for particular use cases.

Effective evaluation is crucial for model selection, development decisions, and ensuring quality in production.

Key Concepts

  • Automated metrics: Programmatic measures (BLEU, ROUGE, accuracy); see the sketch after this list
  • Human evaluation: Expert or crowdsourced quality ratings
  • LLM-as-judge: Using another LLM to evaluate outputs
  • Evals framework: Systematic evaluation pipelines
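
As a quick illustration of automated metrics, here is a minimal sketch in plain Python (no evaluation library; exact_match and token_f1 are illustrative names, not from any framework) that computes exact-match accuracy and token-level F1 against reference answers:

from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: score a small batch of predictions against references
preds = ["Paris", "The answer is 42"]
refs = ["Paris", "42"]
accuracy = sum(exact_match(p, r) for p, r in zip(preds, refs)) / len(refs)
mean_f1 = sum(token_f1(p, r) for p, r in zip(preds, refs)) / len(refs)
print(f"Exact match: {accuracy:.2%}, token F1: {mean_f1:.2f}")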

Examples

Types
Evaluation Approaches
EVALUATION TAXONOMY:

1. INTRINSIC EVALUATION: measures model properties directly
   - Perplexity (language modeling)
   - Loss on validation set
   - Calibration (confidence = accuracy)

2. EXTRINSIC EVALUATION: measures performance on downstream tasks
   - QA accuracy
   - Summarization quality
   - Code correctness (pass@k)

3. AUTOMATED METRICS:

   ┌──────────────┬──────────────────────────────┐
   │ Metric       │ What it measures             │
   ├──────────────┼──────────────────────────────┤
   │ Accuracy     │ Exact match rate             │
   │ F1 Score     │ Precision/recall balance     │
   │ BLEU         │ N-gram overlap (translation) │
   │ ROUGE        │ Overlap (summarization)      │
   │ Pass@k       │ Code correctness             │
   │ Perplexity   │ Language model quality       │
   └──────────────┴──────────────────────────────┘

4. HUMAN EVALUATION:
   - Pairwise comparison (A vs. B)
   - Likert scale ratings (1-5)
   - Task completion rate
   - Annotation of errors

5. LLM-AS-JUDGE:
   - Use a strong model (GPT-4, Claude) to rate outputs,
     e.g. "Rate this response 1-10 for helpfulness"
   - Cheaper than human raters and correlates well with them
   - But: potential biases (length, style)
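
Pass@k in the table above is usually computed with the standard unbiased estimator: generate n samples per problem, count the c that pass the tests, and estimate the probability that at least one of k drawn samples is correct. A minimal sketch:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 12 passed the unit tests
print(f"pass@1  = {pass_at_k(200, 12, 1):.3f}")
print(f"pass@10 = {pass_at_k(200, 12, 10):.3f}")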
Framework
Building an Eval Pipeline
EVALUATION FRAMEWORK STRUCTURE (example using the OpenAI Evals pattern):

evals/
├── my_eval.yaml             # Eval definition
├── data/
│   └── test_cases.jsonl     # Test inputs/outputs
└── criteria/
    └── grading.py           # Scoring logic

EVAL DEFINITION (YAML):

name: customer_support_quality
description: Evaluate customer support responses
dataset: ./data/support_queries.jsonl
metrics:
  - accuracy      # factual correctness
  - helpfulness   # rated 1-5
  - tone          # appropriate/not

TEST CASE FORMAT (JSONL):

{"input": "How do I reset my password?", "expected_topics": ["password", "reset", "email"], "expected_action": "provide_instructions"}
{"input": "I'm very frustrated with your service!", "expected_topics": ["empathy", "resolution"], "expected_tone": "apologetic, helpful"}

SCORING IMPLEMENTATION:

def evaluate_response(input, output, expected):
    scores = {}

    # Automated checks
    scores['contains_keywords'] = check_topics(
        output, expected['expected_topics']
    )

    # LLM-as-judge
    scores['helpfulness'] = llm_judge(
        prompt=f"Rate helpfulness 1-5:\n{output}",
        rubric="5=fully resolves, 1=unhelpful"
    )

    # Aggregate
    scores['overall'] = weighted_average(scores)
    return scores

RUNNING EVALS:

results = eval_runner.run(
    model="gpt-4",
    eval_name="customer_support_quality",
    sample_size=500
)
print(f"Accuracy: {results.accuracy:.2%}")
print(f"Helpfulness: {results.helpfulness:.1f}/5")
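
To make the pipeline concrete, here is a minimal sketch of a runner loop over a JSONL test set. It takes your model-call and scoring functions as arguments; run_eval and the function names passed in are illustrative and do not come from a specific eval library:

import json

def run_eval(test_file, get_model_response, evaluate_response, sample_size=None):
    """Run each test case through the model and average the returned scores."""
    totals, count = {}, 0
    with open(test_file) as f:
        for i, line in enumerate(f):
            if sample_size is not None and i >= sample_size:
                break
            case = json.loads(line)
            output = get_model_response(case["input"])
            scores = evaluate_response(case["input"], output, case)
            for name, value in scores.items():
                totals[name] = totals.get(name, 0.0) + value
            count += 1
    return {name: total / count for name, total in totals.items()}

# Usage (plug in your own model call and the scoring function sketched above):
# results = run_eval("data/test_cases.jsonl", call_model, evaluate_response, sample_size=500)
# print(f"Overall: {results['overall']:.2f}")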

Interactive Exercise

Choose Evaluation Methods

For each use case, what evaluation method would be most appropriate and why?

1. Translation quality

2. Creative writing

3. Code generation

Pro Tips
  • Automated metrics are fast but may not capture what humans care about
  • LLM-as-judge is cost-effective but can have systematic biases
  • Always validate automated evals against human judgment (see the sketch after this list)
  • Track evaluation metrics over time, not just at launch
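
One way to do that validation is to collect human ratings for a sample of outputs and check how strongly the automated or LLM-as-judge scores correlate with them, for example with a Spearman rank correlation via scipy (the numbers below are made up for illustration):

from scipy.stats import spearmanr

# Scores for the same 8 responses from two sources (illustrative data)
llm_judge_scores = [4, 5, 2, 3, 5, 1, 4, 2]
human_ratings    = [4, 4, 2, 3, 5, 2, 5, 2]

corr, p_value = spearmanr(llm_judge_scores, human_ratings)
print(f"Spearman correlation: {corr:.2f} (p={p_value:.3f})")
# A low correlation suggests the automated judge is not measuring
# what human raters care about; revisit the rubric or judge prompt.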
