Human Evaluation

Foundational [2/5]
Also known as: Human judgment, Manual evaluation, Human raters

Definition

Human evaluation involves having people assess AI-generated outputs for quality, accuracy, helpfulness, and other criteria. While slower and more expensive than automated metrics, human evaluation captures nuances that algorithms miss and serves as the ground truth for what "good" output looks like.

Human evaluation is essential for validating automated metrics and evaluating subjective qualities.

Key Concepts

  • Inter-annotator agreement: Consistency between different raters (see the agreement sketch after this list)
  • Rubrics: Detailed criteria for consistent evaluation
  • Pairwise comparison: Choosing between two outputs (easier than absolute rating)
  • Crowdsourcing: Using many non-expert raters vs. few experts
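
A minimal sketch of checking inter-annotator agreement on binary acceptability labels, using scikit-learn's cohen_kappa_score; the two raters' labels below are made-up illustrative data, not real ratings.

```python
# Minimal sketch: inter-annotator agreement between two raters.
# The labels below are made-up illustrative data, not real ratings.
from sklearn.metrics import cohen_kappa_score

# Each rater labels the same 10 responses as acceptable (1) or not (0).
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # > 0.6 is often treated as acceptable agreement
```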

Examples

Methods
Human Evaluation Approaches
Common evaluation methods:

1. Likert scale rating: "Rate this response from 1-5"
   1 = Poor (unhelpful, incorrect), 2 = Below average, 3 = Acceptable, 4 = Good, 5 = Excellent
   Pros: granular feedback. Cons: calibration varies between raters.

2. Pairwise comparison: "Which response is better: A or B?"
   Options: A wins / B wins / Tie
   Pros: easier for raters, more consistent. Cons: needs many comparisons for a full ranking.
   Used by Chatbot Arena to compute Elo ratings (see the sketch below).

3. Multi-dimensional rating: rate each dimension separately
   - Accuracy: 1-5
   - Helpfulness: 1-5
   - Fluency: 1-5
   - Safety: 1-5
   Pros: identifies specific weaknesses. Cons: takes longer per evaluation.

4. Binary classification: "Is this response acceptable? Yes/No", "Contains factual errors? Yes/No"
   Pros: simple, high agreement. Cons: loses nuance.

5. Error annotation: highlight specific issues in the response
   - [HALLUCINATION] "Paris is in Germany"
   - [GRAMMAR] "They was going"
   - [INCOMPLETE] Missing key information
   Pros: actionable feedback. Cons: time-consuming, requires rater training.
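
Pairwise votes are typically aggregated into a ranking; below is a rough sketch of the Elo-style updates behind leaderboards such as Chatbot Arena. The K-factor and the toy votes are illustrative assumptions, not the Arena's actual parameters or data.

```python
# Rough sketch of turning pairwise votes into Elo-style ratings.
# The K-factor and the toy votes are illustrative assumptions,
# not Chatbot Arena's actual parameters or data.
from collections import defaultdict

K = 32  # update step size (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, model_a, model_b, outcome):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1 - outcome) - (1 - e_a))

ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000

# Toy pairwise votes: (model A, model B, outcome for A).
votes = [("model-x", "model-y", 1.0),
         ("model-x", "model-y", 0.5),
         ("model-y", "model-x", 1.0)]

for a, b, outcome in votes:
    update(ratings, a, b, outcome)

print(dict(ratings))
```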
Best Practices
Running Human Evaluations
1. Clear rubrics
   Bad: "Rate helpfulness"
   Good: "Rate helpfulness: 5 = Fully answers the question, 4 = Mostly answers with minor gaps, 3 = Partially answers, 2 = Attempts but largely fails, 1 = Does not address the question"

2. Multiple raters
   - Get 3+ ratings per item
   - Calculate inter-annotator agreement (Cohen's kappa > 0.6 is generally acceptable)
   - Resolve disagreements with discussion

3. Quality control
   - Include "gold" questions with known answers
   - Flag raters with low accuracy on gold questions (see the sketch below)
   - Check for random or inattentive responses
   - Sanity-check time spent per task

4. Bias mitigation
   - Randomize the order of options
   - Hide model identity (blind evaluation)
   - Balance sample selection
   - Use a diverse rater pool

5. Sample size
   - Minimum 100-200 examples
   - More for pairwise comparison (rankings require many pairs)
   - Run a power analysis for statistical significance

Platforms: Amazon Mechanical Turk, Scale AI, Surge AI, internal annotation teams.

Cost estimation: roughly $0.10-$5.00 per evaluation. Example: 100 examples × 3 raters × $0.50 per rating = $150. A rigorous evaluation typically runs $500-$5,000.
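
A minimal sketch of the gold-question quality check from point 3: compare each rater's answers against items with known correct answers and flag anyone below a threshold. The rater names, answers, and the 0.8 threshold are illustrative assumptions.

```python
# Minimal sketch: flag raters whose accuracy on "gold" questions
# (items with known correct answers) falls below a threshold.
# Rater names, answers, and the 0.8 threshold are illustrative assumptions.
GOLD_ACCURACY_THRESHOLD = 0.8

gold_answers = {"q1": "yes", "q2": "no", "q3": "yes"}

# rater -> {question_id: answer given}
rater_responses = {
    "rater_1": {"q1": "yes", "q2": "no", "q3": "yes"},
    "rater_2": {"q1": "no", "q2": "no", "q3": "no"},
}

for rater, answers in rater_responses.items():
    correct = sum(answers.get(q) == gold for q, gold in gold_answers.items())
    accuracy = correct / len(gold_answers)
    status = "OK" if accuracy >= GOLD_ACCURACY_THRESHOLD else "FLAG for review"
    print(f"{rater}: gold accuracy {accuracy:.0%} -> {status}")
```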

Interactive Exercise

Design a Rubric

Create a 5-point rubric for evaluating "factual accuracy" of AI responses about historical events.

Pro Tips
  • Pilot test rubrics with a small sample before full evaluation
  • Pairwise comparison has higher agreement than absolute ratings
  • Domain experts are worth the cost for technical content
  • Combine human evaluation with automated metrics for efficiency (see the triage sketch below)
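
One way to combine the two, sketched under assumptions: score every response with an automated metric, auto-accept the clear passes, auto-reject the clear failures, and route only the borderline band to human raters. The automated_score heuristic and the thresholds here are hypothetical placeholders.

```python
# Sketch of combining an automated metric with human review:
# auto-accept high-scoring outputs, auto-reject very low ones, and
# send only the borderline middle band to human raters.
# `automated_score` and the thresholds are hypothetical placeholders.
def automated_score(response: str) -> float:
    """Stand-in for any automated metric (e.g., a similarity or judge score)."""
    return min(len(response) / 100, 1.0)  # toy heuristic for illustration

AUTO_ACCEPT, AUTO_REJECT = 0.9, 0.2

responses = [
    "Too short.",                                                   # auto-reject
    "A medium-length answer that covers the question partially.",   # human review
    "A long, detailed answer. " * 5,                                 # auto-accept
]

needs_human_review = []
for r in responses:
    score = automated_score(r)
    if AUTO_REJECT < score < AUTO_ACCEPT:
        needs_human_review.append(r)  # only these go to human raters

print(f"{len(needs_human_review)} of {len(responses)} responses need human review")
```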

Related Terms