Human Evaluation

Foundational [2/5]
Also known as: Human judgment, Manual evaluation, Human raters

Definition

Human evaluation involves having people assess AI-generated outputs for quality, accuracy, helpfulness, and other criteria. While slower and more expensive than automated metrics, human evaluation captures nuances that algorithms miss and serves as the ground truth for what "good" output looks like.

Human evaluation is essential for validating automated metrics and evaluating subjective qualities.

Key Concepts

  • Inter-annotator agreement: Consistency between different raters (see the agreement sketch after this list)
  • Rubrics: Detailed criteria for consistent evaluation
  • Pairwise comparison: Choosing between two outputs (easier than absolute rating)
  • Crowdsourcing: Using many non-expert raters vs. few experts
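
A minimal sketch of checking inter-annotator agreement on binary acceptability labels, using scikit-learn's cohen_kappa_score; the two raters' labels below are made-up illustrative data, not real ratings.

```python
# Minimal sketch: inter-annotator agreement between two raters.
# The labels below are made-up illustrative data, not real ratings.
from sklearn.metrics import cohen_kappa_score

# Each rater labels the same 10 responses as acceptable (1) or not (0).
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # > 0.6 is often treated as acceptable agreement
```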

Examples

Methods
Human Evaluation Approaches
Common evaluation methods:

1. Likert scale rating: "Rate this response from 1-5"
   1 = Poor (unhelpful, incorrect), 2 = Below average, 3 = Acceptable, 4 = Good, 5 = Excellent
   Pros: granular feedback. Cons: calibration varies between raters.

2. Pairwise comparison: "Which response is better: A or B?"
   Options: A wins / B wins / Tie
   Pros: easier for raters, more consistent. Cons: needs many comparisons for a full ranking.
   Used by Chatbot Arena to compute Elo ratings (see the sketch below).

3. Multi-dimensional rating: rate each dimension separately
   - Accuracy: 1-5
   - Helpfulness: 1-5
   - Fluency: 1-5
   - Safety: 1-5
   Pros: identifies specific weaknesses. Cons: takes longer per evaluation.

4. Binary classification: "Is this response acceptable? Yes/No", "Contains factual errors? Yes/No"
   Pros: simple, high agreement. Cons: loses nuance.

5. Error annotation: highlight specific issues in the response
   - [HALLUCINATION] "Paris is in Germany"
   - [GRAMMAR] "They was going"
   - [INCOMPLETE] Missing key information
   Pros: actionable feedback. Cons: time-consuming, requires rater training.
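
Pairwise votes are typically aggregated into a ranking; below is a rough sketch of the Elo-style updates behind leaderboards such as Chatbot Arena. The K-factor and the toy votes are illustrative assumptions, not the Arena's actual parameters or data.

```python
# Rough sketch of turning pairwise votes into Elo-style ratings.
# The K-factor and the toy votes are illustrative assumptions,
# not Chatbot Arena's actual parameters or data.
from collections import defaultdict

K = 32  # update step size (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, model_a, model_b, outcome):
    """outcome: 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1 - outcome) - (1 - e_a))

ratings = defaultdict(lambda: 1000.0)  # every model starts at 1000

# Toy pairwise votes: (model A, model B, outcome for A).
votes = [("model-x", "model-y", 1.0),
         ("model-x", "model-y", 0.5),
         ("model-y", "model-x", 1.0)]

for a, b, outcome in votes:
    update(ratings, a, b, outcome)

print(dict(ratings))
```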
Best Practices
Running Human Evaluations
1. Clear rubrics
   Bad: "Rate helpfulness"
   Good: "Rate helpfulness: 5 = Fully answers the question, 4 = Mostly answers with minor gaps, 3 = Partially answers, 2 = Attempts but largely fails, 1 = Does not address the question"

2. Multiple raters
   - Get 3+ ratings per item
   - Calculate inter-annotator agreement (Cohen's kappa > 0.6 is generally acceptable)
   - Resolve disagreements with discussion

3. Quality control
   - Include "gold" questions with known answers
   - Flag raters with low accuracy on gold questions (see the sketch below)
   - Check for random or inattentive responses
   - Sanity-check time spent per task

4. Bias mitigation
   - Randomize the order of options
   - Hide model identity (blind evaluation)
   - Balance sample selection
   - Use a diverse rater pool

5. Sample size
   - Minimum 100-200 examples
   - More for pairwise comparison (rankings require many pairs)
   - Run a power analysis for statistical significance

Platforms: Amazon Mechanical Turk, Scale AI, Surge AI, internal annotation teams.

Cost estimation: roughly $0.10-$5.00 per evaluation. Example: 100 examples × 3 raters × $0.50 per rating = $150. A rigorous evaluation typically runs $500-$5,000.
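
A minimal sketch of the gold-question quality check from point 3: compare each rater's answers against items with known correct answers and flag anyone below a threshold. The rater names, answers, and the 0.8 threshold are illustrative assumptions.

```python
# Minimal sketch: flag raters whose accuracy on "gold" questions
# (items with known correct answers) falls below a threshold.
# Rater names, answers, and the 0.8 threshold are illustrative assumptions.
GOLD_ACCURACY_THRESHOLD = 0.8

gold_answers = {"q1": "yes", "q2": "no", "q3": "yes"}

# rater -> {question_id: answer given}
rater_responses = {
    "rater_1": {"q1": "yes", "q2": "no", "q3": "yes"},
    "rater_2": {"q1": "no", "q2": "no", "q3": "no"},
}

for rater, answers in rater_responses.items():
    correct = sum(answers.get(q) == gold for q, gold in gold_answers.items())
    accuracy = correct / len(gold_answers)
    status = "OK" if accuracy >= GOLD_ACCURACY_THRESHOLD else "FLAG for review"
    print(f"{rater}: gold accuracy {accuracy:.0%} -> {status}")
```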

Interactive Exercise

Design a Rubric

Create a 5-point rubric for evaluating "factual accuracy" of AI responses about historical events.

Pro Tips
  • Pilot test rubrics with a small sample before full evaluation
  • Pairwise comparison has higher agreement than absolute ratings
  • Domain experts are worth the cost for technical content
  • Combine human evaluation with automated metrics for efficiency (see the triage sketch below)
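
One way to combine the two, sketched under assumptions: score every response with an automated metric, auto-accept the clear passes, auto-reject the clear failures, and route only the borderline band to human raters. The automated_score heuristic and the thresholds here are hypothetical placeholders.

```python
# Sketch of combining an automated metric with human review:
# auto-accept high-scoring outputs, auto-reject very low ones, and
# send only the borderline middle band to human raters.
# `automated_score` and the thresholds are hypothetical placeholders.
def automated_score(response: str) -> float:
    """Stand-in for any automated metric (e.g., a similarity or judge score)."""
    return min(len(response) / 100, 1.0)  # toy heuristic for illustration

AUTO_ACCEPT, AUTO_REJECT = 0.9, 0.2

responses = [
    "Too short.",                                                   # auto-reject
    "A medium-length answer that covers the question partially.",   # human review
    "A long, detailed answer. " * 5,                                 # auto-accept
]

needs_human_review = []
for r in responses:
    score = automated_score(r)
    if AUTO_REJECT < score < AUTO_ACCEPT:
        needs_human_review.append(r)  # only these go to human raters

print(f"{len(needs_human_review)} of {len(responses)} responses need human review")
```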

Related Terms