Evaluation / Production

A/B Testing

Foundational [2/5]
Split testing · Controlled experiment · Online evaluation

Definition

A/B testing compares two or more variants (models, prompts, features) by randomly assigning users to different groups and measuring their outcomes. This provides statistically rigorous evidence about which variant performs better in production with real users.

A/B testing is the gold standard for measuring actual impact on user behavior and business metrics.

Key Concepts

  • Control vs Treatment: The baseline variant (A) compared against the new variant (B)
  • Random assignment: Users are randomly split between groups so the groups are comparable
  • Statistical significance: Confidence that the observed difference isn't due to chance
  • Sample size: Enough users per group to detect the expected effect reliably

Examples

Setup
A/B Test Structure for LLMs
A/B TEST STRUCTURE:

CONTROL GROUP (A):
- 50% of users
- Current model/prompt
- Baseline metrics

TREATMENT GROUP (B):
- 50% of users
- New model/prompt
- Measure improvement

EXAMPLE: Testing new model version

SETUP:
┌──────────────────────────────────────────┐
│             All Users (100%)             │
└────────────────┬─────────────────────────┘
                 │ Random Split
        ┌────────┴────────┐
        ↓                 ↓
┌───────────────┐ ┌───────────────┐
│  Control (A)  │ │ Treatment (B) │
│      50%      │ │      50%      │
├───────────────┤ ├───────────────┤
│  GPT-4-0613   │ │  GPT-4-0125   │
│  Old prompt   │ │  New prompt   │
└───────────────┘ └───────────────┘
        │                 │
        ↓                 ↓
┌───────────────┐ ┌───────────────┐
│ Measure:      │ │ Measure:      │
│ - Completion  │ │ - Completion  │
│ - Satisfaction│ │ - Satisfaction│
│ - Latency     │ │ - Latency     │
└───────────────┘ └───────────────┘
        │                 │
        └────────┬────────┘
                 ↓
        ┌───────────────┐
        │    Compare    │
        │    Metrics    │
        │ (Statistical  │
        │ significance) │
        └───────────────┘
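
To make the final "Compare Metrics" box concrete, here is a minimal sketch of the significance check as a two-proportion z-test. It assumes statsmodels is available; the counts and variable names are illustrative, not real experiment data.

# Minimal sketch of the "Compare Metrics" step (illustrative counts).
from statsmodels.stats.proportion import proportions_ztest

completed = [620, 668]     # tasks completed in control (A) and treatment (B)
users     = [3000, 3000]   # users assigned to each group

z_stat, p_value = proportions_ztest(completed, users)
print(f"control rate:   {completed[0] / users[0]:.1%}")
print(f"treatment rate: {completed[1] / users[1]:.1%}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Common convention: treat p < 0.05 as significant, but only
# after the planned sample size has been reached.
if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("No significant difference detected")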
Implementation
Running LLM A/B Tests
A/B TEST IMPLEMENTATION:

import hashlib

def get_variant(user_id, experiment_name, split=0.5):
    """Deterministic assignment based on user_id: the same user always
    lands in the same group for a given experiment."""
    hash_input = f"{user_id}:{experiment_name}"
    hash_val = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    return "treatment" if (hash_val % 100) / 100 < split else "control"

# Usage
variant = get_variant(user_id, "new_model_test")
if variant == "treatment":
    response = call_new_model(prompt)
else:
    response = call_current_model(prompt)

# Log for analysis
log_event({
    "user_id": user_id,
    "variant": variant,
    "experiment": "new_model_test",
    "response_time_ms": latency,
    "user_rating": rating,
    "task_completed": completed,
})

METRICS TO TRACK:

Quality metrics:
- User satisfaction ratings
- Task completion rate
- Follow-up questions (fewer = better)
- Error reports

Engagement metrics:
- Session length
- Messages per session
- Return rate

Operational metrics:
- Latency (p50, p95, p99)
- Cost per request
- Token usage

SAMPLE SIZE CALCULATION:

For detecting an absolute 5-point lift on a 20% baseline (20% → 25%):
- Baseline conversion: 20%
- Minimum detectable effect: +5 percentage points
- Statistical power: 80%
- Significance level: 5% (two-sided)

Required: ~1,100 users per group
Duration: depends on traffic
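
As a cross-check on the sample-size figures above, here is a sketch of the standard power calculation, assuming statsmodels is installed; it asks how many users per group are needed to detect a 20% → 25% lift at 80% power and 5% two-sided significance.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.20   # current task-completion rate (control)
target_rate = 0.25     # smallest lift worth detecting (treatment)

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(target_rate, baseline_rate)

# Solve for users per group at 80% power, 5% two-sided significance
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,             # equal split between control and treatment
)
print(f"Users per group: {n_per_group:.0f}")   # roughly 1,100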

Interactive Exercise

Design an A/B Test

You want to test whether a new system prompt improves user satisfaction. Design the A/B test: how will you assign users, what will you measure, how many users do you need per group, and how long will you run it?

Pro Tips
  • Run tests long enough to capture weekly usage patterns (at least one full week)
  • Don't peek at results and stop early; fix the sample size in advance, since repeated checks inflate false positives
  • Watch guardrail metrics (latency, error rate, cost) even when they aren't the primary metric (see the sketch after this list)
  • Consider interaction effects with other features and concurrently running experiments
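
A minimal sketch of a guardrail check on latency, assuming events like those logged in the implementation example have been collected into a list; the field names mirror that example and the 200 ms threshold is an arbitrary illustration.

import statistics

def latency_guardrail(events, threshold_ms=200):
    """Compare latency percentiles between variants and flag large regressions."""
    by_variant = {
        v: sorted(e["response_time_ms"] for e in events if e["variant"] == v)
        for v in ("control", "treatment")
    }
    for pct in (50, 95, 99):
        # statistics.quantiles(n=100) returns the 1st..99th percentile cut points
        a = statistics.quantiles(by_variant["control"], n=100)[pct - 1]
        b = statistics.quantiles(by_variant["treatment"], n=100)[pct - 1]
        flag = "  <-- possible regression" if b - a > threshold_ms else ""
        print(f"p{pct}: control {a:.0f} ms, treatment {b:.0f} ms{flag}")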

Related Terms