Evaluation / Production

A/B Testing

Foundational [2/5]
Split testing · Controlled experiment · Online evaluation

Definition

A/B testing compares two or more variants (models, prompts, features) by randomly assigning users to different groups and measuring their outcomes. This provides statistically rigorous evidence about which variant performs better in production with real users.

A/B testing is the gold standard for measuring actual impact on user behavior and business metrics.

Key Concepts

  • Control vs Treatment: The baseline variant (A) compared against the new variant (B)
  • Random assignment: Users are randomly split between groups so the groups are comparable
  • Statistical significance: Confidence that the observed difference isn't due to chance
  • Sample size: Enough users per group to detect the expected effect reliably

Examples

Setup
A/B Test Structure for LLMs
A/B TEST STRUCTURE:

CONTROL GROUP (A):
- 50% of users
- Current model/prompt
- Baseline metrics

TREATMENT GROUP (B):
- 50% of users
- New model/prompt
- Measure improvement

EXAMPLE: Testing new model version

SETUP:
┌──────────────────────────────────────────┐
│             All Users (100%)             │
└────────────────┬─────────────────────────┘
                 │ Random Split
        ┌────────┴────────┐
        ↓                 ↓
┌───────────────┐ ┌───────────────┐
│  Control (A)  │ │ Treatment (B) │
│      50%      │ │      50%      │
├───────────────┤ ├───────────────┤
│  GPT-4-0613   │ │  GPT-4-0125   │
│  Old prompt   │ │  New prompt   │
└───────────────┘ └───────────────┘
        │                 │
        ↓                 ↓
┌───────────────┐ ┌───────────────┐
│ Measure:      │ │ Measure:      │
│ - Completion  │ │ - Completion  │
│ - Satisfaction│ │ - Satisfaction│
│ - Latency     │ │ - Latency     │
└───────────────┘ └───────────────┘
        │                 │
        └────────┬────────┘
                 ↓
        ┌───────────────┐
        │    Compare    │
        │    Metrics    │
        │ (Statistical  │
        │ significance) │
        └───────────────┘
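
To make the final "Compare Metrics" box concrete, here is a minimal sketch of the significance check as a two-proportion z-test. It assumes statsmodels is available; the counts and variable names are illustrative, not real experiment data.

# Minimal sketch of the "Compare Metrics" step (illustrative counts).
from statsmodels.stats.proportion import proportions_ztest

completed = [620, 668]     # tasks completed in control (A) and treatment (B)
users     = [3000, 3000]   # users assigned to each group

z_stat, p_value = proportions_ztest(completed, users)
print(f"control rate:   {completed[0] / users[0]:.1%}")
print(f"treatment rate: {completed[1] / users[1]:.1%}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

# Common convention: treat p < 0.05 as significant, but only
# after the planned sample size has been reached.
if p_value < 0.05:
    print("Difference is statistically significant")
else:
    print("No significant difference detected")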
Implementation
Running LLM A/B Tests
A/B TEST IMPLEMENTATION:

import hashlib

def get_variant(user_id, experiment_name, split=0.5):
    """Deterministic assignment based on user_id: the same user always
    lands in the same group for a given experiment."""
    hash_input = f"{user_id}:{experiment_name}"
    hash_val = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    return "treatment" if (hash_val % 100) / 100 < split else "control"

# Usage
variant = get_variant(user_id, "new_model_test")
if variant == "treatment":
    response = call_new_model(prompt)
else:
    response = call_current_model(prompt)

# Log for analysis
log_event({
    "user_id": user_id,
    "variant": variant,
    "experiment": "new_model_test",
    "response_time_ms": latency,
    "user_rating": rating,
    "task_completed": completed,
})

METRICS TO TRACK:

Quality metrics:
- User satisfaction ratings
- Task completion rate
- Follow-up questions (fewer = better)
- Error reports

Engagement metrics:
- Session length
- Messages per session
- Return rate

Operational metrics:
- Latency (p50, p95, p99)
- Cost per request
- Token usage

SAMPLE SIZE CALCULATION:

For detecting an absolute 5-point lift on a 20% baseline (20% → 25%):
- Baseline conversion: 20%
- Minimum detectable effect: +5 percentage points
- Statistical power: 80%
- Significance level: 5% (two-sided)

Required: ~1,100 users per group
Duration: depends on traffic
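
As a cross-check on the sample-size figures above, here is a sketch of the standard power calculation, assuming statsmodels is installed; it asks how many users per group are needed to detect a 20% → 25% lift at 80% power and 5% two-sided significance.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.20   # current task-completion rate (control)
target_rate = 0.25     # smallest lift worth detecting (treatment)

# Cohen's h effect size for comparing two proportions
effect_size = proportion_effectsize(target_rate, baseline_rate)

# Solve for users per group at 80% power, 5% two-sided significance
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    ratio=1.0,             # equal split between control and treatment
)
print(f"Users per group: {n_per_group:.0f}")   # roughly 1,100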

Interactive Exercise

Design an A/B Test

You want to test whether a new system prompt improves user satisfaction. Design the A/B test: how will you assign users, what will you measure, how many users do you need per group, and how long will you run it?

Pro Tips
  • Run tests long enough to capture weekly usage patterns (at least one full week)
  • Don't peek at results and stop early; fix the sample size in advance, since repeated checks inflate false positives
  • Watch guardrail metrics (latency, error rate, cost) even when they aren't the primary metric (see the sketch after this list)
  • Consider interaction effects with other features and concurrently running experiments
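
A minimal sketch of a guardrail check on latency, assuming events like those logged in the implementation example have been collected into a list; the field names mirror that example and the 200 ms threshold is an arbitrary illustration.

import statistics

def latency_guardrail(events, threshold_ms=200):
    """Compare latency percentiles between variants and flag large regressions."""
    by_variant = {
        v: sorted(e["response_time_ms"] for e in events if e["variant"] == v)
        for v in ("control", "treatment")
    }
    for pct in (50, 95, 99):
        # statistics.quantiles(n=100) returns the 1st..99th percentile cut points
        a = statistics.quantiles(by_variant["control"], n=100)[pct - 1]
        b = statistics.quantiles(by_variant["treatment"], n=100)[pct - 1]
        flag = "  <-- possible regression" if b - a > threshold_ms else ""
        print(f"p{pct}: control {a:.0f} ms, treatment {b:.0f} ms{flag}")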

Related Terms