Performance / Metrics

Throughput

Foundational [2/5]
Tokens per second · Requests per second · QPS (queries per second)

Definition

Throughput measures the volume of work an LLM system can process over time, typically expressed as tokens per second or requests per second. High throughput is essential for serving many users efficiently and keeping infrastructure costs manageable.

Throughput and latency often have an inverse relationship - optimizing for one may hurt the other.
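
To make the two units concrete, here is a toy calculation (all numbers are invented for illustration) showing how tokens per second and requests per second relate for a server that finishes one batch of requests at a time:

    # Toy throughput arithmetic (illustrative numbers, not benchmarks).
    batch_size = 8            # requests processed together
    batch_latency_s = 0.15    # seconds to finish the whole batch
    tokens_per_request = 500  # output tokens per request

    requests_per_second = batch_size / batch_latency_s
    tokens_per_second = requests_per_second * tokens_per_request

    print(f"{requests_per_second:.1f} req/s, {tokens_per_second:,.0f} tokens/s")
    # -> 53.3 req/s, 26,667 tokens/s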

Key Concepts

  • Tokens/second: Raw generation speed across all requests
  • Requests/second: Number of complete requests handled per second
  • Batching: Processing multiple requests together for efficiency
  • Capacity planning: Matching throughput to expected load

Examples

Comparison
Throughput vs Latency Trade-offs
THROUGHPUT VS LATENCY:

SCENARIO 1: Single request (optimize latency)
┌─────────────────────────────────────┐
│ GPU: Processing 1 request           │
│ Latency: 100ms                      │
│ Throughput: 10 req/s                │
│ GPU utilization: 20%                │
└─────────────────────────────────────┘
Fast for user, but inefficient!

SCENARIO 2: Batch of 8 (optimize throughput)
┌─────────────────────────────────────┐
│ GPU: Processing 8 requests together │
│ Latency: 150ms (50% higher)         │
│ Throughput: 53 req/s (5x higher!)   │
│ GPU utilization: 90%                │
└─────────────────────────────────────┘
Slower per request, but 5x more efficient!

CONTINUOUS BATCHING (best of both):
┌─────────────────────────────────────┐
│ New requests added as others finish │
│ GPU always processing full batch    │
│ Latency: ~120ms (only 20% higher)   │
│ Throughput: 45 req/s (4.5x higher!) │
│ GPU utilization: 85%                │
└─────────────────────────────────────┘

THE TRADE-OFF CURVE:

Latency
  ↑
  │ ╲
  │  ╲   Pareto frontier
  │   ╲
  │    ╲__________
  │               ╲
  └─────────────────────► Throughput

You can't have both optimal latency AND optimal throughput - pick your priority!
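
A minimal sketch of the trade-off above, using an invented linear latency model (fixed overhead plus a per-request cost; real GPU behaviour is more complicated). Larger batches raise per-request latency, but throughput rises faster:

    # Toy model of the batch-size trade-off; the latency model and its
    # constants are assumptions for illustration, not measurements.
    def batch_latency_ms(batch_size, overhead_ms=90.0, per_request_ms=7.5):
        return overhead_ms + per_request_ms * batch_size

    for batch_size in (1, 2, 4, 8, 16):
        latency_ms = batch_latency_ms(batch_size)
        throughput_rps = batch_size / (latency_ms / 1000)
        print(f"batch={batch_size:2d}  latency={latency_ms:5.1f} ms  "
              f"throughput={throughput_rps:5.1f} req/s")
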
Optimization
Maximizing Throughput
THROUGHPUT OPTIMIZATION TECHNIQUES:

1. CONTINUOUS BATCHING (vLLM, TGI):
   # Instead of waiting for the whole batch to complete,
   # add new requests as slots free up
   Traditional: [R1, R2, R3, R4] → wait → [R5, R6, R7, R8]
   Continuous:  [R1, R2, R3, R4] → R4 done → [R1, R2, R3, R5]
   Result: ~2-3x throughput improvement

2. PAGEDATTENTION (vLLM):
   - Manages the KV cache like virtual memory
   - No memory wasted on padding
   - Enables larger batches

3. TENSOR PARALLELISM:
   - Splits the model across multiple GPUs
   - Each GPU handles part of each layer
   - Good for large models

4. QUANTIZATION:
   - INT8/INT4 = faster computation
   - Fits larger batches in memory
   - ~2x throughput, slight quality loss

5. SPECULATIVE DECODING:
   - Small model proposes tokens
   - Large model verifies in parallel
   - ~2x throughput for certain workloads

THROUGHPUT COMPARISON (A100 GPU):
┌──────────────────┬─────────────────────┐
│ Setup            │ Tokens/sec          │
├──────────────────┼─────────────────────┤
│ Naive (batch=1)  │ 50-100              │
│ Basic batching   │ 500-1,000           │
│ vLLM             │ 2,000-5,000         │
│ vLLM + quantized │ 4,000-10,000        │
│ TensorRT-LLM     │ 5,000-15,000        │
└──────────────────┴─────────────────────┘

10-100x improvement with optimization!
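
For reference, continuous batching and PagedAttention come built in when a model is served through an engine such as vLLM. A minimal offline-batching sketch (the model name and sampling values are placeholders) looks roughly like this:

    # Minimal vLLM sketch; requires a GPU with vLLM installed.
    # The model name and sampling parameters are placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=256)

    prompts = [f"Summarize document {i}." for i in range(64)]

    # generate() schedules all prompts through the continuously batched
    # engine, so throughput scales with how many requests fit on the GPU.
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text[:80])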

Interactive Exercise

Capacity Planning

Your system generates 2,000 tokens/second. Average request needs 500 output tokens. Peak load is 100 concurrent users. Can you handle this load? What's your requests/second capacity?
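
One way to reason through it (a sketch using the exercise's numbers; the per-user request rate is an assumption added for illustration):

    # Back-of-the-envelope capacity check with the exercise's numbers.
    tokens_per_second = 2_000
    tokens_per_request = 500
    concurrent_users = 100

    capacity_rps = tokens_per_second / tokens_per_request  # 4.0 req/s

    # Assume each user sends a request every 30 s on average (invented).
    offered_rps = concurrent_users * (1 / 30)               # ~3.3 req/s

    print(f"capacity: {capacity_rps:.1f} req/s, offered: {offered_rps:.1f} req/s")
    print("OK" if offered_rps <= capacity_rps else "Over capacity")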

Pro Tips
  • Use vLLM or TensorRT-LLM for production throughput optimization
  • Plan for 2-3x peak load to handle spikes
  • Monitor queue depth to detect throughput bottlenecks
  • Consider auto-scaling to match throughput to demand
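
A tiny sketch tying together the headroom and queue-depth tips above (the 2.5x multiplier and the queue threshold are illustrative assumptions, not standards):

    # Rough headroom and queue-depth checks for capacity planning.
    def required_capacity_rps(peak_rps, headroom=2.5):
        """Capacity to provision so spikes above peak can be absorbed."""
        return peak_rps * headroom

    def should_scale_out(queue_depth, max_queue_depth=32):
        """Flag a throughput bottleneck when the request queue grows."""
        return queue_depth > max_queue_depth

    print(required_capacity_rps(4.0))        # provision ~10 req/s for a 4 req/s peak
    print(should_scale_out(queue_depth=48))  # True -> time to add replicas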

Related Terms