Performance / Metrics

Throughput

Foundational [2/5]
Tokens per second · Requests per second · QPS (queries per second)

Definition

Throughput measures the volume of work an LLM system can process over time, typically expressed as tokens per second or requests per second. High throughput is essential for serving many users efficiently and keeping infrastructure costs manageable.

Throughput and latency often have an inverse relationship - optimizing for one may hurt the other.
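
To make the two units concrete, here is a toy calculation (all numbers are invented for illustration) showing how tokens per second and requests per second relate for a server that finishes one batch of requests at a time:

    # Toy throughput arithmetic (illustrative numbers, not benchmarks).
    batch_size = 8            # requests processed together
    batch_latency_s = 0.15    # seconds to finish the whole batch
    tokens_per_request = 500  # output tokens per request

    requests_per_second = batch_size / batch_latency_s
    tokens_per_second = requests_per_second * tokens_per_request

    print(f"{requests_per_second:.1f} req/s, {tokens_per_second:,.0f} tokens/s")
    # -> 53.3 req/s, 26,667 tokens/s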

Key Concepts

  • Tokens/second: Raw generation speed across all requests
  • Requests/second: Number of complete requests handled per second
  • Batching: Processing multiple requests together for efficiency
  • Capacity planning: Matching throughput to expected load

Examples

Comparison
Throughput vs Latency Trade-offs
THROUGHPUT VS LATENCY:

SCENARIO 1: Single request (optimize latency)
┌─────────────────────────────────────┐
│ GPU: Processing 1 request           │
│ Latency: 100ms                      │
│ Throughput: 10 req/s                │
│ GPU utilization: 20%                │
└─────────────────────────────────────┘
Fast for user, but inefficient!

SCENARIO 2: Batch of 8 (optimize throughput)
┌─────────────────────────────────────┐
│ GPU: Processing 8 requests together │
│ Latency: 150ms (50% higher)         │
│ Throughput: 53 req/s (5x higher!)   │
│ GPU utilization: 90%                │
└─────────────────────────────────────┘
Slower per request, but 5x more efficient!

CONTINUOUS BATCHING (best of both):
┌─────────────────────────────────────┐
│ New requests added as others finish │
│ GPU always processing full batch    │
│ Latency: ~120ms (only 20% higher)   │
│ Throughput: 45 req/s (4.5x higher!) │
│ GPU utilization: 85%                │
└─────────────────────────────────────┘

THE TRADE-OFF CURVE:

Latency
  ↑
  │ ╲
  │  ╲   Pareto frontier
  │   ╲
  │    ╲__________
  │               ╲
  └─────────────────────► Throughput

You can't have both optimal latency AND optimal throughput - pick your priority!
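
A minimal sketch of the trade-off above, using an invented linear latency model (fixed overhead plus a per-request cost; real GPU behaviour is more complicated). Larger batches raise per-request latency, but throughput rises faster:

    # Toy model of the batch-size trade-off; the latency model and its
    # constants are assumptions for illustration, not measurements.
    def batch_latency_ms(batch_size, overhead_ms=90.0, per_request_ms=7.5):
        return overhead_ms + per_request_ms * batch_size

    for batch_size in (1, 2, 4, 8, 16):
        latency_ms = batch_latency_ms(batch_size)
        throughput_rps = batch_size / (latency_ms / 1000)
        print(f"batch={batch_size:2d}  latency={latency_ms:5.1f} ms  "
              f"throughput={throughput_rps:5.1f} req/s")
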
Optimization
Maximizing Throughput
THROUGHPUT OPTIMIZATION TECHNIQUES:

1. CONTINUOUS BATCHING (vLLM, TGI):
   # Instead of waiting for the whole batch to complete,
   # add new requests as slots free up
   Traditional: [R1, R2, R3, R4] → wait → [R5, R6, R7, R8]
   Continuous:  [R1, R2, R3, R4] → R4 done → [R1, R2, R3, R5]
   Result: ~2-3x throughput improvement

2. PAGEDATTENTION (vLLM):
   - Manages the KV cache like virtual memory
   - No memory wasted on padding
   - Enables larger batches

3. TENSOR PARALLELISM:
   - Splits the model across multiple GPUs
   - Each GPU handles part of each layer
   - Good for large models

4. QUANTIZATION:
   - INT8/INT4 = faster computation
   - Fits larger batches in memory
   - ~2x throughput, slight quality loss

5. SPECULATIVE DECODING:
   - Small model proposes tokens
   - Large model verifies in parallel
   - ~2x throughput for certain workloads

THROUGHPUT COMPARISON (A100 GPU):
┌──────────────────┬─────────────────────┐
│ Setup            │ Tokens/sec          │
├──────────────────┼─────────────────────┤
│ Naive (batch=1)  │ 50-100              │
│ Basic batching   │ 500-1,000           │
│ vLLM             │ 2,000-5,000         │
│ vLLM + quantized │ 4,000-10,000        │
│ TensorRT-LLM     │ 5,000-15,000        │
└──────────────────┴─────────────────────┘

10-100x improvement with optimization!
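
For reference, continuous batching and PagedAttention come built in when a model is served through an engine such as vLLM. A minimal offline-batching sketch (the model name and sampling values are placeholders) looks roughly like this:

    # Minimal vLLM sketch; requires a GPU with vLLM installed.
    # The model name and sampling parameters are placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(temperature=0.7, max_tokens=256)

    prompts = [f"Summarize document {i}." for i in range(64)]

    # generate() schedules all prompts through the continuously batched
    # engine, so throughput scales with how many requests fit on the GPU.
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text[:80])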

Interactive Exercise

Capacity Planning

Your system generates 2,000 tokens/second. Average request needs 500 output tokens. Peak load is 100 concurrent users. Can you handle this load? What's your requests/second capacity?
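
One way to reason through it (a sketch using the exercise's numbers; the per-user request rate is an assumption added for illustration):

    # Back-of-the-envelope capacity check with the exercise's numbers.
    tokens_per_second = 2_000
    tokens_per_request = 500
    concurrent_users = 100

    capacity_rps = tokens_per_second / tokens_per_request  # 4.0 req/s

    # Assume each user sends a request every 30 s on average (invented).
    offered_rps = concurrent_users * (1 / 30)               # ~3.3 req/s

    print(f"capacity: {capacity_rps:.1f} req/s, offered: {offered_rps:.1f} req/s")
    print("OK" if offered_rps <= capacity_rps else "Over capacity")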

Pro Tips
  • Use vLLM or TensorRT-LLM for production throughput optimization
  • Plan for 2-3x peak load to handle spikes
  • Monitor queue depth to detect throughput bottlenecks
  • Consider auto-scaling to match throughput to demand
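
A tiny sketch tying together the headroom and queue-depth tips above (the 2.5x multiplier and the queue threshold are illustrative assumptions, not standards):

    # Rough headroom and queue-depth checks for capacity planning.
    def required_capacity_rps(peak_rps, headroom=2.5):
        """Capacity to provision so spikes above peak can be absorbed."""
        return peak_rps * headroom

    def should_scale_out(queue_depth, max_queue_depth=32):
        """Flag a throughput bottleneck when the request queue grows."""
        return queue_depth > max_queue_depth

    print(required_capacity_rps(4.0))        # provision ~10 req/s for a 4 req/s peak
    print(should_scale_out(queue_depth=48))  # True -> time to add replicas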

Related Terms