
Latency

Foundational [2/5]
Response time · Time to first token (TTFT) · End-to-end latency

Definition

Latency measures the time delay in an LLM system, from when a request is sent to when a response (or first token) is received. Low latency is critical for interactive applications where users expect quick responses.

For LLMs, latency includes prompt processing time, model inference time, and network overhead.

Key Concepts

  • Time to First Token (TTFT): Delay before first output token appears
  • Time per Output Token (TPOT): Time to generate each subsequent token
  • End-to-end latency: Total time from request to complete response
  • Percentiles (p50, p95, p99): Latency below which 50/95/99% of requests complete; the tail (p95/p99) reflects the worst user experiences
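In practice, the first three numbers fall out of a single timed loop over a streaming response. A minimal measurement sketch in Python (the token iterator is a stand-in for whatever streaming client you actually use):

    import time

    def measure_stream(token_stream):
        # Walk a token iterator, recording TTFT, average TPOT,
        # and end-to-end latency.
        start = time.perf_counter()
        ttft = None
        count = 0
        for _token in token_stream:
            if ttft is None:
                ttft = time.perf_counter() - start  # first token arrived
            count += 1
        total = time.perf_counter() - start         # end-to-end latency
        tpot = (total - ttft) / max(count - 1, 1)   # avg time per later token
        return ttft, tpot, total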

Examples

Breakdown
LLM Latency Components
LATENCY BREAKDOWN:

Total Latency = Network (request) + TTFT + (TPOT × output_tokens) + Network (response)

┌──────────────────────────────────────────────────┐
│ LLM Request Timeline                             │
├──────────────────────────────────────────────────┤
│                                                  │
│ Request ─────► Server                            │
│   └─ Network latency (~20-100ms)                 │
│                                                  │
│ Server: Process prompt                           │
│   └─ Tokenization (~1-5ms)                       │
│   └─ Prefill: all prompt tokens (~50-500ms)      │
│   └─ = TTFT (Time to First Token)                │
│                                                  │
│ Server: Generate response                        │
│   └─ Token 1: 50ms ──► stream to client          │
│   └─ Token 2: 50ms ──► stream to client          │
│   └─ Token 3: 50ms ──► stream to client          │
│   └─ ... (TPOT per token)                        │
│                                                  │
│ Response ◄───── Server                           │
│   └─ Network latency (~20-100ms)                 │
│                                                  │
└──────────────────────────────────────────────────┘

EXAMPLE CALCULATION:
- Network latency (one-way): 50ms
- TTFT: 200ms (1000 token prompt)
- TPOT: 30ms per token
- Output tokens: 100

Total = 50 + 200 + (30 × 100) + 50
      = 50 + 200 + 3000 + 50
      = 3300ms (3.3 seconds)

WITH STREAMING:
- User sees first token at: 50 + 200 = 250ms
- Feels faster even though the total is the same!
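The arithmetic above is simple enough to wrap in a helper. A minimal sketch (the function name and the one-way network split are illustrative, not from any particular library):

    def total_latency_ms(network_oneway_ms, ttft_ms, tpot_ms, output_tokens):
        # Request leg + prefill + per-token generation + response leg
        return (network_oneway_ms + ttft_ms
                + tpot_ms * output_tokens + network_oneway_ms)

    print(total_latency_ms(50, 200, 30, 100))  # 3300 (the example above)
    print(50 + 200)                            # 250: first token seen, with streaming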
Optimization
Reducing LLM Latency
LATENCY OPTIMIZATION TECHNIQUES:

1. REDUCE TTFT (Prefill Time):
   - Shorter prompts (fewer tokens)
   - Prompt caching (reuse common prefixes)
   - Better hardware (faster GPUs)
   - Quantization (faster computation)

2. REDUCE TPOT (Generation Speed):
   - Speculative decoding (~2x faster)
   - Smaller models (trades quality)
   - Quantization (INT8, INT4)
   - Optimized inference (vLLM, TensorRT)

3. STREAMING:

   # Without streaming: wait for all tokens
   response = model.generate(prompt)   # 3s total
   print(response)

   # With streaming: see tokens as generated
   for token in model.stream(prompt):
       print(token, end="")            # 250ms to first char
   # Total still 3s, but feels instant!

4. CACHING:
   - Response caching (exact match)
   - Semantic caching (similar queries)
   - KV cache reuse (conversation context)

5. ROUTING:
   - Small model for simple queries
   - Large model for complex queries
   - Reduces average latency

LATENCY TARGETS:
┌──────────────────┬───────────────────────┐
│ Use Case         │ Target TTFT           │
├──────────────────┼───────────────────────┤
│ Chatbot          │ < 500ms               │
│ Code completion  │ < 200ms               │
│ Search           │ < 300ms               │
│ Batch processing │ Not latency sensitive │
└──────────────────┴───────────────────────┘
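Of the techniques above, exact-match response caching is the simplest to sketch. A minimal version in Python (model.generate is a hypothetical client call; a real cache would also need eviction and TTLs):

    import hashlib

    _cache = {}  # SHA-256 of prompt -> cached response

    def cached_generate(model, prompt):
        # Exact-match response cache: a repeated prompt skips inference
        # entirely, so latency collapses to a dict lookup.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in _cache:
            _cache[key] = model.generate(prompt)  # cache miss: full latency
        return _cache[key]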

Interactive Exercise

Calculate Total Latency

TTFT: 150ms, TPOT: 25ms, Output tokens: 200, Network RTT: 40ms

Calculate the total latency and when the user sees the first token (with streaming).
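One worked solution (a sketch; it follows the breakdown above and splits the 40ms RTT into two 20ms legs):

    rtt_ms, ttft_ms, tpot_ms, n_tokens = 40, 150, 25, 200

    total = rtt_ms + ttft_ms + tpot_ms * n_tokens  # 40 + 150 + 5000
    first_token = rtt_ms / 2 + ttft_ms             # request leg + TTFT

    print(total)        # 5190 -> ~5.2 seconds end to end
    print(first_token)  # 170.0 -> first token in ~170ms with streaming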

Pro Tips
  • Always use streaming for interactive applications - even when the total time is the same, perceived latency drops dramatically
  • Optimize TTFT first - it's the delay users notice most
  • Track p95 and p99 latency, not just the average (see the sketch below)
  • Consider geographic distribution - deploy closer to users
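For the percentile tip, the standard library is enough. A minimal sketch (the sample values are made up for illustration):

    import statistics

    def latency_percentiles(samples_ms):
        # quantiles(n=100) returns 99 cut points; index 94 ~ p95, 98 ~ p99
        qs = statistics.quantiles(samples_ms, n=100)
        return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

    samples = [120, 135, 140, 150, 180, 210, 250, 400, 950, 1800]
    print(latency_percentiles(samples))  # tail is far worse than the median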

Related Terms