
Latency

Foundational [2/5]
Response time · Time to first token (TTFT) · End-to-end latency

Definition

Latency measures the time delay in an LLM system, from when a request is sent to when a response (or first token) is received. Low latency is critical for interactive applications where users expect quick responses.

For LLMs, latency includes prompt processing time, model inference time, and network overhead.

Key Concepts

  • Time to First Token (TTFT): Delay before first output token appears
  • Time per Output Token (TPOT): Time to generate each subsequent token
  • End-to-end latency: Total time from request to complete response
  • Percentiles (p50, p95, p99): Latency below which 50/95/99% of requests complete; the tail (p95/p99) reflects the worst user experiences
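In practice, the first three numbers fall out of a single timed loop over a streaming response. A minimal measurement sketch in Python (the token iterator is a stand-in for whatever streaming client you actually use):

    import time

    def measure_stream(token_stream):
        # Walk a token iterator, recording TTFT, average TPOT,
        # and end-to-end latency.
        start = time.perf_counter()
        ttft = None
        count = 0
        for _token in token_stream:
            if ttft is None:
                ttft = time.perf_counter() - start  # first token arrived
            count += 1
        total = time.perf_counter() - start         # end-to-end latency
        tpot = (total - ttft) / max(count - 1, 1)   # avg time per later token
        return ttft, tpot, total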

Examples

Breakdown
LLM Latency Components
LATENCY BREAKDOWN:

Total Latency = Network (request) + TTFT + (TPOT × output_tokens) + Network (response)

┌──────────────────────────────────────────────────┐
│ LLM Request Timeline                             │
├──────────────────────────────────────────────────┤
│                                                  │
│ Request ─────► Server                            │
│   └─ Network latency (~20-100ms)                 │
│                                                  │
│ Server: Process prompt                           │
│   └─ Tokenization (~1-5ms)                       │
│   └─ Prefill: all prompt tokens (~50-500ms)      │
│   └─ = TTFT (Time to First Token)                │
│                                                  │
│ Server: Generate response                        │
│   └─ Token 1: 50ms ──► stream to client          │
│   └─ Token 2: 50ms ──► stream to client          │
│   └─ Token 3: 50ms ──► stream to client          │
│   └─ ... (TPOT per token)                        │
│                                                  │
│ Response ◄───── Server                           │
│   └─ Network latency (~20-100ms)                 │
│                                                  │
└──────────────────────────────────────────────────┘

EXAMPLE CALCULATION:
- Network latency (one-way): 50ms
- TTFT: 200ms (1000 token prompt)
- TPOT: 30ms per token
- Output tokens: 100

Total = 50 + 200 + (30 × 100) + 50
      = 50 + 200 + 3000 + 50
      = 3300ms (3.3 seconds)

WITH STREAMING:
- User sees first token at: 50 + 200 = 250ms
- Feels faster even though the total is the same!
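The arithmetic above is simple enough to wrap in a helper. A minimal sketch (the function name and the one-way network split are illustrative, not from any particular library):

    def total_latency_ms(network_oneway_ms, ttft_ms, tpot_ms, output_tokens):
        # Request leg + prefill + per-token generation + response leg
        return (network_oneway_ms + ttft_ms
                + tpot_ms * output_tokens + network_oneway_ms)

    print(total_latency_ms(50, 200, 30, 100))  # 3300 (the example above)
    print(50 + 200)                            # 250: first token seen, with streaming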
Optimization
Reducing LLM Latency
LATENCY OPTIMIZATION TECHNIQUES:

1. REDUCE TTFT (Prefill Time):
   - Shorter prompts (fewer tokens)
   - Prompt caching (reuse common prefixes)
   - Better hardware (faster GPUs)
   - Quantization (faster computation)

2. REDUCE TPOT (Generation Speed):
   - Speculative decoding (~2x faster)
   - Smaller models (trades quality)
   - Quantization (INT8, INT4)
   - Optimized inference (vLLM, TensorRT)

3. STREAMING:

   # Without streaming: wait for all tokens
   response = model.generate(prompt)   # 3s total
   print(response)

   # With streaming: see tokens as generated
   for token in model.stream(prompt):
       print(token, end="")            # 250ms to first char
   # Total still 3s, but feels instant!

4. CACHING:
   - Response caching (exact match)
   - Semantic caching (similar queries)
   - KV cache reuse (conversation context)

5. ROUTING:
   - Small model for simple queries
   - Large model for complex queries
   - Reduces average latency

LATENCY TARGETS:
┌──────────────────┬───────────────────────┐
│ Use Case         │ Target TTFT           │
├──────────────────┼───────────────────────┤
│ Chatbot          │ < 500ms               │
│ Code completion  │ < 200ms               │
│ Search           │ < 300ms               │
│ Batch processing │ Not latency sensitive │
└──────────────────┴───────────────────────┘
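Of the techniques above, exact-match response caching is the simplest to sketch. A minimal version in Python (model.generate is a hypothetical client call; a real cache would also need eviction and TTLs):

    import hashlib

    _cache = {}  # SHA-256 of prompt -> cached response

    def cached_generate(model, prompt):
        # Exact-match response cache: a repeated prompt skips inference
        # entirely, so latency collapses to a dict lookup.
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key not in _cache:
            _cache[key] = model.generate(prompt)  # cache miss: full latency
        return _cache[key]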

Interactive Exercise

Calculate Total Latency

TTFT: 150ms, TPOT: 25ms, Output tokens: 200, Network RTT: 40ms

Calculate the total latency and when the user sees the first token (with streaming).
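One worked solution (a sketch; it follows the breakdown above and splits the 40ms RTT into two 20ms legs):

    rtt_ms, ttft_ms, tpot_ms, n_tokens = 40, 150, 25, 200

    total = rtt_ms + ttft_ms + tpot_ms * n_tokens  # 40 + 150 + 5000
    first_token = rtt_ms / 2 + ttft_ms             # request leg + TTFT

    print(total)        # 5190 -> ~5.2 seconds end to end
    print(first_token)  # 170.0 -> first token in ~170ms with streaming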

Pro Tips
  • Always use streaming for interactive applications - even when the total time is the same, perceived latency drops dramatically
  • Optimize TTFT first - it's the delay users notice most
  • Track p95 and p99 latency, not just the average (see the sketch below)
  • Consider geographic distribution - deploy closer to users
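For the percentile tip, the standard library is enough. A minimal sketch (the sample values are made up for illustration):

    import statistics

    def latency_percentiles(samples_ms):
        # quantiles(n=100) returns 99 cut points; index 94 ~ p95, 98 ~ p99
        qs = statistics.quantiles(samples_ms, n=100)
        return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

    samples = [120, 135, 140, 150, 180, 210, 250, 400, 950, 1800]
    print(latency_percentiles(samples))  # tail is far worse than the median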

Related Terms