LLM APIs & Integration / API Features

Streaming

Beginner [2/5]
Stream mode Server-sent events Token streaming

Definition

Streaming delivers LLM output incrementally, token by token, rather than waiting for the complete response. Users see text appear progressively, dramatically improving perceived responsiveness.

This is essential for chat interfaces where waiting 10+ seconds for long responses would feel unresponsive.

Key Concepts

  • Time to first token (TTFT): How quickly the first token appears after the request
  • Server-sent events (SSE): HTTP protocol for streaming
  • Delta chunks: Each stream message contains new tokens
  • Progressive rendering: Display text as it arrives
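
The concepts above can be seen in a minimal sketch: a simulated delta-chunk stream (standing in for a real API) where we measure TTFT and accumulate the deltas as a progressive renderer would. The `fake_token_stream` generator and its timings are illustrative assumptions, not a real API.

```python
import time

def fake_token_stream():
    """Simulated delta chunks; a real API would push these over SSE."""
    for token in ["Here", " is", " your", " response"]:
        time.sleep(0.05)  # stand-in for per-token network/model latency
        yield token

start = time.monotonic()
ttft = None
text = ""
for delta in fake_token_stream():
    if ttft is None:
        ttft = time.monotonic() - start  # time to first token
    text += delta  # a UI would display each delta as it arrives

print(f"TTFT: {ttft:.2f}s, full text: {text!r}")
```

Note that TTFT is measured at the first delta, long before the stream finishes: that gap is exactly the perceived-responsiveness win streaming buys you.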

Examples

Comparison
Streaming vs Non-Streaming UX
NON-STREAMING:
User sends message...
[                    ]  0s
[████                ]  3s    (waiting...)
[████████            ]  6s    (still waiting...)
[████████████        ]  9s    (nothing visible)
[████████████████████] 12s    "Here is your complete 500-word response..."
← User sees NOTHING until 12 seconds, then everything

STREAMING:
User sends message...
[█                   ]  0.5s → "Here"
[██                  ]  0.8s → "Here is"
[███                 ]  1.1s → "Here is your"
[████                ]  1.4s → "Here is your response"
...text continues appearing...
[████████████████████] 12s   → Complete!
← User sees text after 0.5 seconds, reads along

Same total time, dramatically better experience!
Implementation
Streaming with APIs
```python
# OpenAI streaming
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True  # Enable streaming
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Claude streaming
import anthropic

client = anthropic.Anthropic()
with client.messages.stream(
    model="claude-3-opus",
    max_tokens=1000,
    messages=[{"role": "user", "content": "Write a poem"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

```javascript
// Frontend (JavaScript)
const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ message: userInput })
});
const reader = response.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  displayChunk(new TextDecoder().decode(value));
}
```
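
The frontend loop above reads raw bytes; when the backend relays standard server-sent events, each event arrives as `data: ...` lines terminated by a blank line. Here is a minimal sketch of parsing that framing. The plain-text payloads and the `[DONE]` sentinel are assumptions for illustration; real APIs typically wrap each delta in a JSON object.

```python
def parse_sse(stream_bytes: bytes):
    """Yield the data payload of each server-sent event in a buffer."""
    for event in stream_bytes.decode().split("\n\n"):
        for line in event.splitlines():
            if line.startswith("data: "):
                payload = line[len("data: "):]
                if payload != "[DONE]":  # OpenAI-style end-of-stream sentinel
                    yield payload

# Hypothetical SSE bytes as they might arrive off the wire
raw = b"data: Hello\n\ndata:  world\n\ndata: [DONE]\n\n"
print("".join(parse_sse(raw)))  # -> "Hello world"
```

In production you would feed this parser incrementally as chunks arrive, keeping any incomplete trailing event buffered until its terminating blank line shows up.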

Interactive Exercise

Streaming Decision

Which scenarios should use streaming?

1. Chat interface for customer support
2. Background batch processing of 1000 documents
3. Code completion in an IDE
4. Async email generation (user not waiting)

Pro Tips
  • Always stream for user-facing chat interfaces
  • Non-streaming is simpler for background processing
  • Handle connection drops gracefully in stream handlers
  • Consider buffering for smoother word-by-word display
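
The last two tips can be sketched together: a display generator that buffers an uneven token stream into smoother word-by-word output, and surfaces a dropped connection instead of crashing. The chunk boundaries, `min_interval` pacing, and the `[connection lost]` marker are illustrative choices, not a prescribed API.

```python
import time

def buffered_display(chunks, min_interval=0.03):
    """Re-chunk an uneven token stream into smoother word-by-word output."""
    buffer = ""
    try:
        for chunk in chunks:
            buffer += chunk
            # Flush complete words; keep the trailing partial word buffered
            while " " in buffer:
                word, buffer = buffer.split(" ", 1)
                yield word + " "
                time.sleep(min_interval)  # pace the display
    except ConnectionError:
        # Connection dropped mid-stream: show what we have, mark truncation
        yield buffer + " [connection lost]"
        return
    if buffer:
        yield buffer  # final partial word

# Token boundaries rarely align with word boundaries; buffering hides that
out = "".join(buffered_display(["Hel", "lo wor", "ld again"], min_interval=0))
print(out)  # -> "Hello world again"
```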

Related Terms