LLM APIs & Integration / API Features

Streaming

Beginner [2/5]
Stream mode Server-sent events Token streaming

Definition

Streaming delivers LLM output incrementally, token by token, rather than waiting for the complete response. Users see text appear progressively, dramatically improving perceived responsiveness.

This is essential for chat interfaces where waiting 10+ seconds for long responses would feel unresponsive.

Key Concepts

  • Time to first token (TTFT): How quickly the first token appears after the request
  • Server-sent events (SSE): HTTP protocol for streaming
  • Delta chunks: Each stream message contains new tokens
  • Progressive rendering: Display text as it arrives
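
The concepts above can be seen in a minimal sketch: a simulated delta-chunk stream (standing in for a real API) where we measure TTFT and accumulate the deltas as a progressive renderer would. The `fake_token_stream` generator and its timings are illustrative assumptions, not a real API.

```python
import time

def fake_token_stream():
    """Simulated delta chunks; a real API would push these over SSE."""
    for token in ["Here", " is", " your", " response"]:
        time.sleep(0.05)  # stand-in for per-token network/model latency
        yield token

start = time.monotonic()
ttft = None
text = ""
for delta in fake_token_stream():
    if ttft is None:
        ttft = time.monotonic() - start  # time to first token
    text += delta  # a UI would display each delta as it arrives

print(f"TTFT: {ttft:.2f}s, full text: {text!r}")
```

Note that TTFT is measured at the first delta, long before the stream finishes: that gap is exactly the perceived-responsiveness win streaming buys you.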

Examples

Comparison
Streaming vs Non-Streaming UX
NON-STREAMING:
User sends message...
[                    ]  0s
[████                ]  3s    (waiting...)
[████████            ]  6s    (still waiting...)
[████████████        ]  9s    (nothing visible)
[████████████████████] 12s    "Here is your complete 500-word response..."
← User sees NOTHING until 12 seconds, then everything

STREAMING:
User sends message...
[█                   ]  0.5s → "Here"
[██                  ]  0.8s → "Here is"
[███                 ]  1.1s → "Here is your"
[████                ]  1.4s → "Here is your response"
...text continues appearing...
[████████████████████] 12s   → Complete!
← User sees text after 0.5 seconds, reads along

Same total time, dramatically better experience!
Implementation
Streaming with APIs
```python
# OpenAI streaming
from openai import OpenAI

client = OpenAI()
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a story"}],
    stream=True  # Enable streaming
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

# Claude streaming
import anthropic

client = anthropic.Anthropic()
with client.messages.stream(
    model="claude-3-opus",
    max_tokens=1000,
    messages=[{"role": "user", "content": "Write a poem"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

```javascript
// Frontend (JavaScript)
const response = await fetch('/api/chat', {
  method: 'POST',
  body: JSON.stringify({ message: userInput })
});
const reader = response.body.getReader();
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  displayChunk(new TextDecoder().decode(value));
}
```
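
The frontend loop above reads raw bytes; when the backend relays standard server-sent events, each event arrives as `data: ...` lines terminated by a blank line. Here is a minimal sketch of parsing that framing. The plain-text payloads and the `[DONE]` sentinel are assumptions for illustration; real APIs typically wrap each delta in a JSON object.

```python
def parse_sse(stream_bytes: bytes):
    """Yield the data payload of each server-sent event in a buffer."""
    for event in stream_bytes.decode().split("\n\n"):
        for line in event.splitlines():
            if line.startswith("data: "):
                payload = line[len("data: "):]
                if payload != "[DONE]":  # OpenAI-style end-of-stream sentinel
                    yield payload

# Hypothetical SSE bytes as they might arrive off the wire
raw = b"data: Hello\n\ndata:  world\n\ndata: [DONE]\n\n"
print("".join(parse_sse(raw)))  # -> "Hello world"
```

In production you would feed this parser incrementally as chunks arrive, keeping any incomplete trailing event buffered until its terminating blank line shows up.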

Interactive Exercise

Streaming Decision

Which scenarios should use streaming?

1. Chat interface for customer support
2. Background batch processing of 1000 documents
3. Code completion in an IDE
4. Async email generation (user not waiting)

Pro Tips
  • Always stream for user-facing chat interfaces
  • Non-streaming is simpler for background processing
  • Handle connection drops gracefully in stream handlers
  • Consider buffering for smoother word-by-word display
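
The last two tips can be sketched together: a display generator that buffers an uneven token stream into smoother word-by-word output, and surfaces a dropped connection instead of crashing. The chunk boundaries, `min_interval` pacing, and the `[connection lost]` marker are illustrative choices, not a prescribed API.

```python
import time

def buffered_display(chunks, min_interval=0.03):
    """Re-chunk an uneven token stream into smoother word-by-word output."""
    buffer = ""
    try:
        for chunk in chunks:
            buffer += chunk
            # Flush complete words; keep the trailing partial word buffered
            while " " in buffer:
                word, buffer = buffer.split(" ", 1)
                yield word + " "
                time.sleep(min_interval)  # pace the display
    except ConnectionError:
        # Connection dropped mid-stream: show what we have, mark truncation
        yield buffer + " [connection lost]"
        return
    if buffer:
        yield buffer  # final partial word

# Token boundaries rarely align with word boundaries; buffering hides that
out = "".join(buffered_display(["Hel", "lo wor", "ld again"], min_interval=0))
print(out)  # -> "Hello world again"
```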

Related Terms