
Max Tokens

Parameter: max_tokens (also known as "maximum length" or "output limit")

Definition

Max tokens sets the maximum number of tokens the model can generate in its response. When this limit is reached, generation stops—even mid-sentence. It's a hard ceiling on output length.

This parameter controls costs (you pay per token) and prevents runaway generations.

Key Concepts

  • Hard limit: Generation stops abruptly at max_tokens
  • Cost control: Limits maximum spend per request
  • Not a target: Model may stop earlier naturally
  • Context budget: Input + output must fit in the context window (see the sketch after this list)
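The context-budget point is easy to check in code. A minimal sketch, assuming the tiktoken package for counting GPT-4-style tokens; the 8,192-token window is just an illustrative value:

import tiktoken

CONTEXT_WINDOW = 8192  # e.g. the 8k GPT-4 variant (illustrative)
prompt = "Explain quantum computing in detail, covering qubits and entanglement."

# Count input tokens with the tokenizer used by GPT-4-class models
enc = tiktoken.encoding_for_model("gpt-4")
input_tokens = len(enc.encode(prompt))

# The largest max_tokens you can request without overflowing the window
room_for_output = CONTEXT_WINDOW - input_tokens
print(f"Input: {input_tokens} tokens, max_tokens can be at most {room_for_output}")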

Examples

Behavior
Max Tokens Effects
Prompt: "Explain quantum computing" max_tokens=50: "Quantum computing uses quantum mechanics principles like superposition and entanglement to process information. Unlike classical computers that use bits (0 or 1), quantum computers use qubits which can exist in multiple states simultane" ← CUT OFF mid-word! max_tokens=200: "Quantum computing uses quantum mechanics principles like superposition and entanglement to process information. Unlike classical computers that use bits (0 or 1), quantum computers use qubits which can exist in multiple states simultaneously. This allows quantum computers to solve certain problems exponentially faster than classical computers, particularly in areas like cryptography, optimization, and simulation of molecular systems." ← Complete response, stopped naturally max_tokens=1000: [Same as above - model finished naturally before limit] ← You still only pay for tokens actually generated
API Usage
Setting Max Tokens
# OpenAI (assumes the openai Python SDK and an OPENAI_API_KEY in the environment)
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a haiku"}],
    max_tokens=50,  # Haiku is short, don't need more
)

# Claude (assumes the anthropic Python SDK and an ANTHROPIC_API_KEY in the environment)
import anthropic
claude = anthropic.Anthropic()

response = claude.messages.create(
    model="claude-3-opus",
    max_tokens=4096,  # Required parameter for Claude
    messages=[...],   # your conversation messages
)

# Typical values by use case:
#   Short answer/classification: 50-100
#   Paragraph response:          200-500
#   Long-form content:           1000-2000
#   Code generation:             2000-4000
#   Maximum (varies by model):   4096-8192+

# Important: context window constraint
#   GPT-4 ships with 8k/32k/128k context windows
#   Input tokens + max_tokens ≤ context window
#   If the input is 7000 tokens in an 8k model,
#   max_tokens can be at most 1000
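To tell whether a response actually hit the ceiling, inspect the stop reason the API returns: the OpenAI SDK reports finish_reason == "length" and the Anthropic SDK reports stop_reason == "max_tokens" when the limit was reached. A minimal sketch, reusing the client and claude objects from the example above:

# OpenAI: finish_reason is "length" if max_tokens was hit, "stop" if the model finished
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=50,
)
if response.choices[0].finish_reason == "length":
    print("Output was truncated at max_tokens")

# Claude: stop_reason is "max_tokens" if the limit was hit, "end_turn" otherwise
claude_response = claude.messages.create(
    model="claude-3-opus",
    max_tokens=50,
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
if claude_response.stop_reason == "max_tokens":
    print("Output was truncated at max_tokens")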

Interactive Exercise

Choose Max Tokens

What max_tokens value would you set for each task?

1. Sentiment classification (positive/negative)
2. Blog post introduction paragraph
3. Full technical documentation page
4. Yes/No question answer

Pro Tips
  • Set max_tokens based on expected output, not maximum possible
  • Check finish_reason (stop_reason for Claude) to know whether the output was truncated; a recovery sketch follows this list
  • For Claude, max_tokens is required; for OpenAI it's optional
  • Leave a buffer for complete sentences (don't set the limit to exactly the number of tokens you expect)
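If you do detect truncation, one common recovery pattern is to feed the partial answer back into the conversation and ask the model to continue. A minimal sketch, reusing the client object from the API example above (the "Please continue." prompt is just one workable phrasing):

# Hypothetical recovery step: if the reply was cut off at max_tokens,
# append it to the conversation and request a continuation.
messages = [{"role": "user", "content": "Explain quantum computing"}]
response = client.chat.completions.create(
    model="gpt-4", messages=messages, max_tokens=200
)
reply = response.choices[0].message.content

if response.choices[0].finish_reason == "length":
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": "Please continue."})
    followup = client.chat.completions.create(
        model="gpt-4", messages=messages, max_tokens=200
    )
    reply += followup.choices[0].message.content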
