
Max Tokens

Parameter: max_tokens (also known as "maximum length" or "output limit")

Definition

Max tokens sets the maximum number of tokens the model can generate in its response. When this limit is reached, generation stops—even mid-sentence. It's a hard ceiling on output length.

This parameter controls costs (you pay per token) and prevents runaway generations.

Key Concepts

  • Hard limit: Generation stops abruptly at max_tokens
  • Cost control: Limits maximum spend per request
  • Not a target: Model may stop earlier naturally
  • Context budget: Input + output must fit in the context window (see the sketch after this list)
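The context-budget point is easy to check in code. A minimal sketch, assuming the tiktoken package for counting GPT-4-style tokens; the 8,192-token window is just an illustrative value:

import tiktoken

CONTEXT_WINDOW = 8192  # e.g. the 8k GPT-4 variant (illustrative)
prompt = "Explain quantum computing in detail, covering qubits and entanglement."

# Count input tokens with the tokenizer used by GPT-4-class models
enc = tiktoken.encoding_for_model("gpt-4")
input_tokens = len(enc.encode(prompt))

# The largest max_tokens you can request without overflowing the window
room_for_output = CONTEXT_WINDOW - input_tokens
print(f"Input: {input_tokens} tokens, max_tokens can be at most {room_for_output}")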

Examples

Behavior
Max Tokens Effects
Prompt: "Explain quantum computing" max_tokens=50: "Quantum computing uses quantum mechanics principles like superposition and entanglement to process information. Unlike classical computers that use bits (0 or 1), quantum computers use qubits which can exist in multiple states simultane" ← CUT OFF mid-word! max_tokens=200: "Quantum computing uses quantum mechanics principles like superposition and entanglement to process information. Unlike classical computers that use bits (0 or 1), quantum computers use qubits which can exist in multiple states simultaneously. This allows quantum computers to solve certain problems exponentially faster than classical computers, particularly in areas like cryptography, optimization, and simulation of molecular systems." ← Complete response, stopped naturally max_tokens=1000: [Same as above - model finished naturally before limit] ← You still only pay for tokens actually generated
API Usage
Setting Max Tokens
# OpenAI (assumes the openai Python SDK and an OPENAI_API_KEY in the environment)
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a haiku"}],
    max_tokens=50,  # Haiku is short, don't need more
)

# Claude (assumes the anthropic Python SDK and an ANTHROPIC_API_KEY in the environment)
import anthropic
claude = anthropic.Anthropic()

response = claude.messages.create(
    model="claude-3-opus",
    max_tokens=4096,  # Required parameter for Claude
    messages=[...],   # your conversation messages
)

# Typical values by use case:
#   Short answer/classification: 50-100
#   Paragraph response:          200-500
#   Long-form content:           1000-2000
#   Code generation:             2000-4000
#   Maximum (varies by model):   4096-8192+

# Important: context window constraint
#   GPT-4 ships with 8k/32k/128k context windows
#   Input tokens + max_tokens ≤ context window
#   If the input is 7000 tokens in an 8k model,
#   max_tokens can be at most 1000
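To tell whether a response actually hit the ceiling, inspect the stop reason the API returns: the OpenAI SDK reports finish_reason == "length" and the Anthropic SDK reports stop_reason == "max_tokens" when the limit was reached. A minimal sketch, reusing the client and claude objects from the example above:

# OpenAI: finish_reason is "length" if max_tokens was hit, "stop" if the model finished
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    max_tokens=50,
)
if response.choices[0].finish_reason == "length":
    print("Output was truncated at max_tokens")

# Claude: stop_reason is "max_tokens" if the limit was hit, "end_turn" otherwise
claude_response = claude.messages.create(
    model="claude-3-opus",
    max_tokens=50,
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
if claude_response.stop_reason == "max_tokens":
    print("Output was truncated at max_tokens")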

Interactive Exercise

Choose Max Tokens

What max_tokens value would you set for each task?

1. Sentiment classification (positive/negative)
2. Blog post introduction paragraph
3. Full technical documentation page
4. Yes/No question answer

Pro Tips
  • Set max_tokens based on expected output, not maximum possible
  • Check finish_reason (stop_reason for Claude) to know whether the output was truncated; a recovery sketch follows this list
  • For Claude, max_tokens is required; for OpenAI it's optional
  • Leave a buffer for complete sentences (don't set the limit to exactly the number of tokens you expect)
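If you do detect truncation, one common recovery pattern is to feed the partial answer back into the conversation and ask the model to continue. A minimal sketch, reusing the client object from the API example above (the "Please continue." prompt is just one workable phrasing):

# Hypothetical recovery step: if the reply was cut off at max_tokens,
# append it to the conversation and request a continuation.
messages = [{"role": "user", "content": "Explain quantum computing"}]
response = client.chat.completions.create(
    model="gpt-4", messages=messages, max_tokens=200
)
reply = response.choices[0].message.content

if response.choices[0].finish_reason == "length":
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": "Please continue."})
    followup = client.chat.completions.create(
        model="gpt-4", messages=messages, max_tokens=200
    )
    reply += followup.choices[0].message.content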
