Model Architecture / Generation

Autoregressive

Difficulty: Intermediate (3/5)
Also known as: AR model, left-to-right generation, sequential generation

Definition

Autoregressive describes models that generate output one element at a time, where each new element depends on all previously generated elements. The model "regresses" on its own previous outputs, hence "auto" (self) + "regressive."

All major LLMs (GPT, Claude, LLaMA) are autoregressive, generating text token by token.
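
Formally, an autoregressive model factorizes the joint probability of a sequence via the chain rule, which is exactly the probability chain shown in the example below:

P(t₁, t₂, …, tₙ) = P(t₁) × P(t₂ | t₁) × P(t₃ | t₁, t₂) × … × P(tₙ | t₁, …, tₙ₋₁)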

Key Concepts

  • Sequential: One token at a time, left to right
  • Conditional: Each token conditioned on all previous
  • Sampling: Probability distribution → choose next token (see the sketch after this list)
  • No backtracking: Can't revise earlier tokens
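
A minimal sketch of the sampling step, assuming PyTorch as the numerics library (sample_next_token is a hypothetical helper name), with a temperature knob that sharpens or flattens the distribution:

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    # Scale the logits, convert them to a probability distribution, draw one token.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Toy example: a 5-token vocabulary.
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_next_token(logits))                   # usually token 0, but not always
print(sample_next_token(logits, temperature=0.1))  # near-greedy: almost always token 0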

Examples

Process
Autoregressive Generation
AUTOREGRESSIVE GENERATION:

Prompt: "The weather today is"

Step 1: P(token₁ | "The weather today is")                → Sample "sunny" (p=0.3)
Step 2: P(token₂ | "The weather today is sunny")          → Sample "and"   (p=0.4)
Step 3: P(token₃ | "The weather today is sunny and")      → Sample "warm"  (p=0.35)
Step 4: P(token₄ | "The weather today is sunny and warm") → Sample "."     (p=0.5)

Final: "The weather today is sunny and warm."

PROBABILITY CHAIN:

P(sequence) = P(t₁ | prompt) × P(t₂ | prompt, t₁) × P(t₃ | prompt, t₁, t₂) × ...

Each token sees EVERYTHING before it!

GENERATION LOOP:

while not done:
    logits = model(context)        # Full forward pass over the current context
    probs = softmax(logits[-1])    # Distribution at the last position
    next_token = sample(probs)     # Choose one token
    context.append(next_token)     # Add it to the context
    if next_token == EOS: break    # Stop at end-of-sequence
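
A runnable version of the loop above, as a minimal sketch assuming the Hugging Face transformers library with GPT-2 as a stand-in model (the variable names mirror the pseudocode):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = tokenizer.encode("The weather today is", return_tensors="pt")
for _ in range(10):                                  # cap the length rather than rely on EOS
    with torch.no_grad():
        logits = model(context).logits               # full forward pass over the context
    probs = torch.softmax(logits[0, -1], dim=-1)     # distribution at the last position
    next_token = torch.multinomial(probs, 1)         # sample one token
    context = torch.cat([context, next_token.unsqueeze(0)], dim=1)
    if next_token.item() == tokenizer.eos_token_id:  # stop at end-of-sequence
        break
print(tokenizer.decode(context[0]))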
Comparison
Autoregressive vs Non-Autoregressive
AUTOREGRESSIVE (GPT, Claude, LLaMA):

┌───────────────────────────────────────┐
│ Input: "Translate: Hello"             │
│   ↓                                   │
│ Step 1: Generate "Hola" (sees input)  │
│   ↓                                   │
│ Step 2: Generate "." (sees "Hola")    │
│   ↓                                   │
│ Output: "Hola."                       │
└───────────────────────────────────────┘

Pro: High quality, flexible length
Con: Slow (sequential)

NON-AUTOREGRESSIVE (some translation models):

┌───────────────────────────────────────┐
│ Input: "Translate: Hello"             │
│   ↓                                   │
│ Predict all positions at once:        │
│ [Hola] [.] [PAD] [PAD]                │
│   ↓                                   │
│ Output: "Hola."                       │
└───────────────────────────────────────┘

Pro: Fast (parallel)
Con: Lower quality, fixed-length issues

DIFFUSION MODELS (Midjourney, DALL-E):
Not autoregressive! Generate all at once through iterative denoising.

WHY AUTOREGRESSIVE DOMINATES FOR TEXT:
- Language is naturally sequential
- Each word depends heavily on context
- Flexible output length
- Can "think out loud" (chain of thought)
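
One nuance the comparison hides: even an autoregressive model can score a known sequence in parallel, since a single forward pass yields every conditional at once (this is why training with teacher forcing is fast while generation is slow). A sketch reusing model and tokenizer from the generation example above:

ids = tokenizer.encode("The weather today is sunny", return_tensors="pt")
with torch.no_grad():
    logits = model(ids).logits                       # (1, seq_len, vocab) in one pass
log_probs = torch.log_softmax(logits, dim=-1)
# log P(tᵢ | t₁..tᵢ₋₁) for every position simultaneously:
token_lp = log_probs[0, :-1].gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
sequence_log_prob = token_lp.sum()                   # log P(sequence)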

Interactive Exercise

Understand Generation Order

An autoregressive LLM generates "I love cats" from prompt "I". What's the probability formula for this sequence?
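
One acceptable answer, assuming the tokenizer splits the completion into the tokens "love" and "cats" (the prompt "I" is given, so it contributes no factor):

P("I love cats") = P("love" | "I") × P("cats" | "I love")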

Pro Tips
  • Autoregressive generation is inherently sequential → limits speed
  • KV caching stores past attention keys/values, making each step O(n) instead of O(n²) (see the sketch after this list)
  • Speculative decoding uses a small draft model to propose tokens that the large model verifies in parallel
  • Parallel sampling (batching multiple continuations) improves throughput, though not per-sequence latency
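
A minimal sketch of the KV-caching tip, again assuming the transformers model and context from the generation example: after one pass over the full prompt, each later step feeds in only the newest token and reuses the cached keys/values.

out = model(context, use_cache=True)                 # full pass; returns the KV cache
past = out.past_key_values
next_token = torch.multinomial(torch.softmax(out.logits[0, -1], dim=-1), 1)
for _ in range(10):
    # Each step runs on a single token; attention reads older keys/values from the cache.
    out = model(next_token.view(1, 1), past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_token = torch.multinomial(torch.softmax(out.logits[0, -1], dim=-1), 1)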

Related Terms