Model Architecture / Generation

Autoregressive

Difficulty: Intermediate (3/5)
Also known as: AR model, left-to-right generation, sequential generation

Definition

Autoregressive describes models that generate output one element at a time, where each new element depends on all previously generated elements. The model "regresses" on its own previous outputs, hence "auto" (self) + "regressive."

All major LLMs (GPT, Claude, LLaMA) are autoregressive, generating text token by token.
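
Formally, an autoregressive model factorizes the joint probability of a sequence via the chain rule, which is exactly the probability chain shown in the example below:

P(t₁, t₂, …, tₙ) = P(t₁) × P(t₂ | t₁) × P(t₃ | t₁, t₂) × … × P(tₙ | t₁, …, tₙ₋₁)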

Key Concepts

  • Sequential: One token at a time, left to right
  • Conditional: Each token conditioned on all previous
  • Sampling: Probability distribution → choose next token (see the sketch after this list)
  • No backtracking: Can't revise earlier tokens
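
A minimal sketch of the sampling step, assuming PyTorch as the numerics library (sample_next_token is a hypothetical helper name), with a temperature knob that sharpens or flattens the distribution:

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0) -> int:
    # Scale the logits, convert them to a probability distribution, draw one token.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Toy example: a 5-token vocabulary.
logits = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])
print(sample_next_token(logits))                   # usually token 0, but not always
print(sample_next_token(logits, temperature=0.1))  # near-greedy: almost always token 0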

Examples

Process
Autoregressive Generation
AUTOREGRESSIVE GENERATION:

Prompt: "The weather today is"

Step 1: P(token₁ | "The weather today is")                → Sample "sunny" (p=0.3)
Step 2: P(token₂ | "The weather today is sunny")          → Sample "and"   (p=0.4)
Step 3: P(token₃ | "The weather today is sunny and")      → Sample "warm"  (p=0.35)
Step 4: P(token₄ | "The weather today is sunny and warm") → Sample "."     (p=0.5)

Final: "The weather today is sunny and warm."

PROBABILITY CHAIN:

P(sequence) = P(t₁ | prompt) × P(t₂ | prompt, t₁) × P(t₃ | prompt, t₁, t₂) × ...

Each token sees EVERYTHING before it!

GENERATION LOOP:

while not done:
    logits = model(context)        # Full forward pass over the current context
    probs = softmax(logits[-1])    # Distribution at the last position
    next_token = sample(probs)     # Choose one token
    context.append(next_token)     # Add it to the context
    if next_token == EOS: break    # Stop at end-of-sequence
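
A runnable version of the loop above, as a minimal sketch assuming the Hugging Face transformers library with GPT-2 as a stand-in model (the variable names mirror the pseudocode):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

context = tokenizer.encode("The weather today is", return_tensors="pt")
for _ in range(10):                                  # cap the length rather than rely on EOS
    with torch.no_grad():
        logits = model(context).logits               # full forward pass over the context
    probs = torch.softmax(logits[0, -1], dim=-1)     # distribution at the last position
    next_token = torch.multinomial(probs, 1)         # sample one token
    context = torch.cat([context, next_token.unsqueeze(0)], dim=1)
    if next_token.item() == tokenizer.eos_token_id:  # stop at end-of-sequence
        break
print(tokenizer.decode(context[0]))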
Comparison
Autoregressive vs Non-Autoregressive
AUTOREGRESSIVE (GPT, Claude, LLaMA):

┌───────────────────────────────────────┐
│ Input: "Translate: Hello"             │
│   ↓                                   │
│ Step 1: Generate "Hola" (sees input)  │
│   ↓                                   │
│ Step 2: Generate "." (sees "Hola")    │
│   ↓                                   │
│ Output: "Hola."                       │
└───────────────────────────────────────┘

Pro: High quality, flexible length
Con: Slow (sequential)

NON-AUTOREGRESSIVE (some translation models):

┌───────────────────────────────────────┐
│ Input: "Translate: Hello"             │
│   ↓                                   │
│ Predict all positions at once:        │
│ [Hola] [.] [PAD] [PAD]                │
│   ↓                                   │
│ Output: "Hola."                       │
└───────────────────────────────────────┘

Pro: Fast (parallel)
Con: Lower quality, fixed-length issues

DIFFUSION MODELS (Midjourney, DALL-E):
Not autoregressive! Generate all at once through iterative denoising.

WHY AUTOREGRESSIVE DOMINATES FOR TEXT:
- Language is naturally sequential
- Each word depends heavily on context
- Flexible output length
- Can "think out loud" (chain of thought)
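
One nuance the comparison hides: even an autoregressive model can score a known sequence in parallel, since a single forward pass yields every conditional at once (this is why training with teacher forcing is fast while generation is slow). A sketch reusing model and tokenizer from the generation example above:

ids = tokenizer.encode("The weather today is sunny", return_tensors="pt")
with torch.no_grad():
    logits = model(ids).logits                       # (1, seq_len, vocab) in one pass
log_probs = torch.log_softmax(logits, dim=-1)
# log P(tᵢ | t₁..tᵢ₋₁) for every position simultaneously:
token_lp = log_probs[0, :-1].gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
sequence_log_prob = token_lp.sum()                   # log P(sequence)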

Interactive Exercise

Understand Generation Order

An autoregressive LLM generates "I love cats" from prompt "I". What's the probability formula for this sequence?
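
One acceptable answer, assuming the tokenizer splits the completion into the tokens "love" and "cats" (the prompt "I" is given, so it contributes no factor):

P("I love cats") = P("love" | "I") × P("cats" | "I love")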

Pro Tips
  • Autoregressive generation is inherently sequential → limits speed
  • KV caching stores past attention keys/values, making each step O(n) instead of O(n²) (see the sketch after this list)
  • Speculative decoding uses a small draft model to propose tokens that the large model verifies in parallel
  • Parallel sampling (batching multiple continuations) improves throughput, though not per-sequence latency
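
A minimal sketch of the KV-caching tip, again assuming the transformers model and context from the generation example: after one pass over the full prompt, each later step feeds in only the newest token and reuses the cached keys/values.

out = model(context, use_cache=True)                 # full pass; returns the KV cache
past = out.past_key_values
next_token = torch.multinomial(torch.softmax(out.logits[0, -1], dim=-1), 1)
for _ in range(10):
    # Each step runs on a single token; attention reads older keys/values from the cache.
    out = model(next_token.view(1, 1), past_key_values=past, use_cache=True)
    past = out.past_key_values
    next_token = torch.multinomial(torch.softmax(out.logits[0, -1], dim=-1), 1)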

Related Terms