
Decoder-Only

Difficulty: Advanced [4/5]
Tags: Autoregressive · Transformer · GPT-style · Causal LM

Definition

Decoder-only is a transformer architecture in which the model generates text one token at a time, attending only to previous tokens (causal attention). It is the architecture used by GPT, Claude, LLaMA, and most modern LLMs.

The "decoder" name comes from the original transformer paper, where this component generated output sequences.

Key Concepts

  • Causal attention: Each token attends only to itself and earlier tokens (see the sketch after this list)
  • Autoregressive: Generates one token at a time, left-to-right
  • Unified input/output: Same model handles both context and generation
  • Scalable: Architecture scales well to very large models
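
To make causal attention concrete, here is a minimal single-head sketch in NumPy. The dimensions and random weights are toy stand-ins for a trained model's learned parameters, not any real implementation:

    import numpy as np

    def causal_self_attention(x):
        # Toy single-head self-attention with a causal mask.
        # x: (seq_len, d_model) token embeddings.
        seq_len, d = x.shape
        rng = np.random.default_rng(0)
        # Random projections stand in for learned Wq/Wk/Wv parameters.
        Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv

        scores = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len)
        # Causal mask: position i may attend only to positions j <= i.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        scores = np.where(mask, scores, -np.inf)

        # Row-wise softmax; masked positions get exactly zero weight.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    tokens = np.random.default_rng(1).normal(size=(3, 8))  # e.g. "The cat sat"
    out = causal_self_attention(tokens)
    print(out.shape)  # (3, 8) — row i was computed from tokens 0..i only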

Examples

Architecture
Decoder-Only vs Other Architectures
TRANSFORMER ARCHITECTURE COMPARISON:

ENCODER-ONLY (BERT):
┌─────────────────────────────────────┐
│ Input: "The [MASK] sat on the mat"  │
│                 ↓                   │
│      Bidirectional Attention        │
│         (sees all tokens)           │
│                 ↓                   │
│ Output: Fill in [MASK] → "cat"      │
└─────────────────────────────────────┘
Use: Classification, embeddings

DECODER-ONLY (GPT, Claude, LLaMA):
┌─────────────────────────────────────┐
│ Input: "The cat sat on the"         │
│                 ↓                   │
│          Causal Attention           │
│     (each token sees only past)     │
│                 ↓                   │
│ Output: Next token → "mat"          │
└─────────────────────────────────────┘
Use: Text generation, chat, reasoning

ENCODER-DECODER (T5, BART):
┌─────────────────────────────────────┐
│ Encoder: Process full input         │
│          ↓ (cross-attention)        │
│ Decoder: Generate output            │
└─────────────────────────────────────┘
Use: Translation, summarization
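
These three families map onto different model classes in common tooling. As an illustrative sketch, assuming the Hugging Face transformers library is installed and the (publicly available) checkpoints can be downloaded:

    from transformers import (
        AutoModelForMaskedLM,    # encoder-only family, e.g. BERT
        AutoModelForCausalLM,    # decoder-only family, e.g. GPT-2
        AutoModelForSeq2SeqLM,   # encoder-decoder family, e.g. T5
    )

    # Each Auto class wires up the matching architecture and output head.
    encoder_only    = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")
    encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")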
Attention Pattern
Causal Attention Mask
CAUSAL (DECODER-ONLY) ATTENTION:

Generating: "The cat sat"

ATTENTION MASK:

        The  cat  sat
The   [  1    0    0 ]   ← "The" sees only itself
cat   [  1    1    0 ]   ← "cat" sees "The" + itself
sat   [  1    1    1 ]   ← "sat" sees all previous

1 = can attend, 0 = masked (cannot see)

WHY CAUSAL?
- Prevents "cheating" during training
- Model can't peek at future tokens
- Same during training and inference
- Enables efficient generation

GENERATION PROCESS:
Step 1: "The"         → predict next → "cat"
Step 2: "The cat"     → predict next → "sat"
Step 3: "The cat sat" → predict next → "on"
...

Each step, the model only sees what it has generated so far.
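
The generation process above is a short loop in code. In this sketch, next_token is a hypothetical stand-in for a real model's forward pass plus decoding; an actual LLM would return a probability distribution over its whole vocabulary rather than a lookup:

    def next_token(context):
        # Hypothetical stand-in for a trained LM: a real model would
        # score every vocabulary item given the context so far.
        continuations = {
            ("The",): "cat",
            ("The", "cat"): "sat",
            ("The", "cat", "sat"): "on",
        }
        return continuations.get(tuple(context), "<eos>")

    tokens = ["The"]
    while True:
        tok = next_token(tokens)   # the model sees only past tokens
        if tok == "<eos>":
            break
        tokens.append(tok)         # feed the new token back as context

    print(" ".join(tokens))        # The cat sat on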

Interactive Exercise

Understand Causal Attention

In a decoder-only model generating "I love AI", when predicting the token "AI", which tokens can it attend to?

Answer: only the earlier tokens "I" and "love" — the causal mask guarantees the model can never see future tokens.

Pro Tips
  • Decoder-only dominates because it is simpler than encoder-decoder designs and scales better
  • The "prompt" in GPT-style models is just prefilled decoder input
  • KV caching makes decoder-only generation efficient (see the sketch after this list)
  • Despite the name, modern decoder-only models do "encoding" too — they build contextual representations of the prompt
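
A rough sketch of the KV-caching idea from the list above (NumPy, toy dimensions; real implementations cache keys and values per layer and per attention head):

    import numpy as np

    d = 8                                    # toy model width
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    k_cache, v_cache = [], []                # one cached K/V row per past token

    def attend_to_new_token(x_new):
        # Attention for ONE new token. Its key/value are computed once and
        # appended; earlier tokens' K/V come from the cache instead of being
        # recomputed, so each step costs O(t) work rather than O(t^2).
        k_cache.append(x_new @ Wk)
        v_cache.append(x_new @ Wv)
        K = np.stack(k_cache)                # (t, d) keys, mostly from cache
        V = np.stack(v_cache)                # (t, d) values, mostly from cache
        q = x_new @ Wq                       # query for the new position only
        scores = K @ q / np.sqrt(d)
        # No causal mask needed: the cache only ever contains past tokens.
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                         # (d,) attention output

    for x in rng.normal(size=(4, d)):        # feed 4 tokens one at a time
        out = attend_to_new_token(x)
    print(len(k_cache))                      # 4 cached key vectors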
