
Decoder-Only

Difficulty: Advanced [4/5]
Tags: Autoregressive · Transformer · GPT-style · Causal LM

Definition

Decoder-only is a transformer architecture in which the model generates text one token at a time, attending only to previous tokens (causal attention). It is the architecture used by GPT, Claude, LLaMA, and most modern LLMs.

The "decoder" name comes from the original transformer paper, where this component generated output sequences.

Key Concepts

  • Causal attention: Each token attends only to itself and earlier tokens (see the sketch after this list)
  • Autoregressive: Generates one token at a time, left-to-right
  • Unified input/output: Same model handles both context and generation
  • Scalable: Architecture scales well to very large models
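
To make causal attention concrete, here is a minimal single-head sketch in NumPy. The dimensions and random weights are toy stand-ins for a trained model's learned parameters, not any real implementation:

    import numpy as np

    def causal_self_attention(x):
        # Toy single-head self-attention with a causal mask.
        # x: (seq_len, d_model) token embeddings.
        seq_len, d = x.shape
        rng = np.random.default_rng(0)
        # Random projections stand in for learned Wq/Wk/Wv parameters.
        Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv

        scores = Q @ K.T / np.sqrt(d)                  # (seq_len, seq_len)
        # Causal mask: position i may attend only to positions j <= i.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        scores = np.where(mask, scores, -np.inf)

        # Row-wise softmax; masked positions get exactly zero weight.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    tokens = np.random.default_rng(1).normal(size=(3, 8))  # e.g. "The cat sat"
    out = causal_self_attention(tokens)
    print(out.shape)  # (3, 8) — row i was computed from tokens 0..i only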

Examples

Architecture
Decoder-Only vs Other Architectures
TRANSFORMER ARCHITECTURE COMPARISON:

ENCODER-ONLY (BERT):
┌─────────────────────────────────────┐
│ Input: "The [MASK] sat on the mat"  │
│                 ↓                   │
│      Bidirectional Attention        │
│         (sees all tokens)           │
│                 ↓                   │
│ Output: Fill in [MASK] → "cat"      │
└─────────────────────────────────────┘
Use: Classification, embeddings

DECODER-ONLY (GPT, Claude, LLaMA):
┌─────────────────────────────────────┐
│ Input: "The cat sat on the"         │
│                 ↓                   │
│          Causal Attention           │
│     (each token sees only past)     │
│                 ↓                   │
│ Output: Next token → "mat"          │
└─────────────────────────────────────┘
Use: Text generation, chat, reasoning

ENCODER-DECODER (T5, BART):
┌─────────────────────────────────────┐
│ Encoder: Process full input         │
│          ↓ (cross-attention)        │
│ Decoder: Generate output            │
└─────────────────────────────────────┘
Use: Translation, summarization
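
These three families map onto different model classes in common tooling. As an illustrative sketch, assuming the Hugging Face transformers library is installed and the (publicly available) checkpoints can be downloaded:

    from transformers import (
        AutoModelForMaskedLM,    # encoder-only family, e.g. BERT
        AutoModelForCausalLM,    # decoder-only family, e.g. GPT-2
        AutoModelForSeq2SeqLM,   # encoder-decoder family, e.g. T5
    )

    # Each Auto class wires up the matching architecture and output head.
    encoder_only    = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
    decoder_only    = AutoModelForCausalLM.from_pretrained("gpt2")
    encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")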
Attention Pattern
Causal Attention Mask
CAUSAL (DECODER-ONLY) ATTENTION:

Generating: "The cat sat"

ATTENTION MASK:

        The  cat  sat
The   [  1    0    0 ]   ← "The" sees only itself
cat   [  1    1    0 ]   ← "cat" sees "The" + itself
sat   [  1    1    1 ]   ← "sat" sees all previous

1 = can attend, 0 = masked (cannot see)

WHY CAUSAL?
- Prevents "cheating" during training
- Model can't peek at future tokens
- Same during training and inference
- Enables efficient generation

GENERATION PROCESS:
Step 1: "The"         → predict next → "cat"
Step 2: "The cat"     → predict next → "sat"
Step 3: "The cat sat" → predict next → "on"
...

Each step, the model only sees what it has generated so far.
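
The generation process above is a short loop in code. In this sketch, next_token is a hypothetical stand-in for a real model's forward pass plus decoding; an actual LLM would return a probability distribution over its whole vocabulary rather than a lookup:

    def next_token(context):
        # Hypothetical stand-in for a trained LM: a real model would
        # score every vocabulary item given the context so far.
        continuations = {
            ("The",): "cat",
            ("The", "cat"): "sat",
            ("The", "cat", "sat"): "on",
        }
        return continuations.get(tuple(context), "<eos>")

    tokens = ["The"]
    while True:
        tok = next_token(tokens)   # the model sees only past tokens
        if tok == "<eos>":
            break
        tokens.append(tok)         # feed the new token back as context

    print(" ".join(tokens))        # The cat sat on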

Interactive Exercise

Understand Causal Attention

In a decoder-only model generating "I love AI", when predicting the token "AI", which tokens can it attend to?

Answer: only the earlier tokens "I" and "love" — the causal mask guarantees the model can never see future tokens.

Pro Tips
  • Decoder-only dominates because it is simpler than encoder-decoder designs and scales better
  • The "prompt" in GPT-style models is just prefilled decoder input
  • KV caching makes decoder-only generation efficient (see the sketch after this list)
  • Despite the name, modern decoder-only models do "encoding" too — they build contextual representations of the prompt
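
A rough sketch of the KV-caching idea from the list above (NumPy, toy dimensions; real implementations cache keys and values per layer and per attention head):

    import numpy as np

    d = 8                                    # toy model width
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    k_cache, v_cache = [], []                # one cached K/V row per past token

    def attend_to_new_token(x_new):
        # Attention for ONE new token. Its key/value are computed once and
        # appended; earlier tokens' K/V come from the cache instead of being
        # recomputed, so each step costs O(t) work rather than O(t^2).
        k_cache.append(x_new @ Wk)
        v_cache.append(x_new @ Wv)
        K = np.stack(k_cache)                # (t, d) keys, mostly from cache
        V = np.stack(v_cache)                # (t, d) values, mostly from cache
        q = x_new @ Wq                       # query for the new position only
        scores = K @ q / np.sqrt(d)
        # No causal mask needed: the cache only ever contains past tokens.
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V                         # (d,) attention output

    for x in rng.normal(size=(4, d)):        # feed 4 tokens one at a time
        out = attend_to_new_token(x)
    print(len(k_cache))                      # 4 cached key vectors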
