
Encoder-Decoder

Advanced [4/5]
Tags: Seq2Seq transformer · T5-style · Full transformer

Definition

Encoder-decoder is a transformer architecture with two components: an encoder that processes the full input bidirectionally, and a decoder that generates output autoregressively while attending to the encoder's representations.

This was the original transformer architecture from "Attention Is All You Need," designed for sequence-to-sequence tasks like translation.
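To make this concrete, here is a minimal sketch of running an off-the-shelf encoder-decoder model with the Hugging Face transformers library (assumed installed, with a PyTorch backend; "t5-small" is just a convenient small checkpoint):

# Sketch: translation with a T5-style encoder-decoder model.
# Assumes the Hugging Face `transformers` library with a PyTorch backend.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The encoder reads the entire input bidirectionally.
inputs = tokenizer("translate English to French: Hello, how are you?",
                   return_tensors="pt")

# The decoder generates the translation token by token, attending to the
# encoder's representations via cross-attention at every step.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))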

Key Concepts

  • Encoder: Bidirectional processing of input sequence
  • Decoder: Autoregressive generation of output
  • Cross-attention: Decoder attends to encoder representations (all three attention patterns are sketched in code below)
  • Separation: Clear distinction between input processing and output generation
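
The three attention patterns above differ only in what each position is allowed to see. A minimal PyTorch sketch (a real transformer uses separate attention modules per layer; one module is reused here purely for brevity):

import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

src = torch.randn(1, 5, d_model)  # 5 embedded input tokens
tgt = torch.randn(1, 3, d_model)  # 3 output tokens generated so far

# Encoder self-attention: no mask, so every input token sees every other.
enc_out, _ = attn(src, src, src)

# Decoder self-attention: a causal mask (True = blocked) hides the future.
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)
dec_out, _ = attn(tgt, tgt, tgt, attn_mask=causal_mask)

# Cross-attention: decoder states are the queries, encoder outputs are the
# keys and values, so every output position can see the full input.
cross_out, _ = attn(dec_out, enc_out, enc_out)
print(cross_out.shape)  # torch.Size([1, 3, 64])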

Examples

Architecture
Encoder-Decoder Structure
ENCODER-DECODER ARCHITECTURE:

INPUT: "Hello, how are you?"
                ↓
┌─────────────────────────────────┐
│             ENCODER             │
│  ┌───────────────────────────┐  │
│  │  Bidirectional Self-Attn  │  │  Each token sees
│  │  (sees all input tokens)  │  │  ALL other tokens
│  └───────────────────────────┘  │
│                ↓                │
│      [h₁, h₂, h₃, h₄, h₅]       │  Encoded representations
└─────────────────────────────────┘
                ↓  (cross-attention)
┌─────────────────────────────────┐
│             DECODER             │
│  ┌───────────────────────────┐  │
│  │   Causal Self-Attention   │  │  Output tokens see
│  │  (sees past output only)  │  │  only past output
│  └───────────────────────────┘  │
│                ↓                │
│  ┌───────────────────────────┐  │
│  │      Cross-Attention      │  │  Output attends to
│  │   (attends to encoder)    │  │  full input
│  └───────────────────────────┘  │
└─────────────────────────────────┘
                ↓
OUTPUT: "Bonjour, comment allez-vous?"
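
The same wiring is available as a single PyTorch module; a sketch with random tensors standing in for token embeddings plus positional encodings:

import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

src = torch.randn(1, 5, 64)  # encoder input: 5 source tokens
tgt = torch.randn(1, 4, 64)  # decoder input: target tokens so far

# Causal mask so decoder position i attends only to positions <= i.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(4)

# Internally: encoder self-attn, then decoder causal self-attn + cross-attn.
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([1, 4, 64])
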
Comparison
When to Use Each Architecture
ARCHITECTURE COMPARISON:

ENCODER-DECODER (T5, BART, mBART):
  ✓ Translation: "Hello" → "Hola"
  ✓ Summarization: long text → short summary
  ✓ Question answering with context
  ✓ Text-to-text tasks

  Advantages:
  - Full bidirectional understanding of the input
  - Explicit separation of understanding vs. generation
  - Good for tasks with distinct input/output

  Disadvantages:
  - More complex architecture
  - Harder to scale
  - Less flexible for chat/generation

DECODER-ONLY (GPT, Claude, LLaMA):
  ✓ Chat and conversation
  ✓ General text generation
  ✓ Reasoning tasks
  ✓ Code generation

  Advantages:
  - Simpler architecture
  - Scales better
  - Single unified model
  - Better few-shot learning

WHY DECODER-ONLY WON:
  1. Simpler to scale
  2. More flexible (can do everything)
  3. Better at in-context learning
  4. Easier to train with more data
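
One practical upshot of the comparison: an encoder-decoder model keeps input and output as two separate sequences joined by cross-attention, while a decoder-only model folds everything into one left-to-right sequence. A hedged sketch (the prompt wording is illustrative, not taken from any particular model card):

# Illustrative only: the same translation task framed for each architecture.

# Encoder-decoder (T5-style): input and output are distinct sequences.
encoder_input = "translate English to French: Hello, how are you?"
decoder_output = "Bonjour, comment allez-vous?"  # generated token by token

# Decoder-only (GPT-style): one sequence; the "input" is just left context.
prompt = (
    "Translate English to French.\n"
    "English: Hello, how are you?\n"
    "French:"
)
# The model simply continues the sequence with the translation.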

Interactive Exercise

Choose Architecture

For each task, which architecture would be most natural?

A. Translating English to French
B. Writing a creative story from a prompt
C. Summarizing a news article

Pro Tips
  • T5 treats everything as text-to-text, even classification (see the sketch after this list)
  • Cross-attention is what connects the encoder and the decoder
  • Modern decoder-only models can do translation too, just less efficiently
  • Encoder-decoder excels when input/output have different structures
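
The first tip is easy to see in code: T5 performs classification by generating label words as text. A sketch assuming Hugging Face transformers (the "cola sentence:" prefix and the acceptable/unacceptable labels follow the original T5 training mixture; treat the exact strings as assumptions):

# Sketch: classification as text-to-text with T5.
# Assumes Hugging Face `transformers`; the prefix and labels follow the
# T5 paper's CoLA (grammatical acceptability) setup.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Grammatical-acceptability classification, phrased as text generation.
inputs = tokenizer("cola sentence: The cat sat on the mat.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)

# The "class label" is literally generated text: "acceptable" or "unacceptable".
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))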
