
Encoder-Decoder

Advanced [4/5]
Tags: Seq2Seq transformer · T5-style · Full transformer

Definition

Encoder-decoder is a transformer architecture with two components: an encoder that processes the full input bidirectionally, and a decoder that generates output autoregressively while attending to the encoder's representations.

This was the original transformer architecture from "Attention Is All You Need," designed for sequence-to-sequence tasks like translation.
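To make this concrete, here is a minimal sketch of running an off-the-shelf encoder-decoder model with the Hugging Face transformers library (assumed installed, with a PyTorch backend; "t5-small" is just a convenient small checkpoint):

# Sketch: translation with a T5-style encoder-decoder model.
# Assumes the Hugging Face `transformers` library with a PyTorch backend.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The encoder reads the entire input bidirectionally.
inputs = tokenizer("translate English to French: Hello, how are you?",
                   return_tensors="pt")

# The decoder generates the translation token by token, attending to the
# encoder's representations via cross-attention at every step.
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))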

Key Concepts

  • Encoder: Bidirectional processing of input sequence
  • Decoder: Autoregressive generation of output
  • Cross-attention: Decoder attends to encoder representations (all three attention patterns are sketched in code below)
  • Separation: Clear distinction between input processing and output generation
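
The three attention patterns above differ only in what each position is allowed to see. A minimal PyTorch sketch (a real transformer uses separate attention modules per layer; one module is reused here purely for brevity):

import torch
import torch.nn as nn

d_model, n_heads = 64, 4
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

src = torch.randn(1, 5, d_model)  # 5 embedded input tokens
tgt = torch.randn(1, 3, d_model)  # 3 output tokens generated so far

# Encoder self-attention: no mask, so every input token sees every other.
enc_out, _ = attn(src, src, src)

# Decoder self-attention: a causal mask (True = blocked) hides the future.
causal_mask = torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)
dec_out, _ = attn(tgt, tgt, tgt, attn_mask=causal_mask)

# Cross-attention: decoder states are the queries, encoder outputs are the
# keys and values, so every output position can see the full input.
cross_out, _ = attn(dec_out, enc_out, enc_out)
print(cross_out.shape)  # torch.Size([1, 3, 64])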

Examples

Architecture
Encoder-Decoder Structure
ENCODER-DECODER ARCHITECTURE:

INPUT: "Hello, how are you?"
                ↓
┌─────────────────────────────────┐
│             ENCODER             │
│  ┌───────────────────────────┐  │
│  │  Bidirectional Self-Attn  │  │  Each token sees
│  │  (sees all input tokens)  │  │  ALL other tokens
│  └───────────────────────────┘  │
│                ↓                │
│      [h₁, h₂, h₃, h₄, h₅]       │  Encoded representations
└─────────────────────────────────┘
                ↓  (cross-attention)
┌─────────────────────────────────┐
│             DECODER             │
│  ┌───────────────────────────┐  │
│  │   Causal Self-Attention   │  │  Output tokens see
│  │  (sees past output only)  │  │  only past output
│  └───────────────────────────┘  │
│                ↓                │
│  ┌───────────────────────────┐  │
│  │      Cross-Attention      │  │  Output attends to
│  │   (attends to encoder)    │  │  full input
│  └───────────────────────────┘  │
└─────────────────────────────────┘
                ↓
OUTPUT: "Bonjour, comment allez-vous?"
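
The same wiring is available as a single PyTorch module; a sketch with random tensors standing in for token embeddings plus positional encodings:

import torch
import torch.nn as nn

model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, batch_first=True)

src = torch.randn(1, 5, 64)  # encoder input: 5 source tokens
tgt = torch.randn(1, 4, 64)  # decoder input: target tokens so far

# Causal mask so decoder position i attends only to positions <= i.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(4)

# Internally: encoder self-attn, then decoder causal self-attn + cross-attn.
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)  # torch.Size([1, 4, 64])
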
Comparison
When to Use Each Architecture
ARCHITECTURE COMPARISON:

ENCODER-DECODER (T5, BART, mBART):
  ✓ Translation: "Hello" → "Hola"
  ✓ Summarization: long text → short summary
  ✓ Question answering with context
  ✓ Text-to-text tasks

  Advantages:
  - Full bidirectional understanding of the input
  - Explicit separation of understanding vs. generation
  - Good for tasks with distinct input/output

  Disadvantages:
  - More complex architecture
  - Harder to scale
  - Less flexible for chat/generation

DECODER-ONLY (GPT, Claude, LLaMA):
  ✓ Chat and conversation
  ✓ General text generation
  ✓ Reasoning tasks
  ✓ Code generation

  Advantages:
  - Simpler architecture
  - Scales better
  - Single unified model
  - Better few-shot learning

WHY DECODER-ONLY WON:
  1. Simpler to scale
  2. More flexible (can do everything)
  3. Better at in-context learning
  4. Easier to train with more data
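
One practical upshot of the comparison: an encoder-decoder model keeps input and output as two separate sequences joined by cross-attention, while a decoder-only model folds everything into one left-to-right sequence. A hedged sketch (the prompt wording is illustrative, not taken from any particular model card):

# Illustrative only: the same translation task framed for each architecture.

# Encoder-decoder (T5-style): input and output are distinct sequences.
encoder_input = "translate English to French: Hello, how are you?"
decoder_output = "Bonjour, comment allez-vous?"  # generated token by token

# Decoder-only (GPT-style): one sequence; the "input" is just left context.
prompt = (
    "Translate English to French.\n"
    "English: Hello, how are you?\n"
    "French:"
)
# The model simply continues the sequence with the translation.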

Interactive Exercise

Choose Architecture

For each task, which architecture would be most natural?

A. Translating English to French
B. Writing a creative story from a prompt
C. Summarizing a news article

Pro Tips
  • T5 treats everything as text-to-text, even classification (see the sketch after this list)
  • Cross-attention is what connects the encoder and the decoder
  • Modern decoder-only models can do translation too, just less efficiently
  • Encoder-decoder excels when input/output have different structures
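
The first tip is easy to see in code: T5 performs classification by generating label words as text. A sketch assuming Hugging Face transformers (the "cola sentence:" prefix and the acceptable/unacceptable labels follow the original T5 training mixture; treat the exact strings as assumptions):

# Sketch: classification as text-to-text with T5.
# Assumes Hugging Face `transformers`; the prefix and labels follow the
# T5 paper's CoLA (grammatical acceptability) setup.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Grammatical-acceptability classification, phrased as text generation.
inputs = tokenizer("cola sentence: The cat sat on the mat.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5)

# The "class label" is literally generated text: "acceptable" or "unacceptable".
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))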
