
Feed-Forward Network

Difficulty: Advanced (4/5)
Also known as: FFN, MLP, position-wise FFN

Definition

The feed-forward network (FFN) is a component in each transformer layer that processes each token independently through two linear transformations with a non-linear activation in between. It's where much of the model's knowledge is stored.

FFNs provide the computational capacity for transformers to learn complex patterns and store factual knowledge.
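
The two-layer structure is simple enough to show directly. Here is a minimal PyTorch sketch (the dimensions and GELU choice match the 768-wide example later in this entry; the class name FeedForward is our own):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal position-wise FFN: expand, apply non-linearity, project back."""
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)    # W1: expand 4x
        self.act = nn.GELU()                                 # non-linearity
        self.down = nn.Linear(expansion * d_model, d_model)  # W2: project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). The same weights are applied to
        # every token position independently; no mixing across tokens.
        return self.down(self.act(self.up(x)))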

Key Concepts

  • Position-wise: Same network applied independently to each token (verified in the sketch after this list)
  • Expansion: First layer expands dimensionality (typically 4x)
  • Activation: Non-linearity (ReLU, GELU, SwiGLU) between layers
  • Knowledge storage: Weights store learned facts and patterns
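
The position-wise property is easy to verify: because the FFN never mixes positions, shuffling tokens before the FFN gives the same result as shuffling its outputs afterwards. A small self-contained check (any per-token network behaves this way; we build a throwaway one inline):

import torch
import torch.nn as nn

# A throwaway two-layer FFN; any per-token network behaves the same way.
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

x = torch.randn(1, 5, 768)            # a batch with 5 tokens
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of tokens

# Permute-then-apply equals apply-then-permute: each token is
# processed alone, so token order (context) cannot matter.
assert torch.allclose(ffn(x[:, perm, :]), ffn(x)[:, perm, :], atol=1e-6)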

Examples

Architecture
FFN Structure in Transformers
TRANSFORMER LAYER STRUCTURE:

Input (d_model = 768)
        ↓
┌─────────────────────────────────┐
│ Multi-Head Attention            │  Token mixing
│ (tokens interact)               │
└─────────────────────────────────┘
        ↓
┌─────────────────────────────────┐
│ Feed-Forward Network            │  Feature processing
│ (each token processed alone)    │
│                                 │
│ x → Linear(768→3072)            │  Expand 4x
│   → GELU activation             │  Non-linearity
│   → Linear(3072→768)            │  Project back
└─────────────────────────────────┘
        ↓
Output (d_model = 768)

FFN FORMULA:

FFN(x) = W₂ · activation(W₁ · x + b₁) + b₂

PARAMETERS:

W₁: (768, 3072) ≈ 2.4M parameters
W₂: (3072, 768) ≈ 2.4M parameters
Total per layer: ≈ 4.7M params just in the FFN!

In GPT-3 (96 layers, d_model = 12288, d_ff = 49152):
FFN params = 96 × 2 × 12288 × 49152 ≈ 116B
(About 2/3 of total parameters are in FFNs!)
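
The arithmetic above is easy to check in a few lines of Python (biases ignored; the GPT-3 dimensions are the published ones, with the usual 4x expansion):

def ffn_params(d_model: int, expansion: int = 4) -> int:
    """Parameters in one FFN layer (W1 plus W2, ignoring biases)."""
    d_ff = expansion * d_model
    return d_model * d_ff + d_ff * d_model

print(ffn_params(768))               # 4,718,592  -> ~4.7M per layer
print(96 * ffn_params(12288) / 1e9)  # ~116.0     -> GPT-3's ~116B FFN params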
Function
What FFNs Do
FFN AS KNOWLEDGE STORAGE:

Research (Geva et al., 2021, "Transformer Feed-Forward Layers Are Key-Value Memories") shows FFNs act like key-value memories:

Input: "The capital of France is"
        ↓
┌─────────────────────────────────┐
│ FFN First Layer (expand)        │
│ Activates neurons for:          │
│  - "capital" concept            │
│  - "France" concept             │
│  - geographic knowledge         │
└─────────────────────────────────┘
        ↓
┌─────────────────────────────────┐
│ FFN Second Layer (project)      │
│ Combines activated patterns     │
│ → Retrieves "Paris" knowledge   │
└─────────────────────────────────┘

ATTENTION VS FFN:

┌─────────────────┬─────────────────┐
│ Attention       │ FFN             │
├─────────────────┼─────────────────┤
│ Token mixing    │ Token-wise      │
│ Context-aware   │ Context-free    │
│ Pattern matching│ Knowledge store │
│ "What to look   │ "What to do     │
│  at"            │  with it"       │
└─────────────────┴─────────────────┘

Analogy: Attention is like searching a library;
         the FFN is like the books themselves.
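
That key-value reading maps directly onto the weights: each row of W₁ acts as a "key" pattern matched against the input, and each row of W₂ acts as a "value" vector weighted by the resulting activation. A toy NumPy sketch (random weights and made-up dimensions, purely illustrative):

import numpy as np

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_ff, d_model))  # rows: "key" patterns
W2 = rng.standard_normal((d_ff, d_model))  # rows: "value" vectors

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

x = rng.standard_normal(d_model)  # one token's hidden state
scores = gelu(W1 @ x)             # how strongly each key matches, shape (d_ff,)
out = scores @ W2                 # weighted sum of value vectors, shape (d_model,)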

Interactive Exercise

Calculate FFN Parameters

If a model has d_model=1024 and FFN expansion factor of 4x, how many parameters are in one FFN layer? (Ignore biases)
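
Worked answer: d_ff = 4 × 1024 = 4096, so one FFN layer has 1024 × 4096 (W₁) + 4096 × 1024 (W₂) = 8,388,608 ≈ 8.4M parameters.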

Pro Tips
  • SwiGLU activation (LLaMA) uses 3 weight matrices instead of 2, increasing FFN params (see the sketch after this list)
  • Mixture of Experts (MoE) scales FFN capacity efficiently
  • Model-editing methods (e.g., ROME) fix stored facts by modifying FFN weights
  • Larger FFN ratio = more knowledge storage capacity
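
For the SwiGLU tip above, here is a sketch following the LLaMA formulation, FFN(x) = W_down · (SiLU(W_gate · x) ⊙ (W_up · x)). LLaMA keeps the total parameter count close to a standard 4x FFN by shrinking d_ff to roughly 2/3 of 4 · d_model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU FFN (LLaMA-style): three weight matrices, no biases."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # W_gate
        self.up = nn.Linear(d_model, d_ff, bias=False)    # W_up
        self.down = nn.Linear(d_ff, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate) acts as an input-dependent gate on the expanded
        # features before projecting back down to d_model.
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLUFFN(d_model=4096, d_ff=11008)  # LLaMA-7B's published sizes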

Related Terms