
Feed-Forward Network

Difficulty: Advanced (4/5)
Also known as: FFN, MLP, position-wise FFN

Definition

The feed-forward network (FFN) is a component in each transformer layer that processes each token independently through two linear transformations with a non-linear activation in between. It's where much of the model's knowledge is stored.

FFNs provide the computational capacity for transformers to learn complex patterns and store factual knowledge.
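
The two-layer structure is simple enough to show directly. Here is a minimal PyTorch sketch (the dimensions and GELU choice match the 768-wide example later in this entry; the class name FeedForward is our own):

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal position-wise FFN: expand, apply non-linearity, project back."""
    def __init__(self, d_model: int = 768, expansion: int = 4):
        super().__init__()
        self.up = nn.Linear(d_model, expansion * d_model)    # W1: expand 4x
        self.act = nn.GELU()                                 # non-linearity
        self.down = nn.Linear(expansion * d_model, d_model)  # W2: project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). The same weights are applied to
        # every token position independently; no mixing across tokens.
        return self.down(self.act(self.up(x)))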

Key Concepts

  • Position-wise: Same network applied independently to each token (verified in the sketch after this list)
  • Expansion: First layer expands dimensionality (typically 4x)
  • Activation: Non-linearity (ReLU, GELU, SwiGLU) between layers
  • Knowledge storage: Weights store learned facts and patterns
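
The position-wise property is easy to verify: because the FFN never mixes positions, shuffling tokens before the FFN gives the same result as shuffling its outputs afterwards. A small self-contained check (any per-token network behaves this way; we build a throwaway one inline):

import torch
import torch.nn as nn

# A throwaway two-layer FFN; any per-token network behaves the same way.
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

x = torch.randn(1, 5, 768)            # a batch with 5 tokens
perm = torch.tensor([4, 2, 0, 3, 1])  # an arbitrary reordering of tokens

# Permute-then-apply equals apply-then-permute: each token is
# processed alone, so token order (context) cannot matter.
assert torch.allclose(ffn(x[:, perm, :]), ffn(x)[:, perm, :], atol=1e-6)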

Examples

Architecture
FFN Structure in Transformers
TRANSFORMER LAYER STRUCTURE:

Input (d_model = 768)
        ↓
┌─────────────────────────────────┐
│ Multi-Head Attention            │  Token mixing
│ (tokens interact)               │
└─────────────────────────────────┘
        ↓
┌─────────────────────────────────┐
│ Feed-Forward Network            │  Feature processing
│ (each token processed alone)    │
│                                 │
│ x → Linear(768→3072)            │  Expand 4x
│   → GELU activation             │  Non-linearity
│   → Linear(3072→768)            │  Project back
└─────────────────────────────────┘
        ↓
Output (d_model = 768)

FFN FORMULA:

FFN(x) = W₂ · activation(W₁ · x + b₁) + b₂

PARAMETERS:

W₁: (768, 3072) ≈ 2.4M parameters
W₂: (3072, 768) ≈ 2.4M parameters
Total per layer: ≈ 4.7M params just in the FFN!

In GPT-3 (96 layers, d_model = 12288, d_ff = 49152):
FFN params = 96 × 2 × 12288 × 49152 ≈ 116B
(About 2/3 of total parameters are in FFNs!)
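
The arithmetic above is easy to check in a few lines of Python (biases ignored; the GPT-3 dimensions are the published ones, with the usual 4x expansion):

def ffn_params(d_model: int, expansion: int = 4) -> int:
    """Parameters in one FFN layer (W1 plus W2, ignoring biases)."""
    d_ff = expansion * d_model
    return d_model * d_ff + d_ff * d_model

print(ffn_params(768))               # 4,718,592  -> ~4.7M per layer
print(96 * ffn_params(12288) / 1e9)  # ~116.0     -> GPT-3's ~116B FFN params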
Function
What FFNs Do
FFN AS KNOWLEDGE STORAGE:

Research (Geva et al., 2021, "Transformer Feed-Forward Layers Are Key-Value Memories") shows FFNs act like key-value memories:

Input: "The capital of France is"
        ↓
┌─────────────────────────────────┐
│ FFN First Layer (expand)        │
│ Activates neurons for:          │
│  - "capital" concept            │
│  - "France" concept             │
│  - geographic knowledge         │
└─────────────────────────────────┘
        ↓
┌─────────────────────────────────┐
│ FFN Second Layer (project)      │
│ Combines activated patterns     │
│ → Retrieves "Paris" knowledge   │
└─────────────────────────────────┘

ATTENTION VS FFN:

┌─────────────────┬─────────────────┐
│ Attention       │ FFN             │
├─────────────────┼─────────────────┤
│ Token mixing    │ Token-wise      │
│ Context-aware   │ Context-free    │
│ Pattern matching│ Knowledge store │
│ "What to look   │ "What to do     │
│  at"            │  with it"       │
└─────────────────┴─────────────────┘

Analogy: Attention is like searching a library;
         the FFN is like the books themselves.
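
That key-value reading maps directly onto the weights: each row of W₁ acts as a "key" pattern matched against the input, and each row of W₂ acts as a "value" vector weighted by the resulting activation. A toy NumPy sketch (random weights and made-up dimensions, purely illustrative):

import numpy as np

d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1 = rng.standard_normal((d_ff, d_model))  # rows: "key" patterns
W2 = rng.standard_normal((d_ff, d_model))  # rows: "value" vectors

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

x = rng.standard_normal(d_model)  # one token's hidden state
scores = gelu(W1 @ x)             # how strongly each key matches, shape (d_ff,)
out = scores @ W2                 # weighted sum of value vectors, shape (d_model,)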

Interactive Exercise

Calculate FFN Parameters

If a model has d_model=1024 and FFN expansion factor of 4x, how many parameters are in one FFN layer? (Ignore biases)
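
Worked answer: d_ff = 4 × 1024 = 4096, so one FFN layer has 1024 × 4096 (W₁) + 4096 × 1024 (W₂) = 8,388,608 ≈ 8.4M parameters.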

Pro Tips
  • SwiGLU activation (LLaMA) uses 3 weight matrices instead of 2, increasing FFN params (see the sketch after this list)
  • Mixture of Experts (MoE) scales FFN capacity efficiently
  • Model-editing methods (e.g., ROME) fix stored facts by modifying FFN weights
  • Larger FFN ratio = more knowledge storage capacity
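
For the SwiGLU tip above, here is a sketch following the LLaMA formulation, FFN(x) = W_down · (SiLU(W_gate · x) ⊙ (W_up · x)). LLaMA keeps the total parameter count close to a standard 4x FFN by shrinking d_ff to roughly 2/3 of 4 · d_model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU FFN (LLaMA-style): three weight matrices, no biases."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)  # W_gate
        self.up = nn.Linear(d_model, d_ff, bias=False)    # W_up
        self.down = nn.Linear(d_ff, d_model, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU(gate) acts as an input-dependent gate on the expanded
        # features before projecting back down to d_model.
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLUFFN(d_model=4096, d_ff=11008)  # LLaMA-7B's published sizes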

Related Terms