Layer Normalization

Also known as: LayerNorm, LN
Difficulty: Advanced [4/5]
Category: Model Architecture / Transformer Components

Definition

Layer normalization normalizes the activations across the feature dimension for each sample independently. Unlike batch normalization, it doesn't depend on batch statistics, making it ideal for transformers and variable-length sequences.

LayerNorm stabilizes training by keeping activations in a consistent range throughout the network.
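For concreteness, here is a minimal PyTorch sketch (torch and the shapes below are illustrative assumptions, not from the source) showing that each token is normalized using only its own feature vector, so the result is independent of batch size and sequence length:

  import torch
  import torch.nn as nn

  d_model = 8
  ln = nn.LayerNorm(d_model)  # learnable scale (γ) and shift (β); eps=1e-5 by default

  # Inputs with different batch sizes and sequence lengths
  x_a = torch.randn(1, 3, d_model)   # (batch=1, seq=3,  features=8)
  x_b = torch.randn(4, 10, d_model)  # (batch=4, seq=10, features=8)

  # Each token is normalized over its own features only,
  # so the statistics never depend on the rest of the batch.
  for x in (x_a, x_b):
      y = ln(x)
      print(y.mean(dim=-1).abs().max())            # ≈ 0 for every token
      print(y.var(dim=-1, unbiased=False).mean())  # ≈ 1 for every token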

Key Concepts

  • Per-sample: Normalizes each sample independently
  • Feature-wise: Statistics computed across feature dimension
  • Learnable: Scale (γ) and shift (β) parameters
  • Position: Pre-LN vs Post-LN affects training stability

Examples

Mathematics
LayerNorm Formula
LAYER NORMALIZATION FORMULA:

For input x with feature dimension d:

  μ  = (1/d) × Σᵢ xᵢ           # mean across features
  σ² = (1/d) × Σᵢ (xᵢ − μ)²    # variance across features

  LayerNorm(x) = γ × (x − μ) / √(σ² + ε) + β

Where:
  - ε = small constant (e.g. 1e-5) for numerical stability
  - γ = learned scale (initialized to 1)
  - β = learned shift (initialized to 0)

EXAMPLE:

  Token embedding: x = [2.0, 4.0, 6.0, 8.0]

  μ  = (2 + 4 + 6 + 8) / 4 = 5.0
  σ² = ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 = 5.0
  σ  = √5 ≈ 2.24

  Normalized: [−1.34, −0.45, 0.45, 1.34]
  After scale/shift (γ=1, β=0): unchanged → [−1.34, −0.45, 0.45, 1.34]

  Result: mean ≈ 0, variance ≈ 1
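
Code
Verifying the Example
A minimal NumPy sketch (NumPy is an assumption; the values mirror the worked example above) that reproduces the normalized output by hand:

  import numpy as np

  x = np.array([2.0, 4.0, 6.0, 8.0])
  eps = 1e-5

  mu = x.mean()                 # 5.0
  var = x.var()                 # population variance: 5.0
  x_hat = (x - mu) / np.sqrt(var + eps)

  gamma, beta = 1.0, 0.0        # default initialization
  y = gamma * x_hat + beta
  print(y)                      # [-1.3416 -0.4472  0.4472  1.3416]
  print(y.mean(), y.var())      # ≈ 0.0, ≈ 1.0
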
Placement
Pre-LN vs Post-LN
LAYER NORM PLACEMENT IN TRANSFORMERS:

POST-LN (original Transformer):

  x → Attention → Add & Norm → FFN → Add & Norm
      (norm applied after each sublayer)

PRE-LN (modern default: GPT, LLaMA):

  x → Norm → Attention → Add → Norm → FFN → Add
      (norm applied before each sublayer)

COMPARISON:

┌─────────────┬─────────────┬────────────────┐
│             │ Post-LN     │ Pre-LN         │
├─────────────┼─────────────┼────────────────┤
│ Stability   │ Less stable │ More stable    │
│ Warmup      │ Required    │ Often optional │
│ Deep models │ Harder      │ Easier         │
│ Performance │ Slightly +  │ Slightly −     │
└─────────────┴─────────────┴────────────────┘

WHY PRE-LN IS PREFERRED:
  - Gradients flow better through the residual path
  - Enables training very deep models (100+ layers)
  - Less sensitive to learning rate
  - Often removes the need for learning rate warmup

RMSNorm (used in LLaMA and others) is a simpler variant that skips mean subtraction:

  RMSNorm(x) = x / √(mean(x²) + ε) × γ
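
Code
Pre-LN Block and RMSNorm
The placement difference is easiest to see in code. Below is a minimal PyTorch sketch under stated assumptions: the attention and FFN submodules are illustrative stand-ins rather than any particular model's implementation, the RMSNorm module follows the formula above (LLaMA itself uses a smaller ε of 1e-6), and the hyperparameters are arbitrary:

  import torch
  import torch.nn as nn

  class RMSNorm(nn.Module):
      # No mean subtraction and no shift β: cheaper than LayerNorm.
      def __init__(self, d_model, eps=1e-5):
          super().__init__()
          self.eps = eps
          self.gamma = nn.Parameter(torch.ones(d_model))

      def forward(self, x):
          rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
          return x / rms * self.gamma

  class PreLNBlock(nn.Module):
      # Pre-LN: normalize *before* each sublayer; the residual path stays un-normalized.
      def __init__(self, d_model, n_heads=4):
          super().__init__()
          self.norm1 = nn.LayerNorm(d_model)
          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.norm2 = nn.LayerNorm(d_model)
          self.ffn = nn.Sequential(
              nn.Linear(d_model, 4 * d_model),
              nn.GELU(),
              nn.Linear(4 * d_model, d_model),
          )

      def forward(self, x):
          h = self.norm1(x)                                   # x → Norm → Attention → Add
          x = x + self.attn(h, h, h, need_weights=False)[0]
          x = x + self.ffn(self.norm2(x))                     # x → Norm → FFN → Add
          return x

  block = PreLNBlock(d_model=8)
  out = block(torch.randn(2, 5, 8))  # (batch, seq, features) in, same shape out

Swapping nn.LayerNorm for RMSNorm in this block gives the LLaMA-style variant with fewer operations per token.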

Interactive Exercise

LayerNorm vs BatchNorm

Why do transformers use LayerNorm instead of BatchNorm? List at least 2 reasons.

Pro Tips
  • RMSNorm is 10-15% faster than LayerNorm with similar results
  • Pre-LN often removes the need for learning rate warmup
  • Final layer often has LayerNorm before output projection
  • Normalization scale affects model calibration

Related Terms