Layer Normalization

Also known as: LayerNorm, LN
Difficulty: Advanced [4/5]
Category: Model Architecture / Transformer Components

Definition

Layer normalization normalizes the activations across the feature dimension for each sample independently. Unlike batch normalization, it doesn't depend on batch statistics, making it ideal for transformers and variable-length sequences.

LayerNorm stabilizes training by keeping activations in a consistent range throughout the network.
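For concreteness, here is a minimal PyTorch sketch (torch and the shapes below are illustrative assumptions, not from the source) showing that each token is normalized using only its own feature vector, so the result is independent of batch size and sequence length:

  import torch
  import torch.nn as nn

  d_model = 8
  ln = nn.LayerNorm(d_model)  # learnable scale (γ) and shift (β); eps=1e-5 by default

  # Inputs with different batch sizes and sequence lengths
  x_a = torch.randn(1, 3, d_model)   # (batch=1, seq=3,  features=8)
  x_b = torch.randn(4, 10, d_model)  # (batch=4, seq=10, features=8)

  # Each token is normalized over its own features only,
  # so the statistics never depend on the rest of the batch.
  for x in (x_a, x_b):
      y = ln(x)
      print(y.mean(dim=-1).abs().max())            # ≈ 0 for every token
      print(y.var(dim=-1, unbiased=False).mean())  # ≈ 1 for every token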

Key Concepts

  • Per-sample: Normalizes each sample independently
  • Feature-wise: Statistics computed across feature dimension
  • Learnable: Scale (γ) and shift (β) parameters
  • Position: Pre-LN vs Post-LN affects training stability

Examples

Mathematics
LayerNorm Formula
LAYER NORMALIZATION FORMULA:

For input x with feature dimension d:

  μ  = (1/d) × Σᵢ xᵢ           # mean across features
  σ² = (1/d) × Σᵢ (xᵢ − μ)²    # variance across features

  LayerNorm(x) = γ × (x − μ) / √(σ² + ε) + β

Where:
  - ε = small constant (e.g. 1e-5) for numerical stability
  - γ = learned scale (initialized to 1)
  - β = learned shift (initialized to 0)

EXAMPLE:

  Token embedding: x = [2.0, 4.0, 6.0, 8.0]

  μ  = (2 + 4 + 6 + 8) / 4 = 5.0
  σ² = ((2−5)² + (4−5)² + (6−5)² + (8−5)²) / 4 = 5.0
  σ  = √5 ≈ 2.24

  Normalized: [−1.34, −0.45, 0.45, 1.34]
  After scale/shift (γ=1, β=0): unchanged → [−1.34, −0.45, 0.45, 1.34]

  Result: mean ≈ 0, variance ≈ 1
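
Code
Verifying the Example
A minimal NumPy sketch (NumPy is an assumption; the values mirror the worked example above) that reproduces the normalized output by hand:

  import numpy as np

  x = np.array([2.0, 4.0, 6.0, 8.0])
  eps = 1e-5

  mu = x.mean()                 # 5.0
  var = x.var()                 # population variance: 5.0
  x_hat = (x - mu) / np.sqrt(var + eps)

  gamma, beta = 1.0, 0.0        # default initialization
  y = gamma * x_hat + beta
  print(y)                      # [-1.3416 -0.4472  0.4472  1.3416]
  print(y.mean(), y.var())      # ≈ 0.0, ≈ 1.0
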
Placement
Pre-LN vs Post-LN
LAYER NORM PLACEMENT IN TRANSFORMERS:

POST-LN (original Transformer):

  x → Attention → Add & Norm → FFN → Add & Norm
      (norm applied after each sublayer)

PRE-LN (modern default: GPT, LLaMA):

  x → Norm → Attention → Add → Norm → FFN → Add
      (norm applied before each sublayer)

COMPARISON:

┌─────────────┬─────────────┬────────────────┐
│             │ Post-LN     │ Pre-LN         │
├─────────────┼─────────────┼────────────────┤
│ Stability   │ Less stable │ More stable    │
│ Warmup      │ Required    │ Often optional │
│ Deep models │ Harder      │ Easier         │
│ Performance │ Slightly +  │ Slightly −     │
└─────────────┴─────────────┴────────────────┘

WHY PRE-LN IS PREFERRED:
  - Gradients flow better through the residual path
  - Enables training very deep models (100+ layers)
  - Less sensitive to learning rate
  - Often removes the need for learning rate warmup

RMSNorm (used in LLaMA and others) is a simpler variant that skips mean subtraction:

  RMSNorm(x) = x / √(mean(x²) + ε) × γ
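
Code
Pre-LN Block and RMSNorm
The placement difference is easiest to see in code. Below is a minimal PyTorch sketch under stated assumptions: the attention and FFN submodules are illustrative stand-ins rather than any particular model's implementation, the RMSNorm module follows the formula above (LLaMA itself uses a smaller ε of 1e-6), and the hyperparameters are arbitrary:

  import torch
  import torch.nn as nn

  class RMSNorm(nn.Module):
      # No mean subtraction and no shift β: cheaper than LayerNorm.
      def __init__(self, d_model, eps=1e-5):
          super().__init__()
          self.eps = eps
          self.gamma = nn.Parameter(torch.ones(d_model))

      def forward(self, x):
          rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
          return x / rms * self.gamma

  class PreLNBlock(nn.Module):
      # Pre-LN: normalize *before* each sublayer; the residual path stays un-normalized.
      def __init__(self, d_model, n_heads=4):
          super().__init__()
          self.norm1 = nn.LayerNorm(d_model)
          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.norm2 = nn.LayerNorm(d_model)
          self.ffn = nn.Sequential(
              nn.Linear(d_model, 4 * d_model),
              nn.GELU(),
              nn.Linear(4 * d_model, d_model),
          )

      def forward(self, x):
          h = self.norm1(x)                                   # x → Norm → Attention → Add
          x = x + self.attn(h, h, h, need_weights=False)[0]
          x = x + self.ffn(self.norm2(x))                     # x → Norm → FFN → Add
          return x

  block = PreLNBlock(d_model=8)
  out = block(torch.randn(2, 5, 8))  # (batch, seq, features) in, same shape out

Swapping nn.LayerNorm for RMSNorm in this block gives the LLaMA-style variant with fewer operations per token.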

Interactive Exercise

LayerNorm vs BatchNorm

Why do transformers use LayerNorm instead of BatchNorm? List at least 2 reasons.

Pro Tips
  • RMSNorm is 10-15% faster than LayerNorm with similar results
  • Pre-LN often removes the need for learning rate warmup
  • Final layer often has LayerNorm before output projection
  • Normalization scale affects model calibration

Related Terms