Training / Generalization

Regularization

Intermediate [3/5]
Weight penalty · Shrinkage · Constraint

Definition

Regularization encompasses techniques that prevent overfitting by adding constraints or penalties to the learning process. This encourages simpler models that generalize better to unseen data rather than memorizing training examples.

Regularization is essential for training models that work well in the real world, not just on training data.

Key Concepts

  • Penalty terms: Add to loss function to discourage complexity
  • Weight decay: L2 regularization applied via optimizer
  • Dropout: Randomly disable neurons during training
  • Early stopping: Stop before overfitting occurs (see the sketch below)
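
Of these, early stopping is the simplest to implement by hand: track validation loss and stop once it stops improving. A minimal sketch, where the patience of 3 and the train_one_epoch/evaluate helpers are hypothetical placeholders, not a specific library API:

  import torch

  best_val, bad_epochs, patience = float("inf"), 0, 3

  for epoch in range(50):                     # max epochs, illustrative
      train_one_epoch(model, train_loader)    # hypothetical helper
      val_loss = evaluate(model, val_loader)  # hypothetical helper
      if val_loss < best_val:
          best_val, bad_epochs = val_loss, 0
          torch.save(model.state_dict(), "best.pt")  # keep best checkpoint
      else:
          bad_epochs += 1
          if bad_epochs >= patience:
              break  # validation stopped improving → stop training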

Examples

Mathematics
L1 and L2 Regularization
REGULARIZATION FORMULAS:

Original loss:
  L(θ) = CrossEntropy(y, ŷ)

L2 REGULARIZATION (Ridge):
  L_reg = L(θ) + λ × Σ w²
  - Penalizes large weights quadratically
  - Encourages small, distributed weights
  - Most common for neural networks

L1 REGULARIZATION (Lasso):
  L_reg = L(θ) + λ × Σ |w|
  - Penalizes large weights linearly
  - Encourages sparse weights (many zeros)
  - Good for feature selection

ELASTIC NET (L1 + L2):
  L_reg = L(θ) + λ₁ × Σ|w| + λ₂ × Σw²

EFFECT ON WEIGHTS:
  Without regularization: weights = [5.2, -8.1, 0.01, 12.3, -0.5]
    (some very large values)
  With L2 regularization: weights = [1.2, -1.8, 0.01, 2.1, -0.3]
    (all weights shrunk toward zero)
  With L1 regularization: weights = [1.5, -2.0, 0.0, 2.5, 0.0]
    (some weights exactly zero → sparse)
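
These penalties are easy to compute by hand in PyTorch; a minimal sketch, where the model, λ = 0.01, and batch shapes are all illustrative:

  import torch
  import torch.nn as nn

  model = nn.Linear(10, 2)            # illustrative model
  criterion = nn.CrossEntropyLoss()
  lam = 0.01                          # λ, illustrative value

  x = torch.randn(8, 10)
  y = torch.randint(0, 2, (8,))
  base_loss = criterion(model(x), y)  # L(θ)

  # L2 penalty: Σ w² over all parameters
  l2 = sum((p ** 2).sum() for p in model.parameters())
  # L1 penalty: Σ |w| over all parameters
  l1 = sum(p.abs().sum() for p in model.parameters())

  loss = base_loss + lam * l2         # swap in l1 for Lasso-style sparsity
  loss.backward()

In practice the L2 case is usually handled by the optimizer's weight_decay argument instead, as shown under Practice.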
Practice
Regularization in Practice
COMMON REGULARIZATION TECHNIQUES:

1. WEIGHT DECAY (L2 via optimizer)

   # Most common in modern training
   optimizer = AdamW(
       model.parameters(),
       lr=1e-4,
       weight_decay=0.01,  # λ = 0.01
   )

2. DROPOUT

   # Randomly zero activations during training
   class Model(nn.Module):
       def __init__(self):
           super().__init__()
           self.layer1 = nn.Linear(512, 512)  # layer sizes illustrative
           self.layer2 = nn.Linear(512, 10)
           self.dropout = nn.Dropout(0.1)

       def forward(self, x):
           x = self.layer1(x)
           x = self.dropout(x)  # 10% of activations dropped
           return self.layer2(x)

3. LAYER NORMALIZATION

   # Stabilizes training; mild regularization effect
   norm = nn.LayerNorm(hidden_dim)  # hidden_dim = feature size
   x = norm(x)

4. LABEL SMOOTHING

   # Prevents overconfident predictions
   criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
   loss = criterion(pred, target)
   # Target: [0, 0, 1, 0] → [0.025, 0.025, 0.925, 0.025]

5. DATA AUGMENTATION

   # More diverse training data
   - Back-translation for text
   - Random crops for images

TYPICAL VALUES:

┌──────────────────┬─────────────────┐
│ Technique        │ Typical Range   │
├──────────────────┼─────────────────┤
│ Weight decay     │ 0.01 - 0.1      │
│ Dropout          │ 0.1 - 0.5       │
│ Label smoothing  │ 0.1             │
└──────────────────┴─────────────────┘
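
To make the augmentation item concrete, a minimal torchvision sketch for images, with crop size, padding, and flip probability as illustrative values:

  from torchvision import transforms

  augment = transforms.Compose([
      transforms.RandomCrop(32, padding=4),    # random crops (CIFAR-style sizes)
      transforms.RandomHorizontalFlip(p=0.5),  # flip half the images
      transforms.ToTensor(),
  ])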

Interactive Exercise

Choose Regularization

You're fine-tuning BERT on 1,000 examples. Training accuracy reaches 98% but validation accuracy is only 75%. Which regularization techniques would you apply, and why?
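
One plausible starting configuration for this scenario, not the only answer: raise dropout, keep weight decay, and stop early on the validation set. The sketch below assumes the Hugging Face transformers library with the bert-base-uncased checkpoint, and all hyperparameters are illustrative:

  import torch
  from transformers import AutoModelForSequenceClassification

  # Raise dropout above the 0.1 default to fight memorization on 1k examples
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased",
      hidden_dropout_prob=0.2,
      attention_probs_dropout_prob=0.2,
  )
  # Weight decay as usual; small learning rate for fine-tuning
  optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
  # ...then train with early stopping on validation loss (see sketch above)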

Pro Tips
  • Weight decay is standard—rarely train without it
  • Too much regularization → underfitting (high bias)
  • Fine-tuning needs more regularization than pretraining
  • Combine multiple techniques for best results
