Training / Generalization

Regularization

Intermediate [3/5]
Weight penalty · Shrinkage · Constraint

Definition

Regularization encompasses techniques that prevent overfitting by adding constraints or penalties to the learning process. This encourages simpler models that generalize better to unseen data rather than memorizing training examples.

Regularization is essential for training models that work well in the real world, not just on training data.

Key Concepts

  • Penalty terms: Add to loss function to discourage complexity
  • Weight decay: L2 regularization applied via optimizer
  • Dropout: Randomly disable neurons during training
  • Early stopping: Stop before overfitting occurs (see the sketch below)
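
Of these, early stopping is the simplest to implement by hand: track validation loss and stop once it stops improving. A minimal sketch, where the patience of 3 and the train_one_epoch/evaluate helpers are hypothetical placeholders, not a specific library API:

  import torch

  best_val, bad_epochs, patience = float("inf"), 0, 3

  for epoch in range(50):                     # max epochs, illustrative
      train_one_epoch(model, train_loader)    # hypothetical helper
      val_loss = evaluate(model, val_loader)  # hypothetical helper
      if val_loss < best_val:
          best_val, bad_epochs = val_loss, 0
          torch.save(model.state_dict(), "best.pt")  # keep best checkpoint
      else:
          bad_epochs += 1
          if bad_epochs >= patience:
              break  # validation stopped improving → stop training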

Examples

Mathematics
L1 and L2 Regularization
REGULARIZATION FORMULAS:

Original loss:
  L(θ) = CrossEntropy(y, ŷ)

L2 REGULARIZATION (Ridge):
  L_reg = L(θ) + λ × Σ w²
  - Penalizes large weights quadratically
  - Encourages small, distributed weights
  - Most common for neural networks

L1 REGULARIZATION (Lasso):
  L_reg = L(θ) + λ × Σ |w|
  - Penalizes large weights linearly
  - Encourages sparse weights (many zeros)
  - Good for feature selection

ELASTIC NET (L1 + L2):
  L_reg = L(θ) + λ₁ × Σ|w| + λ₂ × Σw²

EFFECT ON WEIGHTS:
  Without regularization: weights = [5.2, -8.1, 0.01, 12.3, -0.5]
    (some very large values)
  With L2 regularization: weights = [1.2, -1.8, 0.01, 2.1, -0.3]
    (all weights shrunk toward zero)
  With L1 regularization: weights = [1.5, -2.0, 0.0, 2.5, 0.0]
    (some weights exactly zero → sparse)
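
These penalties are easy to compute by hand in PyTorch; a minimal sketch, where the model, λ = 0.01, and batch shapes are all illustrative:

  import torch
  import torch.nn as nn

  model = nn.Linear(10, 2)            # illustrative model
  criterion = nn.CrossEntropyLoss()
  lam = 0.01                          # λ, illustrative value

  x = torch.randn(8, 10)
  y = torch.randint(0, 2, (8,))
  base_loss = criterion(model(x), y)  # L(θ)

  # L2 penalty: Σ w² over all parameters
  l2 = sum((p ** 2).sum() for p in model.parameters())
  # L1 penalty: Σ |w| over all parameters
  l1 = sum(p.abs().sum() for p in model.parameters())

  loss = base_loss + lam * l2         # swap in l1 for Lasso-style sparsity
  loss.backward()

In practice the L2 case is usually handled by the optimizer's weight_decay argument instead, as shown under Practice.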
Practice
Regularization in Practice
COMMON REGULARIZATION TECHNIQUES:

1. WEIGHT DECAY (L2 via optimizer)

   # Most common in modern training
   optimizer = AdamW(
       model.parameters(),
       lr=1e-4,
       weight_decay=0.01,  # λ = 0.01
   )

2. DROPOUT

   # Randomly zero activations during training
   class Model(nn.Module):
       def __init__(self):
           super().__init__()
           self.layer1 = nn.Linear(512, 512)  # layer sizes illustrative
           self.layer2 = nn.Linear(512, 10)
           self.dropout = nn.Dropout(0.1)

       def forward(self, x):
           x = self.layer1(x)
           x = self.dropout(x)  # 10% of activations dropped
           return self.layer2(x)

3. LAYER NORMALIZATION

   # Stabilizes training; mild regularization effect
   norm = nn.LayerNorm(hidden_dim)  # hidden_dim = feature size
   x = norm(x)

4. LABEL SMOOTHING

   # Prevents overconfident predictions
   criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
   loss = criterion(pred, target)
   # Target: [0, 0, 1, 0] → [0.025, 0.025, 0.925, 0.025]

5. DATA AUGMENTATION

   # More diverse training data
   - Back-translation for text
   - Random crops for images

TYPICAL VALUES:

┌──────────────────┬─────────────────┐
│ Technique        │ Typical Range   │
├──────────────────┼─────────────────┤
│ Weight decay     │ 0.01 - 0.1      │
│ Dropout          │ 0.1 - 0.5       │
│ Label smoothing  │ 0.1             │
└──────────────────┴─────────────────┘
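
To make the augmentation item concrete, a minimal torchvision sketch for images, with crop size, padding, and flip probability as illustrative values:

  from torchvision import transforms

  augment = transforms.Compose([
      transforms.RandomCrop(32, padding=4),    # random crops (CIFAR-style sizes)
      transforms.RandomHorizontalFlip(p=0.5),  # flip half the images
      transforms.ToTensor(),
  ])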

Interactive Exercise

Choose Regularization

You're fine-tuning BERT on 1,000 examples. Training accuracy reaches 98% but validation accuracy is only 75%. Which regularization techniques would you apply, and why?
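
One plausible starting configuration for this scenario, not the only answer: raise dropout, keep weight decay, and stop early on the validation set. The sketch below assumes the Hugging Face transformers library with the bert-base-uncased checkpoint, and all hyperparameters are illustrative:

  import torch
  from transformers import AutoModelForSequenceClassification

  # Raise dropout above the 0.1 default to fight memorization on 1k examples
  model = AutoModelForSequenceClassification.from_pretrained(
      "bert-base-uncased",
      hidden_dropout_prob=0.2,
      attention_probs_dropout_prob=0.2,
  )
  # Weight decay as usual; small learning rate for fine-tuning
  optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
  # ...then train with early stopping on validation loss (see sketch above)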

Pro Tips
  • Weight decay is standard—rarely train without it
  • Too much regularization → underfitting (high bias)
  • Fine-tuning needs more regularization than pretraining
  • Combine multiple techniques for best results
