
Dropout


Definition

Dropout is a regularization technique that randomly sets a fraction of neuron outputs to zero during training. This prevents neurons from co-adapting too much and forces the network to learn more robust features that don't depend on any single neuron.

Dropout can be viewed as implicitly training an ensemble of subnetworks that share parameters; at inference, running the full network approximates averaging their predictions.

Key Concepts

  • Dropout rate: Probability of dropping each neuron (e.g., 0.1 = 10%)
  • Training only: Dropout is disabled during inference
  • Scaling: Activations scaled by 1/(1-p) to maintain expected values (see the sketch after this list)
  • Placement: Typically after activation functions
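
A minimal sketch of that scaling, assuming PyTorch; manual_dropout is an illustrative name, not a library function:

import torch

def manual_dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each element with probability p,
    then scale the survivors by 1/(1-p) so the expected value is unchanged."""
    if not training or p == 0.0:
        return x                                 # inference: identity, nothing to rescale
    mask = (torch.rand_like(x) > p).float()      # 1 = keep, 0 = drop
    return x * mask / (1.0 - p)                  # kept activations scaled by 1/(1-p)

x = torch.ones(8)
print(manual_dropout(x, p=0.5, training=True))   # roughly half zeros, survivors equal 2.0
print(manual_dropout(x, p=0.5, training=False))  # unchanged: all ones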

Examples

Visualization
How Dropout Works
DROPOUT DURING TRAINING (p=0.5):

WITHOUT DROPOUT:
Input → [●] → [●] → [●] → [●] → Output
        [●] → [●] → [●] → [●]
        [●] → [●] → [●] → [●]
All neurons active, can co-adapt

WITH DROPOUT (50% dropped each forward pass):
Input → [●] → [○] → [●] → [○] → Output
        [○] → [●] → [○] → [●]
        [●] → [○] → [●] → [●]

● = active neuron   ○ = dropped neuron (output = 0)
Each training step uses a different random mask!

DURING INFERENCE:
All neurons active, but scaled by (1-p):
Input → [●] → [●] → [●] → [●] → Output
        [●] → [●] → [●] → [●]
        [●] → [●] → [●] → [●]
(Or use "inverted dropout" - scale during training instead)
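
PyTorch's nn.Dropout follows the inverted-dropout convention, so the training/inference behaviour above can be checked directly (illustrative; which positions get zeroed is random):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()     # training mode: random mask plus 1/(1-p) scaling
print(drop(x))   # roughly half zeros, remaining entries equal 2.0

drop.eval()      # inference mode: dropout is a no-op
print(drop(x))   # all ones, unchanged
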
Implementation
Dropout in Transformers
DROPOUT IN TRANSFORMER ARCHITECTURE:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        # MultiHeadAttention and FeedForward are assumed to be defined elsewhere
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff = FeedForward(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention with dropout
        attn_out = self.attn(x)
        attn_out = self.dropout(attn_out)   # ← Dropout
        x = self.norm1(x + attn_out)

        # Feedforward with dropout
        ff_out = self.ff(x)
        ff_out = self.dropout(ff_out)       # ← Dropout
        x = self.norm2(x + ff_out)
        return x

DROPOUT LOCATIONS IN TRANSFORMERS:
1. After attention output projection
2. After feedforward layers
3. On attention weights (attention dropout)
4. On embeddings (embedding dropout)

TYPICAL VALUES:
┌─────────────────────┬──────────────┐
│ Model               │ Dropout Rate │
├─────────────────────┼──────────────┤
│ BERT                │ 0.1          │
│ GPT-2               │ 0.1          │
│ GPT-3/4             │ 0.0 - 0.1    │
│ Fine-tuning (small) │ 0.1 - 0.3    │
└─────────────────────┴──────────────┘
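
For location 3 above, attention dropout is applied directly to the softmaxed attention weights. A hedged sketch of that placement (attention_with_dropout is an illustrative helper, not part of the block above):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_with_dropout(q, k, v, attn_dropout: nn.Dropout):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = F.softmax(scores, dim=-1)
    weights = attn_dropout(weights)   # ← attention dropout: randomly zero attention weights
    return weights @ v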

Interactive Exercise

Understand Dropout Scaling

With dropout p=0.5, if a neuron's output during training is 2.0 (when not dropped), what should its expected output be? Why do we need scaling?
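
One way to work through it: without scaling, the neuron outputs 2.0 with probability 0.5 and 0 with probability 0.5, so its expected output falls to 0.5 × 2.0 = 1.0, shifting the statistics the next layer sees between training and inference. With inverted dropout the surviving output is scaled by 1/(1-0.5) = 2, i.e. to 4.0, so the expectation stays at 0.5 × 4.0 = 2.0, matching the neuron's output at inference when dropout is off.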

Pro Tips
  • model.train() enables dropout; model.eval() disables it
  • Higher dropout for fully connected layers, lower for attention
  • Too much dropout → underfitting; too little → no regularization effect
  • DropConnect (dropping weights instead of activations) is a variant; see the sketch after this list
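
A minimal sketch of the DropConnect variant mentioned above, assuming a plain linear layer; drop_connect_linear is an illustrative name, and the 1/(1-p) weight scaling mirrors inverted dropout rather than the original paper's inference scheme:

import torch
import torch.nn.functional as F

def drop_connect_linear(x, weight, bias, p=0.5, training=True):
    """DropConnect: randomly zero individual weights (not activations)."""
    if not training or p == 0.0:
        return F.linear(x, weight, bias)
    mask = (torch.rand_like(weight) > p).float()          # per-weight keep/drop mask
    return F.linear(x, weight * mask / (1.0 - p), bias)   # kept weights rescaled by 1/(1-p)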

Related Terms