
Dropout


Definition

Dropout is a regularization technique that randomly sets a fraction of neuron outputs to zero during training. This prevents neurons from co-adapting too much and forces the network to learn more robust features that don't depend on any single neuron.

Dropout can be viewed as implicitly training an ensemble of subnetworks that share parameters; at inference, running the full network approximates averaging their predictions.

Key Concepts

  • Dropout rate: Probability of dropping each neuron (e.g., 0.1 = 10%)
  • Training only: Dropout is disabled during inference
  • Scaling: Activations scaled by 1/(1-p) to maintain expected values (see the sketch after this list)
  • Placement: Typically after activation functions
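
A minimal sketch of that scaling, assuming PyTorch; manual_dropout is an illustrative name, not a library function:

import torch

def manual_dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each element with probability p,
    then scale the survivors by 1/(1-p) so the expected value is unchanged."""
    if not training or p == 0.0:
        return x                                 # inference: identity, nothing to rescale
    mask = (torch.rand_like(x) > p).float()      # 1 = keep, 0 = drop
    return x * mask / (1.0 - p)                  # kept activations scaled by 1/(1-p)

x = torch.ones(8)
print(manual_dropout(x, p=0.5, training=True))   # roughly half zeros, survivors equal 2.0
print(manual_dropout(x, p=0.5, training=False))  # unchanged: all ones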

Examples

Visualization
How Dropout Works
DROPOUT DURING TRAINING (p=0.5):

WITHOUT DROPOUT:
Input → [●] → [●] → [●] → [●] → Output
        [●] → [●] → [●] → [●]
        [●] → [●] → [●] → [●]
All neurons active, can co-adapt

WITH DROPOUT (50% dropped each forward pass):
Input → [●] → [○] → [●] → [○] → Output
        [○] → [●] → [○] → [●]
        [●] → [○] → [●] → [●]

● = active neuron   ○ = dropped neuron (output = 0)
Each training step uses a different random mask!

DURING INFERENCE:
All neurons active, but scaled by (1-p):
Input → [●] → [●] → [●] → [●] → Output
        [●] → [●] → [●] → [●]
        [●] → [●] → [●] → [●]
(Or use "inverted dropout" - scale during training instead)
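
PyTorch's nn.Dropout follows the inverted-dropout convention, so the training/inference behaviour above can be checked directly (illustrative; which positions get zeroed is random):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()     # training mode: random mask plus 1/(1-p) scaling
print(drop(x))   # roughly half zeros, remaining entries equal 2.0

drop.eval()      # inference mode: dropout is a no-op
print(drop(x))   # all ones, unchanged
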
Implementation
Dropout in Transformers
DROPOUT IN TRANSFORMER ARCHITECTURE:

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        # MultiHeadAttention and FeedForward are assumed to be defined elsewhere
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.ff = FeedForward(d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention with dropout
        attn_out = self.attn(x)
        attn_out = self.dropout(attn_out)   # ← Dropout
        x = self.norm1(x + attn_out)

        # Feedforward with dropout
        ff_out = self.ff(x)
        ff_out = self.dropout(ff_out)       # ← Dropout
        x = self.norm2(x + ff_out)
        return x

DROPOUT LOCATIONS IN TRANSFORMERS:
1. After attention output projection
2. After feedforward layers
3. On attention weights (attention dropout)
4. On embeddings (embedding dropout)

TYPICAL VALUES:
┌─────────────────────┬──────────────┐
│ Model               │ Dropout Rate │
├─────────────────────┼──────────────┤
│ BERT                │ 0.1          │
│ GPT-2               │ 0.1          │
│ GPT-3/4             │ 0.0 - 0.1    │
│ Fine-tuning (small) │ 0.1 - 0.3    │
└─────────────────────┴──────────────┘
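
For location 3 above, attention dropout is applied directly to the softmaxed attention weights. A hedged sketch of that placement (attention_with_dropout is an illustrative helper, not part of the block above):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_with_dropout(q, k, v, attn_dropout: nn.Dropout):
    # q, k, v: (batch, heads, seq_len, head_dim)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = F.softmax(scores, dim=-1)
    weights = attn_dropout(weights)   # ← attention dropout: randomly zero attention weights
    return weights @ v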

Interactive Exercise

Understand Dropout Scaling

With dropout p=0.5, if a neuron's output during training is 2.0 (when not dropped), what should its expected output be? Why do we need scaling?
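
One way to work through it: without scaling, the neuron outputs 2.0 with probability 0.5 and 0 with probability 0.5, so its expected output falls to 0.5 × 2.0 = 1.0, shifting the statistics the next layer sees between training and inference. With inverted dropout the surviving output is scaled by 1/(1-0.5) = 2, i.e. to 4.0, so the expectation stays at 0.5 × 4.0 = 2.0, matching the neuron's output at inference when dropout is off.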

Pro Tips
  • model.train() enables dropout; model.eval() disables it
  • Higher dropout for fully connected layers, lower for attention
  • Too much dropout → underfitting; too little → no regularization effect
  • DropConnect (dropping weights instead of activations) is a variant; see the sketch after this list
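
A minimal sketch of the DropConnect variant mentioned above, assuming a plain linear layer; drop_connect_linear is an illustrative name, and the 1/(1-p) weight scaling mirrors inverted dropout rather than the original paper's inference scheme:

import torch
import torch.nn.functional as F

def drop_connect_linear(x, weight, bias, p=0.5, training=True):
    """DropConnect: randomly zero individual weights (not activations)."""
    if not training or p == 0.0:
        return F.linear(x, weight, bias)
    mask = (torch.rand_like(weight) > p).float()          # per-weight keep/drop mask
    return F.linear(x, weight * mask / (1.0 - p), bias)   # kept weights rescaled by 1/(1-p)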

Related Terms