Training / Hyperparameters

Learning Rate

Difficulty: Beginner (2/5)
Also known as: LR, step size, alpha (α)

Definition

Learning rate is a hyperparameter that controls how much to adjust model weights during training. It determines the step size when moving in the direction opposite to the gradient. A rate that is too high causes unstable training; one that is too low makes training needlessly slow.

The learning rate is often the single most important hyperparameter to tune for successful training.

Key Concepts

  • Hyperparameter: Set before training, not learned from data
  • Schedules: Often decay learning rate during training
  • Warmup: Start low, increase, then decay
  • Typical values: 1e-5 to 1e-2 depending on optimizer and task

Examples

Update Rule
How Learning Rate Works
WEIGHT UPDATE FORMULA:

    w_new = w_old - learning_rate × gradient

EXAMPLE:

    Current weight: w = 0.5
    Gradient: ∂L/∂w = 2.0 (loss increases as w increases)

    With LR = 0.1:   w_new = 0.5 - 0.1 × 2.0  = 0.5 - 0.2  = 0.3    (small step toward lower loss)
    With LR = 0.01:  w_new = 0.5 - 0.01 × 2.0 = 0.5 - 0.02 = 0.48   (tiny step - very slow progress)
    With LR = 1.0:   w_new = 0.5 - 1.0 × 2.0  = 0.5 - 2.0  = -1.5   (huge step - might overshoot the minimum!)

TYPICAL LEARNING RATES:

┌────────────────────┬───────────────┐
│ Scenario           │ Learning rate │
├────────────────────┼───────────────┤
│ Adam (pretraining) │ 1e-4 to 3e-4  │
│ Adam (fine-tuning) │ 1e-5 to 5e-5  │
│ SGD                │ 1e-2 to 1e-1  │
│ BERT fine-tune     │ 2e-5          │
│ LLM pretraining    │ 1e-4 to 6e-4  │
└────────────────────┴───────────────┘
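
A minimal sketch of the update rule above in plain Python, reproducing the three cases; the weight, gradient, and learning rates are the illustrative numbers from this example, not values from a real model.

# Sketch of the weight update rule: w_new = w_old - lr * gradient.
# Values are the illustrative numbers from the example above.
w = 0.5      # current weight
grad = 2.0   # ∂L/∂w: loss increases as w increases

for lr in (0.1, 0.01, 1.0):
    w_new = w - lr * grad
    print(f"LR = {lr}: w_new = {w} - {lr} × {grad} ≈ {w_new}")

# LR = 0.1:  w_new ≈ 0.3    (small step toward lower loss)
# LR = 0.01: w_new ≈ 0.48   (tiny step - very slow progress)
# LR = 1.0:  w_new = -1.5   (huge step - might overshoot the minimum)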
Schedules
Learning Rate Schedules
LEARNING RATE SCHEDULES:

1. CONSTANT (simplest)
   LR stays the same throughout training
   lr = 1e-4

2. STEP DECAY
   Drop the LR by a factor every N epochs
   lr = initial_lr × 0.1^(epoch // 30)

3. COSINE ANNEALING (popular for LLMs)
   LR follows a cosine curve down to near-zero
   lr = lr_min + 0.5 × (lr_max - lr_min) × (1 + cos(π × t / T))

4. LINEAR WARMUP + DECAY
   Common for transformer training:

   LR ↑
      │    ╱╲
      │   ╱  ╲
      │  ╱    ╲
      │ ╱      ╲
      │╱        ╲____
      └─────────────────→ Steps
       warmup  peak  decay

5. ONE CYCLE
   Ramp up then down, good for fast training

# PyTorch example
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    train()           # one training epoch
    scheduler.step()  # update the learning rate
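
The snippet above covers cosine annealing; below is a hedged sketch of schedule 4 (linear warmup followed by linear decay) built with PyTorch's LambdaLR. The stand-in model, warmup_steps, total_steps, and the 1e-4 peak LR are placeholder assumptions, not recommendations.

# Sketch: linear warmup + linear decay via a LambdaLR multiplier (placeholder values).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 1)                    # stand-in model
optimizer = AdamW(model.parameters(), lr=1e-4)    # peak learning rate

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                # ramp multiplier 0 → 1
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # 1 → 0

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    optimizer.step()   # placeholder; a real loop computes loss and calls loss.backward() first
    scheduler.step()   # multiply the base LR by lr_lambda(step)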

Interactive Exercise

Diagnose Training Issues

Your training shows: Loss starts at 2.5, then goes 2.4, 2.6, 3.1, 4.2, NaN. What's likely wrong and how would you fix it?

Pro Tips
  • Use a learning rate finder to identify a good range automatically
  • Warmup prevents early training instability in transformers
  • If loss spikes, reduce the LR by 10x as a first debugging step
  • Different parameter groups can have different learning rates (see the sketch below)
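
As one way to act on the last tip: PyTorch optimizers accept per-parameter groups, each with its own learning rate. The backbone/head split and the specific LR values below are illustrative assumptions, not prescribed settings.

# Sketch: different learning rates for different parameter groups (illustrative values).
import torch
from torch.optim import AdamW

# Hypothetical model with a pretrained backbone and a freshly initialized head.
model = torch.nn.ModuleDict({
    "backbone": torch.nn.Linear(768, 768),
    "head": torch.nn.Linear(768, 2),
})

optimizer = AdamW([
    {"params": model["backbone"].parameters(), "lr": 1e-5},  # small LR: already pretrained
    {"params": model["head"].parameters(),     "lr": 1e-3},  # larger LR: trained from scratch
])

for group in optimizer.param_groups:
    print(group["lr"])   # 1e-05, then 0.001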

Related Terms