Training / Hyperparameters

Learning Rate

Difficulty: Beginner (2/5)
Also known as: LR, step size, alpha (α)

Definition

Learning rate is a hyperparameter that controls how much to adjust model weights during training. It determines the step size when moving in the direction opposite to the gradient. A rate that is too high causes unstable training; one that is too low makes training needlessly slow.

The learning rate is often the single most important hyperparameter to tune for successful training.

Key Concepts

  • Hyperparameter: Set before training, not learned from data
  • Schedules: Often decay learning rate during training
  • Warmup: Start low, increase, then decay
  • Typical values: 1e-5 to 1e-2 depending on optimizer and task

Examples

Update Rule
How Learning Rate Works
WEIGHT UPDATE FORMULA:

    w_new = w_old - learning_rate × gradient

EXAMPLE:

    Current weight: w = 0.5
    Gradient: ∂L/∂w = 2.0 (loss increases as w increases)

    With LR = 0.1:   w_new = 0.5 - 0.1 × 2.0  = 0.5 - 0.2  = 0.3    (small step toward lower loss)
    With LR = 0.01:  w_new = 0.5 - 0.01 × 2.0 = 0.5 - 0.02 = 0.48   (tiny step - very slow progress)
    With LR = 1.0:   w_new = 0.5 - 1.0 × 2.0  = 0.5 - 2.0  = -1.5   (huge step - might overshoot the minimum!)

TYPICAL LEARNING RATES:

┌────────────────────┬───────────────┐
│ Scenario           │ Learning rate │
├────────────────────┼───────────────┤
│ Adam (pretraining) │ 1e-4 to 3e-4  │
│ Adam (fine-tuning) │ 1e-5 to 5e-5  │
│ SGD                │ 1e-2 to 1e-1  │
│ BERT fine-tune     │ 2e-5          │
│ LLM pretraining    │ 1e-4 to 6e-4  │
└────────────────────┴───────────────┘
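
A minimal sketch of the update rule above in plain Python, reproducing the three cases; the weight, gradient, and learning rates are the illustrative numbers from this example, not values from a real model.

# Sketch of the weight update rule: w_new = w_old - lr * gradient.
# Values are the illustrative numbers from the example above.
w = 0.5      # current weight
grad = 2.0   # ∂L/∂w: loss increases as w increases

for lr in (0.1, 0.01, 1.0):
    w_new = w - lr * grad
    print(f"LR = {lr}: w_new = {w} - {lr} × {grad} ≈ {w_new}")

# LR = 0.1:  w_new ≈ 0.3    (small step toward lower loss)
# LR = 0.01: w_new ≈ 0.48   (tiny step - very slow progress)
# LR = 1.0:  w_new = -1.5   (huge step - might overshoot the minimum)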
Schedules
Learning Rate Schedules
LEARNING RATE SCHEDULES:

1. CONSTANT (simplest)
   LR stays the same throughout training
   lr = 1e-4

2. STEP DECAY
   Drop the LR by a factor every N epochs
   lr = initial_lr × 0.1^(epoch // 30)

3. COSINE ANNEALING (popular for LLMs)
   LR follows a cosine curve down to near-zero
   lr = lr_min + 0.5 × (lr_max - lr_min) × (1 + cos(π × t / T))

4. LINEAR WARMUP + DECAY
   Common for transformer training:

   LR ↑
      │    ╱╲
      │   ╱  ╲
      │  ╱    ╲
      │ ╱      ╲
      │╱        ╲____
      └─────────────────→ Steps
       warmup  peak  decay

5. ONE CYCLE
   Ramp up then down, good for fast training

# PyTorch example
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

optimizer = AdamW(model.parameters(), lr=1e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    train()           # one training epoch
    scheduler.step()  # update the learning rate
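
The snippet above covers cosine annealing; below is a hedged sketch of schedule 4 (linear warmup followed by linear decay) built with PyTorch's LambdaLR. The stand-in model, warmup_steps, total_steps, and the 1e-4 peak LR are placeholder assumptions, not recommendations.

# Sketch: linear warmup + linear decay via a LambdaLR multiplier (placeholder values).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 1)                    # stand-in model
optimizer = AdamW(model.parameters(), lr=1e-4)    # peak learning rate

warmup_steps, total_steps = 100, 1000

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                # ramp multiplier 0 → 1
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # 1 → 0

scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    optimizer.step()   # placeholder; a real loop computes loss and calls loss.backward() first
    scheduler.step()   # multiply the base LR by lr_lambda(step)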

Interactive Exercise

Diagnose Training Issues

Your training shows: Loss starts at 2.5, then goes 2.4, 2.6, 3.1, 4.2, NaN. What's likely wrong and how would you fix it?

Pro Tips
  • Use a learning rate finder to identify a good range automatically
  • Warmup prevents early training instability in transformers
  • If loss spikes, reduce the LR by 10x as a first debugging step
  • Different parameter groups can have different learning rates (see the sketch below)
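
As one way to act on the last tip: PyTorch optimizers accept per-parameter groups, each with its own learning rate. The backbone/head split and the specific LR values below are illustrative assumptions, not prescribed settings.

# Sketch: different learning rates for different parameter groups (illustrative values).
import torch
from torch.optim import AdamW

# Hypothetical model with a pretrained backbone and a freshly initialized head.
model = torch.nn.ModuleDict({
    "backbone": torch.nn.Linear(768, 768),
    "head": torch.nn.Linear(768, 2),
})

optimizer = AdamW([
    {"params": model["backbone"].parameters(), "lr": 1e-5},  # small LR: already pretrained
    {"params": model["head"].parameters(),     "lr": 1e-3},  # larger LR: trained from scratch
])

for group in optimizer.param_groups:
    print(group["lr"])   # 1e-05, then 0.001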

Related Terms