Training / Optimization

Backpropagation

Intermediate [3/5]
Tags: Backprop, Backward pass, Reverse-mode autodiff

Definition

Backpropagation is the algorithm that computes gradients of the loss function with respect to each weight in the network. By applying the chain rule backwards through layers, it efficiently determines how each weight should change to reduce the loss.

Backpropagation is the foundation of how all neural networks learn, including LLMs with billions of parameters.
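
The chain rule is easy to sanity-check numerically. A minimal sketch in plain Python (g, f, and the test point are arbitrary choices for illustration, not from any library):

import math

# Composition: L(x) = f(g(x)), with g(x) = x² and f(u) = sin(u)
g = lambda x: x ** 2
f = lambda u: math.sin(u)

def grad_chain(x):
    # Chain rule: dL/dx = f'(g(x)) × g'(x) = cos(x²) × 2x
    return math.cos(g(x)) * 2 * x

def grad_numeric(x, eps=1e-6):
    # Finite-difference estimate of dL/dx for comparison
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(grad_chain(1.5), grad_numeric(1.5))  # both ≈ -1.88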

Key Concepts

  • Chain rule: Computes gradients through composed functions
  • Computational graph: Records operations during forward pass
  • Gradient flow: Gradients propagate from loss back to inputs
  • Efficient: One backward pass computes all gradients (see the sketch below)
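
All four ideas fit in a few dozen lines. Here is a minimal sketch of reverse-mode autodiff in plain Python (a hypothetical Value class in the spirit of micrograd, not any particular library's API): the forward pass records the computational graph, and backward() walks it in reverse, applying the chain rule at each node.

class Value:
    """A scalar that remembers how it was computed (a node in the computational graph)."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # filled in by the op that created this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            # local derivative of a+b is 1 w.r.t. both inputs; chain rule scales by out.grad
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            # d(a·b)/da = b and d(a·b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward_fn
        return out

    def backward(self):
        # One reverse sweep over the recorded graph computes every gradient
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0  # ∂L/∂L = 1: gradient flow starts at the loss
        for v in reversed(order):
            v._backward()

# loss = w·x + b; a single backward pass fills in all three gradients
w, x, b = Value(0.5), Value(2.0), Value(0.1)
loss = w * x + b
loss.backward()
print(w.grad, x.grad, b.grad)  # 2.0 0.5 1.0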

Examples

Intuition
Forward and Backward Pass
NEURAL NETWORK TRAINING CYCLE

FORWARD PASS (compute prediction):
  Input → Layer 1 → Layer 2 → Layer 3 → Output
  x → h1 → h2 → h3 → ŷ
  Loss = CrossEntropy(ŷ, y_true)

BACKWARD PASS (compute gradients, starting from the loss):
  ∂L/∂ŷ → ∂L/∂h3 → ∂L/∂h2 → ∂L/∂h1

  Chain rule at each step, e.g.:
  ∂L/∂W1 = ∂L/∂h1 × ∂h1/∂W1

UPDATE WEIGHTS:
  W1 = W1 - learning_rate × ∂L/∂W1

ANALOGY: Water flows downhill (the forward pass); backprop traces each drop back upstream to work out how a change at the source would alter what arrives downstream.
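
In an autodiff framework the whole cycle above collapses into a handful of calls. A minimal PyTorch sketch (the layer sizes and random batch are made up for illustration):

import torch
import torch.nn as nn

# A small 3-layer network: x → h1 → h2 → h3 → ŷ
model = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 10)            # dummy batch of 32 inputs
y_true = torch.randint(0, 2, (32,))

y_hat = model(x)                   # FORWARD PASS (graph is recorded)
loss = loss_fn(y_hat, y_true)      # Loss = CrossEntropy(ŷ, y_true)
loss.backward()                    # BACKWARD PASS: fills p.grad for every parameter p
optimizer.step()                   # UPDATE: W = W - learning_rate × ∂L/∂W
optimizer.zero_grad()              # clear gradients for the next iteration
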
Example
Backprop Through Simple Network
SIMPLE NETWORK: y = sigmoid(w2 × relu(w1 × x + b1) + b2)

Given: x=2, w1=0.5, b1=0.1, w2=0.3, b2=0.2
Target: y_true = 1

FORWARD PASS:
  z1 = w1×x + b1 = 0.5×2 + 0.1 = 1.1
  h1 = relu(z1) = 1.1
  z2 = w2×h1 + b2 = 0.3×1.1 + 0.2 = 0.53
  y = sigmoid(z2) = 0.629
  Loss = -(y_true×log(y)) = -log(0.629) = 0.464

BACKWARD PASS:
  ∂L/∂y = -1/y = -1.59
  ∂y/∂z2 = sigmoid(z2)×(1-sigmoid(z2)) = 0.233
  ∂L/∂z2 = -1.59 × 0.233 = -0.371
  ∂L/∂w2 = ∂L/∂z2 × h1 = -0.371 × 1.1 = -0.408
  ∂L/∂b2 = ∂L/∂z2 = -0.371
  ∂L/∂h1 = ∂L/∂z2 × w2 = -0.371 × 0.3 = -0.111
  ∂L/∂z1 = ∂L/∂h1 × 1 = -0.111 (relu grad = 1 since z1 > 0)
  ∂L/∂w1 = ∂L/∂z1 × x = -0.111 × 2 = -0.222
  ∂L/∂b1 = ∂L/∂z1 = -0.111

All the gradients are negative, so gradient descent (W = W - learning_rate × ∂L/∂W) will increase w1 and w2, reducing the loss.
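
The hand computation above is easy to verify mechanically. A short NumPy sketch that reproduces the same numbers (note that ∂L/∂y × ∂y/∂z2 simplifies to y − y_true for this loss):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, w1, b1, w2, b2, y_true = 2.0, 0.5, 0.1, 0.3, 0.2, 1.0

# Forward pass
z1 = w1 * x + b1          # 1.1
h1 = max(z1, 0.0)         # relu: 1.1
z2 = w2 * h1 + b2         # 0.53
y = sigmoid(z2)           # 0.629
loss = -np.log(y)         # 0.464

# Backward pass (chain rule, same order as above)
dL_dz2 = y - y_true                         # -0.371 (= ∂L/∂y × ∂y/∂z2)
dL_dw2 = dL_dz2 * h1                        # -0.408
dL_db2 = dL_dz2                             # -0.371
dL_dh1 = dL_dz2 * w2                        # -0.111
dL_dz1 = dL_dh1 * (1.0 if z1 > 0 else 0.0)  # -0.111 (relu grad)
dL_dw1 = dL_dz1 * x                         # -0.222
dL_db1 = dL_dz1                             # -0.111

print(dL_dw1, dL_db1, dL_dw2, dL_db2)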

Interactive Exercise

Gradient Flow Understanding

In a network with 100 layers, why might gradients become very small (vanish) or very large (explode) during backpropagation?
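
A numerical hint (plain Python, with made-up per-layer local derivatives of 0.9 and 1.1): the backward pass multiplies roughly one local derivative per layer, so over 100 layers the product changes geometrically.

# Product of 100 per-layer local derivatives
for local_grad in (0.9, 1.1):
    g = 1.0
    for _ in range(100):
        g *= local_grad
    print(local_grad, "->", g)
# 0.9 -> ~2.7e-05  (vanishing)
# 1.1 -> ~1.4e+04  (exploding)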

Pro Tips
  • Automatic differentiation (autograd) frameworks derive the backward pass for you
  • Skip connections (ResNet) help gradients flow through deep networks
  • Gradient checkpointing trades compute for memory in large models
  • Mixed precision training runs the forward and backward passes in FP16/BF16 while keeping FP32 master weights, with loss scaling to stop FP16 gradients underflowing (see the sketch below)
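
As an illustration of that last tip, a minimal sketch using PyTorch's AMP utilities (the tiny model and random batch are placeholders; this path requires a CUDA device):

import torch
import torch.nn as nn

device = "cuda"  # GradScaler-based AMP targets CUDA
model = nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # loss scaling so FP16 grads don't underflow

x = torch.randn(32, 10, device=device)
y = torch.randint(0, 2, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():       # forward runs in FP16 where it is safe to
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()         # backward pass on the scaled loss
scaler.step(optimizer)                # unscales grads, skips the step on overflow
scaler.update()                       # adapts the scale factor for the next step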

Related Terms