Training / Optimization

Backpropagation

Intermediate [3/5]
Tags: Backprop, Backward pass, Reverse-mode autodiff

Definition

Backpropagation is the algorithm that computes gradients of the loss function with respect to each weight in the network. By applying the chain rule backwards through layers, it efficiently determines how each weight should change to reduce the loss.

Backpropagation is the foundation of how all neural networks learn, including LLMs with billions of parameters.
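
The chain rule is easy to sanity-check numerically. A minimal sketch in plain Python (g, f, and the test point are arbitrary choices for illustration, not from any library):

import math

# Composition: L(x) = f(g(x)), with g(x) = x² and f(u) = sin(u)
g = lambda x: x ** 2
f = lambda u: math.sin(u)

def grad_chain(x):
    # Chain rule: dL/dx = f'(g(x)) × g'(x) = cos(x²) × 2x
    return math.cos(g(x)) * 2 * x

def grad_numeric(x, eps=1e-6):
    # Finite-difference estimate of dL/dx for comparison
    return (f(g(x + eps)) - f(g(x - eps))) / (2 * eps)

print(grad_chain(1.5), grad_numeric(1.5))  # both ≈ -1.88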

Key Concepts

  • Chain rule: Computes gradients through composed functions
  • Computational graph: Records operations during forward pass
  • Gradient flow: Gradients propagate from loss back to inputs
  • Efficient: One backward pass computes all gradients (see the sketch below)
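
All four ideas fit in a few dozen lines. Here is a minimal sketch of reverse-mode autodiff in plain Python (a hypothetical Value class in the spirit of micrograd, not any particular library's API): the forward pass records the computational graph, and backward() walks it in reverse, applying the chain rule at each node.

class Value:
    """A scalar that remembers how it was computed (a node in the computational graph)."""
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # filled in by the op that created this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            # local derivative of a+b is 1 w.r.t. both inputs; chain rule scales by out.grad
            self.grad += out.grad
            other.grad += out.grad
        out._backward = backward_fn
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            # d(a·b)/da = b and d(a·b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = backward_fn
        return out

    def backward(self):
        # One reverse sweep over the recorded graph computes every gradient
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0  # ∂L/∂L = 1: gradient flow starts at the loss
        for v in reversed(order):
            v._backward()

# loss = w·x + b; a single backward pass fills in all three gradients
w, x, b = Value(0.5), Value(2.0), Value(0.1)
loss = w * x + b
loss.backward()
print(w.grad, x.grad, b.grad)  # 2.0 0.5 1.0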

Examples

Intuition
Forward and Backward Pass
NEURAL NETWORK TRAINING CYCLE

FORWARD PASS (compute prediction):
  Input → Layer 1 → Layer 2 → Layer 3 → Output
  x → h1 → h2 → h3 → ŷ
  Loss = CrossEntropy(ŷ, y_true)

BACKWARD PASS (compute gradients, starting from the loss):
  ∂L/∂ŷ → ∂L/∂h3 → ∂L/∂h2 → ∂L/∂h1

  Chain rule at each step, e.g.:
  ∂L/∂W1 = ∂L/∂h1 × ∂h1/∂W1

UPDATE WEIGHTS:
  W1 = W1 - learning_rate × ∂L/∂W1

ANALOGY: Water flows downhill (the forward pass); backprop traces each drop back upstream to work out how a change at the source would alter what arrives downstream.
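
In an autodiff framework the whole cycle above collapses into a handful of calls. A minimal PyTorch sketch (the layer sizes and random batch are made up for illustration):

import torch
import torch.nn as nn

# A small 3-layer network: x → h1 → h2 → h3 → ŷ
model = nn.Sequential(
    nn.Linear(10, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 2),
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 10)            # dummy batch of 32 inputs
y_true = torch.randint(0, 2, (32,))

y_hat = model(x)                   # FORWARD PASS (graph is recorded)
loss = loss_fn(y_hat, y_true)      # Loss = CrossEntropy(ŷ, y_true)
loss.backward()                    # BACKWARD PASS: fills p.grad for every parameter p
optimizer.step()                   # UPDATE: W = W - learning_rate × ∂L/∂W
optimizer.zero_grad()              # clear gradients for the next iteration
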
Example
Backprop Through Simple Network
SIMPLE NETWORK: y = sigmoid(w2 × relu(w1 × x + b1) + b2)

Given: x=2, w1=0.5, b1=0.1, w2=0.3, b2=0.2
Target: y_true = 1

FORWARD PASS:
  z1 = w1×x + b1 = 0.5×2 + 0.1 = 1.1
  h1 = relu(z1) = 1.1
  z2 = w2×h1 + b2 = 0.3×1.1 + 0.2 = 0.53
  y = sigmoid(z2) = 0.629
  Loss = -(y_true×log(y)) = -log(0.629) = 0.464

BACKWARD PASS:
  ∂L/∂y = -1/y = -1.59
  ∂y/∂z2 = sigmoid(z2)×(1-sigmoid(z2)) = 0.233
  ∂L/∂z2 = -1.59 × 0.233 = -0.371
  ∂L/∂w2 = ∂L/∂z2 × h1 = -0.371 × 1.1 = -0.408
  ∂L/∂b2 = ∂L/∂z2 = -0.371
  ∂L/∂h1 = ∂L/∂z2 × w2 = -0.371 × 0.3 = -0.111
  ∂L/∂z1 = ∂L/∂h1 × 1 = -0.111 (relu grad = 1 since z1 > 0)
  ∂L/∂w1 = ∂L/∂z1 × x = -0.111 × 2 = -0.222
  ∂L/∂b1 = ∂L/∂z1 = -0.111

All the gradients are negative, so gradient descent (W = W - learning_rate × ∂L/∂W) will increase w1 and w2, reducing the loss.
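
The hand computation above is easy to verify mechanically. A short NumPy sketch that reproduces the same numbers (note that ∂L/∂y × ∂y/∂z2 simplifies to y − y_true for this loss):

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, w1, b1, w2, b2, y_true = 2.0, 0.5, 0.1, 0.3, 0.2, 1.0

# Forward pass
z1 = w1 * x + b1          # 1.1
h1 = max(z1, 0.0)         # relu: 1.1
z2 = w2 * h1 + b2         # 0.53
y = sigmoid(z2)           # 0.629
loss = -np.log(y)         # 0.464

# Backward pass (chain rule, same order as above)
dL_dz2 = y - y_true                         # -0.371 (= ∂L/∂y × ∂y/∂z2)
dL_dw2 = dL_dz2 * h1                        # -0.408
dL_db2 = dL_dz2                             # -0.371
dL_dh1 = dL_dz2 * w2                        # -0.111
dL_dz1 = dL_dh1 * (1.0 if z1 > 0 else 0.0)  # -0.111 (relu grad)
dL_dw1 = dL_dz1 * x                         # -0.222
dL_db1 = dL_dz1                             # -0.111

print(dL_dw1, dL_db1, dL_dw2, dL_db2)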

Interactive Exercise

Gradient Flow Understanding

In a network with 100 layers, why might gradients become very small (vanish) or very large (explode) during backpropagation?
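
A numerical hint (plain Python, with made-up per-layer local derivatives of 0.9 and 1.1): the backward pass multiplies roughly one local derivative per layer, so over 100 layers the product changes geometrically.

# Product of 100 per-layer local derivatives
for local_grad in (0.9, 1.1):
    g = 1.0
    for _ in range(100):
        g *= local_grad
    print(local_grad, "->", g)
# 0.9 -> ~2.7e-05  (vanishing)
# 1.1 -> ~1.4e+04  (exploding)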

Pro Tips
  • Automatic differentiation (autograd) frameworks derive the backward pass for you
  • Skip connections (ResNet) help gradients flow through deep networks
  • Gradient checkpointing trades compute for memory in large models
  • Mixed precision training runs the forward and backward passes in FP16/BF16 while keeping FP32 master weights, with loss scaling to stop FP16 gradients underflowing (see the sketch below)
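
As an illustration of that last tip, a minimal sketch using PyTorch's AMP utilities (the tiny model and random batch are placeholders; this path requires a CUDA device):

import torch
import torch.nn as nn

device = "cuda"  # GradScaler-based AMP targets CUDA
model = nn.Linear(10, 2).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()  # loss scaling so FP16 grads don't underflow

x = torch.randn(32, 10, device=device)
y = torch.randint(0, 2, (32,), device=device)

optimizer.zero_grad()
with torch.cuda.amp.autocast():       # forward runs in FP16 where it is safe to
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()         # backward pass on the scaled loss
scaler.step(optimizer)                # unscales grads, skips the step on overflow
scaler.update()                       # adapts the scale factor for the next step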

Related Terms