Model Architecture / Transformer Components

Residual Connection

Advanced [4/5]
Also known as: skip connection, shortcut connection, identity mapping

Definition

A residual connection adds the input of a layer directly to its output, allowing the network to learn residual functions instead of complete transformations. This enables training of very deep networks by providing gradient highways.

Residual connections are essential for training transformers with dozens or hundreds of layers.

Key Concepts

  • Identity shortcut: Input bypasses the layer and adds to output
  • Gradient flow: Gradients can flow directly through skip connection
  • Residual learning: Layer learns change (residual) rather than full mapping
  • Depth enabling: Makes very deep networks trainable
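The whole mechanism is a single addition around an arbitrary sublayer. A minimal sketch, assuming PyTorch (the wrapper name and the 512/2048 feed-forward sizes are illustrative, not taken from any particular model):

import torch
import torch.nn as nn

class ResidualWrapper(nn.Module):
    """Wraps any shape-preserving sublayer so it learns a residual: output = x + sublayer(x)."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity path carries x through unchanged; the sublayer only
        # has to learn what to ADD to it (the residual).
        return x + self.sublayer(x)

# Wrap a feed-forward sublayer; input and output shapes must match for the addition.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = ResidualWrapper(ffn)
out = block(torch.randn(8, 16, 512))   # (batch, seq, d_model) preserved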

Examples

Architecture
Residual Connection in Transformers
RESIDUAL CONNECTION STRUCTURE:

WITHOUT RESIDUAL:
  x → [Sublayer] → output          (learns the complete transformation)

WITH RESIDUAL:
  x → [Sublayer] → (+) → output
   ↘──────────────↗
     (skip connection)

  output = x + Sublayer(x)
  The layer learns the RESIDUAL: what to ADD to the input

TRANSFORMER BLOCK (residual around both attention and FFN):

  x (input)
      ↓
  ┌─────────┐
  │  Norm   │
  └────┬────┘
       ↓
  ┌───────────┐
  │ Attention │
  └─────┬─────┘
        ↓
       (+) ←──────── x (residual)
        ↓
  ┌─────────┐
  │  Norm   │
  └────┬────┘
       ↓
  ┌─────────┐
  │   FFN   │
  └────┬────┘
       ↓
      (+) ←──────── (from above)
       ↓
  output
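The block in the diagram maps to only a few lines of code. A sketch of a pre-norm transformer block, assuming PyTorch (d_model, n_heads, and d_ff are illustrative defaults):

import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm transformer block: normalize, apply sublayer, add the residual - twice."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First residual: x + Attention(Norm(x))
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Second residual: x + FFN(Norm(x))
        x = x + self.ffn(self.norm2(x))
        return x

x = torch.randn(2, 10, 512)   # (batch, seq, d_model)
y = PreNormBlock()(x)         # same shape as x; the residual stream flows through both additions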
Benefits
Why Residuals Enable Deep Learning
WHY RESIDUAL CONNECTIONS WORK:

GRADIENT FLOW WITHOUT RESIDUALS:
  ∂L/∂x₁ = ∂L/∂xₙ × ∂xₙ/∂xₙ₋₁ × ... × ∂x₂/∂x₁
  If each factor < 1: gradients vanish!
  If each factor > 1: gradients explode!
  Problem: can't train deep networks

WITH RESIDUALS:
  output = x + f(x)
  ∂output/∂x = 1 + ∂f/∂x
  The "1" ensures gradients always flow!
  Even if ∂f/∂x ≈ 0, the gradient is still ≈ 1

INTUITION:
  The layer doesn't need to learn the identity (which is hard) - the identity is FREE!
  The layer only learns what to CHANGE:
  - If the input is already good: f(x) ≈ 0
  - If a modification is needed: f(x) = the change

DEPTH COMPARISON:
  ┌───────────────┬──────────────────────┐
  │ Architecture  │ Max Trainable Depth  │
  ├───────────────┼──────────────────────┤
  │ Plain network │ ~20 layers           │
  │ ResNet        │ 1000+ layers         │
  │ Transformer   │ 100+ layers          │
  └───────────────┴──────────────────────┘

Without residuals, models with around 100 layers (e.g., GPT-3's 96) would be effectively untrainable.
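The "1 + ∂f/∂x" argument can be checked numerically. A toy sketch, assuming PyTorch (the 50-layer tanh stack is an arbitrary example, not a transformer), that compares the gradient reaching the input with and without skip connections:

import torch
import torch.nn as nn

depth, dim = 50, 64
layers = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]

def run(x, use_residual):
    for layer in layers:
        if use_residual:
            x = x + layer(x)   # identity path gives gradients a direct route back
        else:
            x = layer(x)       # gradients must pass through every layer's Jacobian
    return x

for use_residual in (False, True):
    x = torch.randn(1, dim, requires_grad=True)
    run(x, use_residual).sum().backward()
    print(f"residual={use_residual}: input grad norm = {x.grad.norm().item():.2e}")

With the plain stack, the printed gradient norm is typically vanishingly small after 50 layers; with residuals it stays at a usable scale, which is the "+1" term at work.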

Interactive Exercise

Understand Residual Learning

If a layer's ideal function is to leave input unchanged (identity), what would the residual function f(x) need to learn?

Pro Tips
  • Pre-norm (LN before sublayer) + residual is the most stable configuration
  • Residuals enable "layer dropping" (stochastic depth) during training for efficiency; see the sketch after this list
  • Some layers can effectively be "turned off" (f(x) ≈ 0)
  • The residual stream is where information accumulates across layers
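A sketch of the layer-dropping idea from the tips above (stochastic depth), assuming PyTorch; the class name and drop probability are illustrative:

import torch
import torch.nn as nn

class StochasticDepthResidual(nn.Module):
    """Residual wrapper that randomly skips its sublayer during training.

    Because the identity path is always present, skipping just means the
    residual contributes nothing for that step (f(x) -> 0).
    """
    def __init__(self, sublayer: nn.Module, drop_prob: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(1).item() < self.drop_prob:
            return x                     # layer "turned off": pure identity
        return x + self.sublayer(x)      # normal residual update

layer = StochasticDepthResidual(nn.Linear(512, 512), drop_prob=0.2)
layer.train()
out = layer(torch.randn(4, 512))

The full stochastic-depth method (Huang et al., 2016) also rescales the sublayer's output by its survival probability at inference time; this sketch omits that refinement.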

Related Terms