Model Architecture / Transformer Components

Residual Connection

Advanced [4/5]
Also known as: skip connection, shortcut connection, identity mapping

Definition

A residual connection adds the input of a layer directly to its output, allowing the network to learn residual functions instead of complete transformations. This enables training of very deep networks by providing gradient highways.

Residual connections are essential for training transformers with dozens or hundreds of layers.

Key Concepts

  • Identity shortcut: Input bypasses the layer and adds to output
  • Gradient flow: Gradients can flow directly through skip connection
  • Residual learning: Layer learns change (residual) rather than full mapping
  • Depth enabling: Makes very deep networks trainable
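The whole mechanism is a single addition around an arbitrary sublayer. A minimal sketch, assuming PyTorch (the wrapper name and the 512/2048 feed-forward sizes are illustrative, not taken from any particular model):

import torch
import torch.nn as nn

class ResidualWrapper(nn.Module):
    """Wraps any shape-preserving sublayer so it learns a residual: output = x + sublayer(x)."""
    def __init__(self, sublayer: nn.Module):
        super().__init__()
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The identity path carries x through unchanged; the sublayer only
        # has to learn what to ADD to it (the residual).
        return x + self.sublayer(x)

# Wrap a feed-forward sublayer; input and output shapes must match for the addition.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = ResidualWrapper(ffn)
out = block(torch.randn(8, 16, 512))   # (batch, seq, d_model) preserved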

Examples

Architecture
Residual Connection in Transformers
RESIDUAL CONNECTION STRUCTURE:

WITHOUT RESIDUAL:
  x → [Sublayer] → output          (learns the complete transformation)

WITH RESIDUAL:
  x → [Sublayer] → (+) → output
   ↘──────────────↗
     (skip connection)

  output = x + Sublayer(x)
  The layer learns the RESIDUAL: what to ADD to the input

TRANSFORMER BLOCK (residual around both attention and FFN):

  x (input)
      ↓
  ┌─────────┐
  │  Norm   │
  └────┬────┘
       ↓
  ┌───────────┐
  │ Attention │
  └─────┬─────┘
        ↓
       (+) ←──────── x (residual)
        ↓
  ┌─────────┐
  │  Norm   │
  └────┬────┘
       ↓
  ┌─────────┐
  │   FFN   │
  └────┬────┘
       ↓
      (+) ←──────── (from above)
       ↓
  output
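The block in the diagram maps to only a few lines of code. A sketch of a pre-norm transformer block, assuming PyTorch (d_model, n_heads, and d_ff are illustrative defaults):

import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm transformer block: normalize, apply sublayer, add the residual - twice."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # First residual: x + Attention(Norm(x))
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Second residual: x + FFN(Norm(x))
        x = x + self.ffn(self.norm2(x))
        return x

x = torch.randn(2, 10, 512)   # (batch, seq, d_model)
y = PreNormBlock()(x)         # same shape as x; the residual stream flows through both additions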
Benefits
Why Residuals Enable Deep Learning
WHY RESIDUAL CONNECTIONS WORK:

GRADIENT FLOW WITHOUT RESIDUALS:
  ∂L/∂x₁ = ∂L/∂xₙ × ∂xₙ/∂xₙ₋₁ × ... × ∂x₂/∂x₁
  If each factor < 1: gradients vanish!
  If each factor > 1: gradients explode!
  Problem: can't train deep networks

WITH RESIDUALS:
  output = x + f(x)
  ∂output/∂x = 1 + ∂f/∂x
  The "1" ensures gradients always flow!
  Even if ∂f/∂x ≈ 0, the gradient is still ≈ 1

INTUITION:
  The layer doesn't need to learn the identity (which is hard) - the identity is FREE!
  The layer only learns what to CHANGE:
  - If the input is already good: f(x) ≈ 0
  - If a modification is needed: f(x) = the change

DEPTH COMPARISON:
  ┌───────────────┬──────────────────────┐
  │ Architecture  │ Max Trainable Depth  │
  ├───────────────┼──────────────────────┤
  │ Plain network │ ~20 layers           │
  │ ResNet        │ 1000+ layers         │
  │ Transformer   │ 100+ layers          │
  └───────────────┴──────────────────────┘

Without residuals, models with around 100 layers (e.g., GPT-3's 96) would be effectively untrainable.
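The "1 + ∂f/∂x" argument can be checked numerically. A toy sketch, assuming PyTorch (the 50-layer tanh stack is an arbitrary example, not a transformer), that compares the gradient reaching the input with and without skip connections:

import torch
import torch.nn as nn

depth, dim = 50, 64
layers = [nn.Sequential(nn.Linear(dim, dim), nn.Tanh()) for _ in range(depth)]

def run(x, use_residual):
    for layer in layers:
        if use_residual:
            x = x + layer(x)   # identity path gives gradients a direct route back
        else:
            x = layer(x)       # gradients must pass through every layer's Jacobian
    return x

for use_residual in (False, True):
    x = torch.randn(1, dim, requires_grad=True)
    run(x, use_residual).sum().backward()
    print(f"residual={use_residual}: input grad norm = {x.grad.norm().item():.2e}")

With the plain stack, the printed gradient norm is typically vanishingly small after 50 layers; with residuals it stays at a usable scale, which is the "+1" term at work.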

Interactive Exercise

Understand Residual Learning

If a layer's ideal function is to leave input unchanged (identity), what would the residual function f(x) need to learn?

Pro Tips
  • Pre-norm (LN before sublayer) + residual is the most stable configuration
  • Residuals enable "layer dropping" (stochastic depth) during training for efficiency; see the sketch after this list
  • Some layers can effectively be "turned off" (f(x) ≈ 0)
  • The residual stream is where information accumulates across layers
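A sketch of the layer-dropping idea from the tips above (stochastic depth), assuming PyTorch; the class name and drop probability are illustrative:

import torch
import torch.nn as nn

class StochasticDepthResidual(nn.Module):
    """Residual wrapper that randomly skips its sublayer during training.

    Because the identity path is always present, skipping just means the
    residual contributes nothing for that step (f(x) -> 0).
    """
    def __init__(self, sublayer: nn.Module, drop_prob: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(1).item() < self.drop_prob:
            return x                     # layer "turned off": pure identity
        return x + self.sublayer(x)      # normal residual update

layer = StochasticDepthResidual(nn.Linear(512, 512), drop_prob=0.2)
layer.train()
out = layer(torch.randn(4, 512))

The full stochastic-depth method (Huang et al., 2016) also rescales the sublayer's output by its survival probability at inference time; this sketch omits that refinement.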

Related Terms