Transfer Learning

Beginner [2/5]
Knowledge transfer · Pre-train and fine-tune

Definition

Transfer learning is the practice of taking a model trained on one task and adapting it to a different but related task. Instead of training from scratch, you leverage knowledge already learned, dramatically reducing data and compute requirements.

Transfer learning is why you can fine-tune GPT or BERT on your specific task with just hundreds or thousands of labeled examples, rather than the billions of tokens consumed during pre-training.

Key Concepts

  • Pre-training: Initial training on large general dataset
  • Fine-tuning: Further training on specific task
  • Feature extraction: Using the pre-trained model as a fixed feature extractor (contrasted with fine-tuning in the sketch below)
  • Domain adaptation: Transferring between different domains
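
Fine-tuning and feature extraction differ only in which parameters receive gradient updates. A minimal sketch of the switch, assuming the Hugging Face transformers library and a two-class task (the checkpoint name and label count are illustrative):

from transformers import BertForSequenceClassification

# Load a pre-trained encoder plus a randomly initialized classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
)

# Feature extraction: freeze the pre-trained encoder so only the head trains
for param in model.bert.parameters():
    param.requires_grad = False

# Fine-tuning: skip the freezing loop above; everything stays trainable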

Examples

Overview
Transfer Learning Approaches
TRANSFER LEARNING STRATEGIES:

1. FEATURE EXTRACTION (freeze pre-trained)

┌─────────────────────────────┐
│ Pre-trained Model (frozen)  │ ← Weights fixed
│ Learned general features    │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│ New Classification Head     │ ← Only this trains
│ (task-specific)             │
└─────────────────────────────┘

2. FINE-TUNING (train entire model)

┌─────────────────────────────┐
│ Pre-trained Model           │ ← Low learning rate
│ (trainable)                 │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│ New Classification Head     │ ← Higher learning rate
│ (task-specific)             │
└─────────────────────────────┘

3. GRADUAL UNFREEZING

Epoch 1: Train only head
Epoch 2: Unfreeze top layers
Epoch 3: Unfreeze more layers
...
Final: Entire model trainable

BENEFITS:
- Less data needed (1000s vs millions)
- Faster training (hours vs weeks)
- Better performance on small datasets
- Leverages billions of $ of compute
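
Strategy 3 maps directly onto PyTorch's requires_grad flag. A hedged sketch, assuming a BERT-style encoder from transformers; train_one_epoch is a hypothetical stand-in for your usual training loop, and the unfreezing schedule is illustrative:

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
)
layers = model.bert.encoder.layer  # ModuleList of 12 transformer blocks

# Start fully frozen: only the classification head trains in epoch 0
for param in model.bert.parameters():
    param.requires_grad = False

num_epochs = 6  # illustrative
for epoch in range(num_epochs):
    n = min(len(layers), 2 * epoch)  # unfreeze 0, 2, 4, ... top layers
    if n:
        for layer in layers[-n:]:
            for param in layer.parameters():
                param.requires_grad = True
    train_one_epoch(model)  # hypothetical helper, not a library function
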
Example
Fine-tuning BERT for Sentiment
TRANSFER LEARNING WITH BERT:

PRE-TRAINING (done by researchers):
- Trained on: Wikipedia + Books (3B words)
- Task: Masked language modeling
- Compute: Days on TPU clusters
- Learns: General language understanding

YOUR FINE-TUNING:
- Your data: 5,000 movie reviews
- Task: Sentiment classification
- Compute: 30 minutes on single GPU
- Learns: Domain-specific patterns

# PyTorch example
from torch.optim import AdamW
from transformers import BertForSequenceClassification

# Load pre-trained BERT with a classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # positive/negative
)

# Fine-tune on your data (train_loader yields tokenized batches)
optimizer = AdamW(model.parameters(), lr=2e-5)  # Low LR!
for epoch in range(3):  # Few epochs
    for batch in train_loader:
        optimizer.zero_grad()  # reset gradients each step
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Result: 93% accuracy with only 5,000 examples!
# From scratch would need 500,000+ examples

WHY THIS WORKS:
BERT already knows:
- Grammar and syntax
- Word relationships
- Sentiment words ("great", "terrible")
You just teach it YOUR task format
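
Once fine-tuned, prediction is a single forward pass. A minimal sketch, assuming model is the fine-tuned classifier from the example above and the matching tokenizer; the 0 = negative / 1 = positive mapping comes from your dataset, not the library:

import torch
from transformers import BertTokenizer

# `model` is the fine-tuned BertForSequenceClassification from above
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model.eval()
inputs = tokenizer("A gripping film from start to finish.", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()  # assumes 0 = negative, 1 = positive
print('positive' if pred == 1 else 'negative')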

Interactive Exercise

Choose Your Approach

You have 500 legal documents to classify into 5 categories. Should you:

A) Train from scratch
B) Feature extraction only
C) Full fine-tuning

Justify your choice.

Pro Tips
  • More similar source/target domains → better transfer
  • Fine-tuning LR should be 10-100x smaller than pre-training (see the sketch after these tips)
  • With very little data, feature extraction may beat fine-tuning
  • Domain-specific pre-trained models (BioBERT, LegalBERT) transfer better
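
The learning-rate tip is often implemented with per-parameter-group learning rates, echoing the fine-tuning diagram above: a small LR for the pre-trained body, a larger one for the fresh head. A sketch, assuming the same BERT checkpoint; the exact values are illustrative:

from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
)

# One optimizer, two parameter groups with different learning rates
optimizer = AdamW([
    {'params': model.bert.parameters(), 'lr': 2e-5},        # pre-trained body
    {'params': model.classifier.parameters(), 'lr': 1e-4},  # new head
])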
