Transfer Learning

Beginner [2/5]
Knowledge transfer · Pre-train and fine-tune

Definition

Transfer learning is the practice of taking a model trained on one task and adapting it to a different but related task. Instead of training from scratch, you leverage knowledge already learned, dramatically reducing data and compute requirements.

Transfer learning is why you can fine-tune GPT or BERT on your specific task with just hundreds or thousands of labeled examples, rather than the billions of tokens consumed during pre-training.

Key Concepts

  • Pre-training: Initial training on large general dataset
  • Fine-tuning: Further training on specific task
  • Feature extraction: Using the pre-trained model as a fixed feature extractor (contrasted with fine-tuning in the sketch below)
  • Domain adaptation: Transferring between different domains
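
Fine-tuning and feature extraction differ only in which parameters receive gradient updates. A minimal sketch of the switch, assuming the Hugging Face transformers library and a two-class task (the checkpoint name and label count are illustrative):

from transformers import BertForSequenceClassification

# Load a pre-trained encoder plus a randomly initialized classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
)

# Feature extraction: freeze the pre-trained encoder so only the head trains
for param in model.bert.parameters():
    param.requires_grad = False

# Fine-tuning: skip the freezing loop above; everything stays trainable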

Examples

Overview
Transfer Learning Approaches
TRANSFER LEARNING STRATEGIES:

1. FEATURE EXTRACTION (freeze pre-trained)

┌─────────────────────────────┐
│ Pre-trained Model (frozen)  │ ← Weights fixed
│ Learned general features    │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│ New Classification Head     │ ← Only this trains
│ (task-specific)             │
└─────────────────────────────┘

2. FINE-TUNING (train entire model)

┌─────────────────────────────┐
│ Pre-trained Model           │ ← Low learning rate
│ (trainable)                 │
└──────────────┬──────────────┘
               │
┌──────────────▼──────────────┐
│ New Classification Head     │ ← Higher learning rate
│ (task-specific)             │
└─────────────────────────────┘

3. GRADUAL UNFREEZING

Epoch 1: Train only head
Epoch 2: Unfreeze top layers
Epoch 3: Unfreeze more layers
...
Final: Entire model trainable

BENEFITS:
- Less data needed (1000s vs millions)
- Faster training (hours vs weeks)
- Better performance on small datasets
- Leverages billions of $ of compute
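
Strategy 3 maps directly onto PyTorch's requires_grad flag. A hedged sketch, assuming a BERT-style encoder from transformers; train_one_epoch is a hypothetical stand-in for your usual training loop, and the unfreezing schedule is illustrative:

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
)
layers = model.bert.encoder.layer  # ModuleList of 12 transformer blocks

# Start fully frozen: only the classification head trains in epoch 0
for param in model.bert.parameters():
    param.requires_grad = False

num_epochs = 6  # illustrative
for epoch in range(num_epochs):
    n = min(len(layers), 2 * epoch)  # unfreeze 0, 2, 4, ... top layers
    if n:
        for layer in layers[-n:]:
            for param in layer.parameters():
                param.requires_grad = True
    train_one_epoch(model)  # hypothetical helper, not a library function
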
Example
Fine-tuning BERT for Sentiment
TRANSFER LEARNING WITH BERT:

PRE-TRAINING (done by researchers):
- Trained on: Wikipedia + Books (3B words)
- Task: Masked language modeling
- Compute: Days on TPU clusters
- Learns: General language understanding

YOUR FINE-TUNING:
- Your data: 5,000 movie reviews
- Task: Sentiment classification
- Compute: 30 minutes on single GPU
- Learns: Domain-specific patterns

# PyTorch example
from torch.optim import AdamW
from transformers import BertForSequenceClassification

# Load pre-trained BERT with a classification head
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2  # positive/negative
)

# Fine-tune on your data (train_loader yields tokenized batches)
optimizer = AdamW(model.parameters(), lr=2e-5)  # Low LR!
for epoch in range(3):  # Few epochs
    for batch in train_loader:
        optimizer.zero_grad()  # reset gradients each step
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Result: 93% accuracy with only 5,000 examples!
# From scratch would need 500,000+ examples

WHY THIS WORKS:
BERT already knows:
- Grammar and syntax
- Word relationships
- Sentiment words ("great", "terrible")
You just teach it YOUR task format
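
Once fine-tuned, prediction is a single forward pass. A minimal sketch, assuming model is the fine-tuned classifier from the example above and the matching tokenizer; the 0 = negative / 1 = positive mapping comes from your dataset, not the library:

import torch
from transformers import BertTokenizer

# `model` is the fine-tuned BertForSequenceClassification from above
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

model.eval()
inputs = tokenizer("A gripping film from start to finish.", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()  # assumes 0 = negative, 1 = positive
print('positive' if pred == 1 else 'negative')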

Interactive Exercise

Choose Your Approach

You have 500 legal documents to classify into 5 categories. Should you:

A) Train from scratch
B) Feature extraction only
C) Full fine-tuning

Justify your choice.

Pro Tips
  • More similar source/target domains → better transfer
  • Fine-tuning LR should be 10-100x smaller than pre-training (see the sketch after these tips)
  • With very little data, feature extraction may beat fine-tuning
  • Domain-specific pre-trained models (BioBERT, LegalBERT) transfer better
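
The learning-rate tip is often implemented with per-parameter-group learning rates, echoing the fine-tuning diagram above: a small LR for the pre-trained body, a larger one for the fresh head. A sketch, assuming the same BERT checkpoint; the exact values are illustrative:

from torch.optim import AdamW
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
)

# One optimizer, two parameter groups with different learning rates
optimizer = AdamW([
    {'params': model.bert.parameters(), 'lr': 2e-5},        # pre-trained body
    {'params': model.classifier.parameters(), 'lr': 1e-4},  # new head
])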
