Training / Methodology

Pre-training

Difficulty: Beginner (2/5)
Also known as: Pre-train, Base model training, Foundation training

Definition

Pre-training is the initial phase of training a large language model on massive amounts of text data using self-supervised learning. The model learns general language understanding, world knowledge, and reasoning patterns that can later be fine-tuned for specific tasks.

Pre-training is what transforms random neural network weights into a capable language model.

Key Concepts

  • Self-supervised: Labels come from the data itself (next-token prediction; see the sketch after this list)
  • Scale: Trained on trillions of tokens from internet text
  • Compute-intensive: Requires thousands of GPUs for weeks/months
  • Foundation: Creates base model for fine-tuning
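
To make the self-supervised idea concrete, here is a minimal sketch of how next-token labels fall out of the raw token stream with no human annotation; the token IDs are invented for illustration:

```python
# Minimal sketch: in causal language modeling, the target for each
# position is simply the next token in the sequence, so the labels
# come for free from the data itself.
token_ids = [15496, 995, 11, 703, 389, 345, 30]  # hypothetical token IDs

# Inputs are all tokens except the last; targets are the same
# sequence shifted left by one position.
inputs = token_ids[:-1]   # [15496, 995, 11, 703, 389, 345]
targets = token_ids[1:]   # [995, 11, 703, 389, 345, 30]

for x, y in zip(inputs, targets):
    print(f"given token {x:>6}, predict token {y}")
```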

Examples

Process
How Pre-training Works
PRE-TRAINING PIPELINE:

1. DATA COLLECTION
   Sources: Web (Common Crawl), Books, Wikipedia, Code (GitHub), Academic papers
   Size: 1-15+ trillion tokens
   Filtering: Remove duplicates, low-quality content, PII

2. TOKENIZATION
   Text → Tokens using BPE/WordPiece
   "Hello world" → [15496, 995]

3. TRAINING OBJECTIVE

   CAUSAL LM (GPT-style):
   Input:  "The cat sat on the ___"
   Target: "mat"
   Model predicts the next token, then updates its weights

   MASKED LM (BERT-style):
   Input:  "The [MASK] sat on the mat"
   Target: "cat"
   Model predicts the masked tokens

4. TRAINING LOOP
   for step in range(num_steps):   # enough steps to cover trillions of tokens
       batch = sample_random_text_chunks()
       loss = model(batch)
       optimizer.zero_grad()
       loss.backward()
       optimizer.step()

5. OUTPUT
   Base model with general language capabilities

SCALE OF PRE-TRAINING:
┌─────────────┬────────────┬────────────┬───────────┐
│ Model       │ Params     │ Tokens     │ Compute   │
├─────────────┼────────────┼────────────┼───────────┤
│ BERT        │ 340M       │ 3.3B       │ 4 TPU-day │
│ GPT-3       │ 175B       │ 300B       │ ~$4.6M    │
│ LLaMA 2     │ 70B        │ 2T         │ ~$10M+    │
│ GPT-4       │ ~1T?       │ ~10T?      │ ~$100M?   │
└─────────────┴────────────┴────────────┴───────────┘
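
The training loop above is pseudocode. Below is a self-contained sketch of one causal-LM training step in PyTorch, tokenizing with the tiktoken library's GPT-2 BPE (which reproduces the "Hello world" → [15496, 995] example above). The tiny embedding-plus-linear model is a stand-in for a real Transformer; the shift-by-one cross-entropy loss, however, is the same objective used at scale:

```python
import tiktoken
import torch
import torch.nn as nn
import torch.nn.functional as F

# GPT-2's BPE tokenizer: "Hello world" → [15496, 995]
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("The cat sat on the mat")

# Tiny stand-in model (embedding + linear head). A real pre-training
# run uses a deep Transformer, but the loss computation is identical.
vocab_size, d_model = enc.n_vocab, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One training step: predict token t+1 from token t.
ids = torch.tensor(tokens).unsqueeze(0)    # shape (1, seq_len)
inputs, targets = ids[:, :-1], ids[:, 1:]  # shift by one position

logits = model(inputs)                     # (1, seq_len-1, vocab_size)
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),        # flatten positions
    targets.reshape(-1),
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"step loss: {loss.item():.3f}")
```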
What's Learned
Capabilities from Pre-training
WHAT PRE-TRAINING TEACHES:

1. LANGUAGE STRUCTURE
   - Grammar and syntax
   - Word relationships
   - Sentence structure
   "The dog [MASK] the ball" → "chased", not "chased the"

2. WORLD KNOWLEDGE
   - Facts: "The capital of France is [MASK]" → "Paris"
   - History: "World War II ended in [MASK]" → "1945"
   - Science: "Water boils at [MASK] degrees" → "100"

3. REASONING PATTERNS
   - If-A-then-B patterns
   - Analogical reasoning
   - Basic logic

4. STYLE AND TONE
   - Formal vs. informal language
   - Domain-specific vocabulary
   - Cultural context

5. CODE UNDERSTANDING
   - Syntax of programming languages
   - Common patterns and idioms
   - Documentation conventions

EMERGENT ABILITIES (appear at scale):
- Few-shot learning
- Chain-of-thought reasoning
- Basic instruction following
- Zero-shot translation
- Summarization

WHAT'S NOT LEARNED:
- Reliable instruction following (needs fine-tuning)
- Safety/alignment (needs RLHF)
- Task-specific performance (needs fine-tuning)
- Recent information (knowledge cutoff)
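
The masked-LM fill-ins above can be tried against an off-the-shelf pre-trained checkpoint. Here is a small sketch using the Hugging Face transformers fill-mask pipeline with bert-base-uncased (one such base model; the exact completions and scores depend on the checkpoint):

```python
from transformers import pipeline

# BERT-style masked LM: the model fills in [MASK] purely from what it
# absorbed during pre-training, with no task-specific fine-tuning.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prompt in [
    "The capital of France is [MASK].",
    "Water boils at [MASK] degrees.",
]:
    top = unmasker(prompt)[0]  # highest-scoring completion
    print(f"{prompt!r} -> {top['token_str']} (p={top['score']:.2f})")
```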

Interactive Exercise

Pre-training vs Fine-tuning

Why can't we just fine-tune from random weights instead of pre-training first? What would happen?

Pro Tips
  • Pre-training data quality matters more than quantity
  • Data deduplication prevents memorization artifacts (see the sketch after this list)
  • Curriculum learning (easy to hard) can improve efficiency
  • Multi-epoch training risks memorization; most use 1-2 epochs
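
As referenced in the deduplication tip, here is a minimal sketch of exact-match deduplication over a corpus. Production pipelines add near-duplicate detection (e.g. MinHash), and the whitespace/case normalization below is just an illustrative choice:

```python
import hashlib

def dedupe(documents):
    """Drop exact duplicates so the model doesn't memorize repeats."""
    seen, unique = set(), []
    for doc in documents:
        # Hash normalized text so trivial whitespace/case differences
        # don't defeat the check.
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["The cat sat.", "the  cat sat.", "A different document."]
print(dedupe(corpus))  # ['The cat sat.', 'A different document.']
```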

Related Terms