Training / Methodology

Pre-training

Difficulty: Beginner (2/5)
Also known as: Pre-train, Base model training, Foundation training

Definition

Pre-training is the initial phase of training a large language model on massive amounts of text data using self-supervised learning. The model learns general language understanding, world knowledge, and reasoning patterns that can later be fine-tuned for specific tasks.

Pre-training is what transforms random neural network weights into a capable language model.

Key Concepts

  • Self-supervised: Labels come from the data itself (next-token prediction; see the sketch after this list)
  • Scale: Trained on trillions of tokens from internet text
  • Compute-intensive: Requires thousands of GPUs for weeks/months
  • Foundation: Creates base model for fine-tuning
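
To make the self-supervised idea concrete, here is a minimal sketch of how next-token labels fall out of the raw token stream with no human annotation; the token IDs are invented for illustration:

```python
# Minimal sketch: in causal language modeling, the target for each
# position is simply the next token in the sequence, so the labels
# come for free from the data itself.
token_ids = [15496, 995, 11, 703, 389, 345, 30]  # hypothetical token IDs

# Inputs are all tokens except the last; targets are the same
# sequence shifted left by one position.
inputs = token_ids[:-1]   # [15496, 995, 11, 703, 389, 345]
targets = token_ids[1:]   # [995, 11, 703, 389, 345, 30]

for x, y in zip(inputs, targets):
    print(f"given token {x:>6}, predict token {y}")
```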

Examples

Process
How Pre-training Works
PRE-TRAINING PIPELINE:

1. DATA COLLECTION
   Sources: Web (Common Crawl), Books, Wikipedia, Code (GitHub), Academic papers
   Size: 1-15+ trillion tokens
   Filtering: Remove duplicates, low-quality content, PII

2. TOKENIZATION
   Text → Tokens using BPE/WordPiece
   "Hello world" → [15496, 995]

3. TRAINING OBJECTIVE

   CAUSAL LM (GPT-style):
   Input:  "The cat sat on the ___"
   Target: "mat"
   Model predicts the next token, then updates its weights

   MASKED LM (BERT-style):
   Input:  "The [MASK] sat on the mat"
   Target: "cat"
   Model predicts the masked tokens

4. TRAINING LOOP
   for step in range(num_steps):   # enough steps to cover trillions of tokens
       batch = sample_random_text_chunks()
       loss = model(batch)
       optimizer.zero_grad()
       loss.backward()
       optimizer.step()

5. OUTPUT
   Base model with general language capabilities

SCALE OF PRE-TRAINING:
┌─────────────┬────────────┬────────────┬───────────┐
│ Model       │ Params     │ Tokens     │ Compute   │
├─────────────┼────────────┼────────────┼───────────┤
│ BERT        │ 340M       │ 3.3B       │ 4 TPU-day │
│ GPT-3       │ 175B       │ 300B       │ ~$4.6M    │
│ LLaMA 2     │ 70B        │ 2T         │ ~$10M+    │
│ GPT-4       │ ~1T?       │ ~10T?      │ ~$100M?   │
└─────────────┴────────────┴────────────┴───────────┘
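
The training loop above is pseudocode. Below is a self-contained sketch of one causal-LM training step in PyTorch, tokenizing with the tiktoken library's GPT-2 BPE (which reproduces the "Hello world" → [15496, 995] example above). The tiny embedding-plus-linear model is a stand-in for a real Transformer; the shift-by-one cross-entropy loss, however, is the same objective used at scale:

```python
import tiktoken
import torch
import torch.nn as nn
import torch.nn.functional as F

# GPT-2's BPE tokenizer: "Hello world" → [15496, 995]
enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("The cat sat on the mat")

# Tiny stand-in model (embedding + linear head). A real pre-training
# run uses a deep Transformer, but the loss computation is identical.
vocab_size, d_model = enc.n_vocab, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One training step: predict token t+1 from token t.
ids = torch.tensor(tokens).unsqueeze(0)    # shape (1, seq_len)
inputs, targets = ids[:, :-1], ids[:, 1:]  # shift by one position

logits = model(inputs)                     # (1, seq_len-1, vocab_size)
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),        # flatten positions
    targets.reshape(-1),
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"step loss: {loss.item():.3f}")
```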
What's Learned
Capabilities from Pre-training
WHAT PRE-TRAINING TEACHES:

1. LANGUAGE STRUCTURE
   - Grammar and syntax
   - Word relationships
   - Sentence structure
   "The dog [MASK] the ball" → "chased", not "chased the"

2. WORLD KNOWLEDGE
   - Facts: "The capital of France is [MASK]" → "Paris"
   - History: "World War II ended in [MASK]" → "1945"
   - Science: "Water boils at [MASK] degrees" → "100"

3. REASONING PATTERNS
   - If-A-then-B patterns
   - Analogical reasoning
   - Basic logic

4. STYLE AND TONE
   - Formal vs. informal language
   - Domain-specific vocabulary
   - Cultural context

5. CODE UNDERSTANDING
   - Syntax of programming languages
   - Common patterns and idioms
   - Documentation conventions

EMERGENT ABILITIES (appear at scale):
- Few-shot learning
- Chain-of-thought reasoning
- Basic instruction following
- Zero-shot translation
- Summarization

WHAT'S NOT LEARNED:
- Reliable instruction following (needs fine-tuning)
- Safety/alignment (needs RLHF)
- Task-specific performance (needs fine-tuning)
- Recent information (knowledge cutoff)
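
The masked-LM fill-ins above can be tried against an off-the-shelf pre-trained checkpoint. Here is a small sketch using the Hugging Face transformers fill-mask pipeline with bert-base-uncased (one such base model; the exact completions and scores depend on the checkpoint):

```python
from transformers import pipeline

# BERT-style masked LM: the model fills in [MASK] purely from what it
# absorbed during pre-training, with no task-specific fine-tuning.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prompt in [
    "The capital of France is [MASK].",
    "Water boils at [MASK] degrees.",
]:
    top = unmasker(prompt)[0]  # highest-scoring completion
    print(f"{prompt!r} -> {top['token_str']} (p={top['score']:.2f})")
```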

Interactive Exercise

Pre-training vs Fine-tuning

Why can't we just fine-tune from random weights instead of pre-training first? What would happen?

Pro Tips
  • Pre-training data quality matters more than quantity
  • Data deduplication prevents memorization artifacts (see the sketch after this list)
  • Curriculum learning (easy to hard) can improve efficiency
  • Multi-epoch training risks memorization; most use 1-2 epochs
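
As referenced in the deduplication tip, here is a minimal sketch of exact-match deduplication over a corpus. Production pipelines add near-duplicate detection (e.g. MinHash), and the whitespace/case normalization below is just an illustrative choice:

```python
import hashlib

def dedupe(documents):
    """Drop exact duplicates so the model doesn't memorize repeats."""
    seen, unique = set(), []
    for doc in documents:
        # Hash normalized text so trivial whitespace/case differences
        # don't defeat the check.
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = ["The cat sat.", "the  cat sat.", "A different document."]
print(dedupe(corpus))  # ['The cat sat.', 'A different document.']
```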

Related Terms