Training / Optimization

Loss Function

Intermediate [3/5]
Also known as: cost function, objective function, error function

Definition

A loss function measures how wrong a model's predictions are compared to the true values. Training minimizes this loss, guiding the model toward better predictions. The choice of loss function shapes what the model learns to optimize.

Loss functions are the feedback signal that enables neural networks to learn from data.

Key Concepts

  • Differentiable: Must support gradient computation for backpropagation
  • Task-specific: Different tasks need different loss functions
  • Minimization: Training drives the loss as low as possible; for many losses, zero means a perfect fit
  • Aggregation: Typically averaged across batch samples (see the sketch below)
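
A quick illustration of the last two points: framework loss functions typically reduce per-sample losses to a scalar (the batch mean by default) and are differentiable, so gradients flow from that scalar back through the model. A minimal PyTorch sketch with dummy data:

    import torch
    import torch.nn.functional as F

    preds = torch.randn(4, requires_grad=True)  # dummy predictions
    targets = torch.randn(4)                    # dummy ground truth

    # reduction="none" keeps one loss value per sample...
    per_sample = F.mse_loss(preds, targets, reduction="none")

    # ...while the default reduction="mean" averages across the batch
    batch_loss = F.mse_loss(preds, targets)
    assert torch.isclose(batch_loss, per_sample.mean())

    # Differentiable: backprop computes d(loss)/d(prediction)
    batch_loss.backward()
    print(preds.grad)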

Examples

Common Types
Loss Functions by Task
COMMON LOSS FUNCTIONS:

1. CROSS-ENTROPY (Classification, LLMs)
   Loss = -Σ y_true × log(y_pred)
   Used for: Next token prediction, classification
   Perfect prediction → Loss = 0
   Wrong prediction → Loss → ∞

2. MEAN SQUARED ERROR (Regression)
   Loss = (1/n) × Σ (y_true - y_pred)²
   Used for: Continuous values, embeddings
   Penalizes large errors more heavily

3. BINARY CROSS-ENTROPY (Binary Classification)
   Loss = -[y × log(p) + (1-y) × log(1-p)]
   Used for: Yes/no decisions, multi-label

4. CONTRASTIVE LOSS (Embeddings)
   Pulls similar pairs together, pushes different pairs apart
   Used for: Sentence embeddings, retrieval

5. KL DIVERGENCE (Distribution Matching)
   Measures the difference between two probability distributions
   Used for: VAEs, knowledge distillation
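
All five have ready-made PyTorch equivalents. A minimal sketch showing each on dummy tensors (shapes and values are made up purely for illustration; F.cosine_embedding_loss stands in here for the broader contrastive family):

    import torch
    import torch.nn.functional as F

    # 1. Cross-entropy: raw logits + integer class targets
    logits = torch.randn(4, 10)            # 4 samples, 10 classes
    classes = torch.randint(0, 10, (4,))
    ce = F.cross_entropy(logits, classes)

    # 2. Mean squared error: continuous predictions vs. targets
    mse = F.mse_loss(torch.randn(4), torch.randn(4))

    # 3. Binary cross-entropy on logits (numerically stabler than
    #    applying sigmoid first)
    bce = F.binary_cross_entropy_with_logits(
        torch.randn(4), torch.randint(0, 2, (4,)).float())

    # 4. Contrastive-style: target 1 pulls a pair together, -1 pushes apart
    a, b = torch.randn(4, 32), torch.randn(4, 32)
    pair = F.cosine_embedding_loss(a, b, torch.tensor([1, 1, -1, -1]))

    # 5. KL divergence: F.kl_div expects log-probabilities as input
    #    and probabilities as target
    student = F.log_softmax(torch.randn(4, 10), dim=-1)
    teacher = F.softmax(torch.randn(4, 10), dim=-1)
    kl = F.kl_div(student, teacher, reduction="batchmean")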
Example
LLM Training Loss
LANGUAGE MODEL TRAINING:

Input:  "The cat sat on the"
Target: "mat"

Model predicts a probability distribution:

┌─────────┬─────────────┐
│ Token   │ Probability │
├─────────┼─────────────┤
│ mat     │ 0.60        │ ← target
│ floor   │ 0.20        │
│ chair   │ 0.10        │
│ dog     │ 0.05        │
│ ...     │ 0.05        │
└─────────┴─────────────┘

Cross-Entropy Loss:
Loss = -log(0.60) = 0.51

If the model predicted "mat" with 0.99:
Loss = -log(0.99) = 0.01 (much better!)

If the model predicted "mat" with 0.01:
Loss = -log(0.01) = 4.61 (terrible!)

# PyTorch implementation
import torch.nn.functional as F

logits = model(input_ids)  # [batch, seq_len, vocab_size]
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    target_ids.view(-1)
)
loss.backward()   # Compute gradients
optimizer.step()  # Update weights
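
You can verify the arithmetic above in a few lines (the probabilities are the made-up values from the table, not real model output):

    import math

    for p in (0.60, 0.99, 0.01):
        print(f"p(mat) = {p:.2f}  ->  loss = {-math.log(p):.2f}")
    # p(mat) = 0.60  ->  loss = 0.51
    # p(mat) = 0.99  ->  loss = 0.01
    # p(mat) = 0.01  ->  loss = 4.61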

Interactive Exercise

Match Loss to Task

Which loss function would you use for each task?

A. Predicting house prices
B. Spam detection (spam/not spam)
C. Language model next token prediction
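
To check your matches, here's a minimal PyTorch sketch pairing each task with the loss typically chosen for it (all tensors are dummy placeholders, and the vocabulary size of 50000 is an arbitrary assumption):

    import torch
    import torch.nn as nn

    # A. House prices: regression on continuous values -> MSE
    price_loss = nn.MSELoss()(torch.randn(8), torch.randn(8))

    # B. Spam detection: binary classification -> binary cross-entropy
    spam_loss = nn.BCEWithLogitsLoss()(
        torch.randn(8), torch.randint(0, 2, (8,)).float())

    # C. Next-token prediction: multi-class over the vocabulary -> cross-entropy
    token_loss = nn.CrossEntropyLoss()(
        torch.randn(8, 50000), torch.randint(0, 50000, (8,)))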

Pro Tips
  • Loss landscape visualization reveals optimization difficulty
  • Combining multiple losses (multi-task) requires balancing weights (see the sketch after these tips)
  • Loss not decreasing? Check learning rate, data, or architecture
  • Validation loss rising while train loss falls = overfitting
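
On the multi-task tip: a weighted sum is the usual starting point, as in this sketch (the heads, shapes, and the weights 1.0 / 0.5 are arbitrary illustrations, not recommendations):

    import torch
    import torch.nn as nn

    # Hypothetical shared features feeding two task heads
    features = torch.randn(8, 16, requires_grad=True)
    class_head, reg_head = nn.Linear(16, 3), nn.Linear(16, 1)

    class_loss = nn.CrossEntropyLoss()(
        class_head(features), torch.randint(0, 3, (8,)))
    reg_loss = nn.MSELoss()(
        reg_head(features).squeeze(-1), torch.randn(8))

    # The weights are hyperparameters to tune; poorly balanced weights
    # let one task dominate the shared gradients
    total_loss = 1.0 * class_loss + 0.5 * reg_loss
    total_loss.backward()  # gradients mix both objectives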

Related Terms