Training / Optimization

Loss Function

Intermediate [3/5]
Also known as: cost function, objective function, error function

Definition

A loss function measures how wrong a model's predictions are compared to the true values. Training minimizes this loss, guiding the model toward better predictions. The choice of loss function shapes what the model learns to optimize.

Loss functions are the feedback signal that enables neural networks to learn from data.

Key Concepts

  • Differentiable: Must support gradient computation for backpropagation
  • Task-specific: Different tasks need different loss functions
  • Minimization: Training drives the loss as low as possible; for many losses, zero means a perfect fit
  • Aggregation: Typically averaged across batch samples (see the sketch below)
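
A quick illustration of the last two points: framework loss functions typically reduce per-sample losses to a scalar (the batch mean by default) and are differentiable, so gradients flow from that scalar back through the model. A minimal PyTorch sketch with dummy data:

    import torch
    import torch.nn.functional as F

    preds = torch.randn(4, requires_grad=True)  # dummy predictions
    targets = torch.randn(4)                    # dummy ground truth

    # reduction="none" keeps one loss value per sample...
    per_sample = F.mse_loss(preds, targets, reduction="none")

    # ...while the default reduction="mean" averages across the batch
    batch_loss = F.mse_loss(preds, targets)
    assert torch.isclose(batch_loss, per_sample.mean())

    # Differentiable: backprop computes d(loss)/d(prediction)
    batch_loss.backward()
    print(preds.grad)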

Examples

Common Types
Loss Functions by Task
COMMON LOSS FUNCTIONS:

1. CROSS-ENTROPY (Classification, LLMs)
   Loss = -Σ y_true × log(y_pred)
   Used for: Next token prediction, classification
   Perfect prediction → Loss = 0
   Wrong prediction → Loss → ∞

2. MEAN SQUARED ERROR (Regression)
   Loss = (1/n) × Σ (y_true - y_pred)²
   Used for: Continuous values, embeddings
   Penalizes large errors more heavily

3. BINARY CROSS-ENTROPY (Binary Classification)
   Loss = -[y × log(p) + (1-y) × log(1-p)]
   Used for: Yes/no decisions, multi-label

4. CONTRASTIVE LOSS (Embeddings)
   Pulls similar pairs together, pushes different pairs apart
   Used for: Sentence embeddings, retrieval

5. KL DIVERGENCE (Distribution Matching)
   Measures the difference between two probability distributions
   Used for: VAEs, knowledge distillation
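
All five have ready-made PyTorch equivalents. A minimal sketch showing each on dummy tensors (shapes and values are made up purely for illustration; F.cosine_embedding_loss stands in here for the broader contrastive family):

    import torch
    import torch.nn.functional as F

    # 1. Cross-entropy: raw logits + integer class targets
    logits = torch.randn(4, 10)            # 4 samples, 10 classes
    classes = torch.randint(0, 10, (4,))
    ce = F.cross_entropy(logits, classes)

    # 2. Mean squared error: continuous predictions vs. targets
    mse = F.mse_loss(torch.randn(4), torch.randn(4))

    # 3. Binary cross-entropy on logits (numerically stabler than
    #    applying sigmoid first)
    bce = F.binary_cross_entropy_with_logits(
        torch.randn(4), torch.randint(0, 2, (4,)).float())

    # 4. Contrastive-style: target 1 pulls a pair together, -1 pushes apart
    a, b = torch.randn(4, 32), torch.randn(4, 32)
    pair = F.cosine_embedding_loss(a, b, torch.tensor([1, 1, -1, -1]))

    # 5. KL divergence: F.kl_div expects log-probabilities as input
    #    and probabilities as target
    student = F.log_softmax(torch.randn(4, 10), dim=-1)
    teacher = F.softmax(torch.randn(4, 10), dim=-1)
    kl = F.kl_div(student, teacher, reduction="batchmean")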
Example
LLM Training Loss
LANGUAGE MODEL TRAINING:

Input:  "The cat sat on the"
Target: "mat"

Model predicts a probability distribution:

┌─────────┬─────────────┐
│ Token   │ Probability │
├─────────┼─────────────┤
│ mat     │ 0.60        │ ← target
│ floor   │ 0.20        │
│ chair   │ 0.10        │
│ dog     │ 0.05        │
│ ...     │ 0.05        │
└─────────┴─────────────┘

Cross-Entropy Loss:
Loss = -log(0.60) = 0.51

If the model predicted "mat" with 0.99:
Loss = -log(0.99) = 0.01 (much better!)

If the model predicted "mat" with 0.01:
Loss = -log(0.01) = 4.61 (terrible!)

# PyTorch implementation
import torch.nn.functional as F

logits = model(input_ids)  # [batch, seq_len, vocab_size]
loss = F.cross_entropy(
    logits.view(-1, vocab_size),
    target_ids.view(-1)
)
loss.backward()   # Compute gradients
optimizer.step()  # Update weights
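
You can verify the arithmetic above in a few lines (the probabilities are the made-up values from the table, not real model output):

    import math

    for p in (0.60, 0.99, 0.01):
        print(f"p(mat) = {p:.2f}  ->  loss = {-math.log(p):.2f}")
    # p(mat) = 0.60  ->  loss = 0.51
    # p(mat) = 0.99  ->  loss = 0.01
    # p(mat) = 0.01  ->  loss = 4.61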

Interactive Exercise

Match Loss to Task

Which loss function would you use for each task?

A. Predicting house prices
B. Spam detection (spam/not spam)
C. Language model next token prediction
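
To check your matches, here's a minimal PyTorch sketch pairing each task with the loss typically chosen for it (all tensors are dummy placeholders, and the vocabulary size of 50000 is an arbitrary assumption):

    import torch
    import torch.nn as nn

    # A. House prices: regression on continuous values -> MSE
    price_loss = nn.MSELoss()(torch.randn(8), torch.randn(8))

    # B. Spam detection: binary classification -> binary cross-entropy
    spam_loss = nn.BCEWithLogitsLoss()(
        torch.randn(8), torch.randint(0, 2, (8,)).float())

    # C. Next-token prediction: multi-class over the vocabulary -> cross-entropy
    token_loss = nn.CrossEntropyLoss()(
        torch.randn(8, 50000), torch.randint(0, 50000, (8,)))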

Pro Tips
  • Loss landscape visualization reveals optimization difficulty
  • Combining multiple losses (multi-task) requires balancing weights (see the sketch after these tips)
  • Loss not decreasing? Check learning rate, data, or architecture
  • Validation loss rising while train loss falls = overfitting
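
On the multi-task tip: a weighted sum is the usual starting point, as in this sketch (the heads, shapes, and the weights 1.0 / 0.5 are arbitrary illustrations, not recommendations):

    import torch
    import torch.nn as nn

    # Hypothetical shared features feeding two task heads
    features = torch.randn(8, 16, requires_grad=True)
    class_head, reg_head = nn.Linear(16, 3), nn.Linear(16, 1)

    class_loss = nn.CrossEntropyLoss()(
        class_head(features), torch.randint(0, 3, (8,)))
    reg_loss = nn.MSELoss()(
        reg_head(features).squeeze(-1), torch.randn(8))

    # The weights are hyperparameters to tune; poorly balanced weights
    # let one task dominate the shared gradients
    total_loss = 1.0 * class_loss + 0.5 * reg_loss
    total_loss.backward()  # gradients mix both objectives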

Related Terms