
Softmax

Intermediate [3/5]
Also known as: softmax function, normalized exponential, soft argmax

Definition

Softmax is a mathematical function that converts a vector of raw scores (logits) into a probability distribution. It exponentiates each value and normalizes so all outputs sum to 1, making them interpretable as probabilities.

In language models, softmax converts the model's confidence scores into token probabilities for sampling.
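
A minimal sketch of the computation in Python (assuming NumPy is available; the function name and example logits here are illustrative):

    import numpy as np

    def softmax(logits):
        # Exponentiate each score, then divide by the sum so the result is a distribution
        exps = np.exp(logits - np.max(logits))  # subtracting the max avoids overflow
        return exps / exps.sum()

    probs = softmax(np.array([1.0, 2.0, 3.0]))
    print(probs, probs.sum())  # ~[0.09, 0.24, 0.67], sums to 1.0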

Key Concepts

  • Normalization: Outputs always sum to exactly 1.0
  • Amplification: Exponentiation widens gaps between values
  • Differentiable: Enables gradient-based training
  • Temperature scaling: Softmax(x/T) controls sharpness

Examples

Formula
Softmax Computation
SOFTMAX FORMULA:

    softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)

EXAMPLE:

    Logits: [2.0, 1.0, 0.1]

    Step 1: Exponentiate each value
        exp(2.0) = 7.39
        exp(1.0) = 2.72
        exp(0.1) = 1.11
        Sum      = 11.22

    Step 2: Normalize (divide by the sum)
        7.39 / 11.22 = 0.66  (66%)
        2.72 / 11.22 = 0.24  (24%)
        1.11 / 11.22 = 0.10  (10%)
                       ─────
        Total:         1.00  (100%)

    Probabilities: [0.66, 0.24, 0.10]

Note: the input [2.0, 1.0, 0.1] becomes much more skewed after softmax due to exponential amplification!
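
The same arithmetic can be checked step by step in code (a sketch assuming NumPy; the logits are the ones from the example above):

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])

    # Step 1: exponentiate each value
    exps = np.exp(logits)      # [7.389, 2.718, 1.105]
    total = exps.sum()         # ~11.21

    # Step 2: normalize (divide by the sum)
    probs = exps / total       # rounds to [0.66, 0.24, 0.10]
    print(probs, probs.sum())  # the probabilities sum to 1.0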
Temperature Effect
Softmax with Temperature
Logits: [2.0, 1.0, 0.5]

TEMPERATURE = 1.0 (standard):
    softmax([2.0, 1.0, 0.5]) = [0.63, 0.23, 0.14]

TEMPERATURE = 0.5 (sharper, more confident):
    softmax([2.0, 1.0, 0.5] / 0.5) = softmax([4.0, 2.0, 1.0])
                                   = [0.84, 0.11, 0.04]

TEMPERATURE = 2.0 (flatter, more uniform):
    softmax([2.0, 1.0, 0.5] / 2.0) = softmax([1.0, 0.5, 0.25])
                                   = [0.48, 0.29, 0.23]

TEMPERATURE → 0 (approaches one-hot/argmax):
    = [1.0, 0.0, 0.0]

TEMPERATURE → ∞ (approaches uniform):
    = [0.33, 0.33, 0.33]

Temperature controls the "softness" of softmax!
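
As a sketch (assuming NumPy; the temperatures and logits are those from the example), the table above can be reproduced by dividing the logits by T before applying softmax:

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.5])
    for T in [0.5, 1.0, 2.0]:
        print(T, softmax(logits / T))
    # T = 0.5 -> sharper:  ~[0.84, 0.11, 0.04]
    # T = 1.0 -> standard: ~[0.63, 0.23, 0.14]
    # T = 2.0 -> flatter:  ~[0.48, 0.29, 0.23]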

Interactive Exercise

Predict Softmax Output

Given logits [3.0, 3.0, 3.0], what will softmax output?

Hint: What happens when all inputs are equal?

Pro Tips
  • Numerical stability: subtract max(logits) from every logit before exponentiating; the output is unchanged and overflow is avoided
  • Equal logits → equal probabilities (uniform distribution)
  • Adding a constant to all logits doesn't change the softmax output
  • Log-softmax is more numerically stable for loss computation (both tricks are sketched below)
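
A sketch of the stability tricks above (assuming NumPy; the helper names are illustrative):

    import numpy as np

    def stable_softmax(logits):
        shifted = logits - np.max(logits)  # shifting by a constant leaves the output unchanged
        exps = np.exp(shifted)             # no overflow, even for very large logits
        return exps / exps.sum()

    def log_softmax(logits):
        shifted = logits - np.max(logits)
        return shifted - np.log(np.exp(shifted).sum())  # log of softmax, computed stably

    big = np.array([1000.0, 999.0, 998.0])
    print(stable_softmax(big))        # works; a naive np.exp(big) would overflow to inf
    print(stable_softmax(big + 5.0))  # identical output: adding a constant changes nothing
    print(log_softmax(big))           # stable log-probabilities for loss computation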

Related Terms