
Softmax

Intermediate [3/5]
Also known as: softmax function, normalized exponential, soft argmax

Definition

Softmax is a mathematical function that converts a vector of raw scores (logits) into a probability distribution. It exponentiates each value and normalizes so all outputs sum to 1, making them interpretable as probabilities.

In language models, softmax converts the model's confidence scores into token probabilities for sampling.
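
A minimal sketch of the computation in Python (assuming NumPy is available; the function name and example logits here are illustrative):

    import numpy as np

    def softmax(logits):
        # Exponentiate each score, then divide by the sum so the result is a distribution
        exps = np.exp(logits - np.max(logits))  # subtracting the max avoids overflow
        return exps / exps.sum()

    probs = softmax(np.array([1.0, 2.0, 3.0]))
    print(probs, probs.sum())  # ~[0.09, 0.24, 0.67], sums to 1.0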

Key Concepts

  • Normalization: Outputs always sum to exactly 1.0
  • Amplification: Exponentiation widens gaps between values
  • Differentiable: Enables gradient-based training
  • Temperature scaling: Softmax(x/T) controls sharpness

Examples

Formula
Softmax Computation
SOFTMAX FORMULA:

    softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)

EXAMPLE:

    Logits: [2.0, 1.0, 0.1]

    Step 1: Exponentiate each value
        exp(2.0) = 7.39
        exp(1.0) = 2.72
        exp(0.1) = 1.11
        Sum      = 11.22

    Step 2: Normalize (divide by the sum)
        7.39 / 11.22 = 0.66  (66%)
        2.72 / 11.22 = 0.24  (24%)
        1.11 / 11.22 = 0.10  (10%)
                       ─────
        Total:         1.00  (100%)

    Probabilities: [0.66, 0.24, 0.10]

Note: the input [2.0, 1.0, 0.1] becomes much more skewed after softmax due to exponential amplification!
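
The same arithmetic can be checked step by step in code (a sketch assuming NumPy; the logits are the ones from the example above):

    import numpy as np

    logits = np.array([2.0, 1.0, 0.1])

    # Step 1: exponentiate each value
    exps = np.exp(logits)      # [7.389, 2.718, 1.105]
    total = exps.sum()         # ~11.21

    # Step 2: normalize (divide by the sum)
    probs = exps / total       # rounds to [0.66, 0.24, 0.10]
    print(probs, probs.sum())  # the probabilities sum to 1.0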
Temperature Effect
Softmax with Temperature
Logits: [2.0, 1.0, 0.5]

TEMPERATURE = 1.0 (standard):
    softmax([2.0, 1.0, 0.5]) = [0.63, 0.23, 0.14]

TEMPERATURE = 0.5 (sharper, more confident):
    softmax([2.0, 1.0, 0.5] / 0.5) = softmax([4.0, 2.0, 1.0])
                                   = [0.84, 0.11, 0.04]

TEMPERATURE = 2.0 (flatter, more uniform):
    softmax([2.0, 1.0, 0.5] / 2.0) = softmax([1.0, 0.5, 0.25])
                                   = [0.48, 0.29, 0.23]

TEMPERATURE → 0 (approaches one-hot/argmax):
    = [1.0, 0.0, 0.0]

TEMPERATURE → ∞ (approaches uniform):
    = [0.33, 0.33, 0.33]

Temperature controls the "softness" of softmax!
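
As a sketch (assuming NumPy; the temperatures and logits are those from the example), the table above can be reproduced by dividing the logits by T before applying softmax:

    import numpy as np

    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.5])
    for T in [0.5, 1.0, 2.0]:
        print(T, softmax(logits / T))
    # T = 0.5 -> sharper:  ~[0.84, 0.11, 0.04]
    # T = 1.0 -> standard: ~[0.63, 0.23, 0.14]
    # T = 2.0 -> flatter:  ~[0.48, 0.29, 0.23]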

Interactive Exercise

Predict Softmax Output

Given logits [3.0, 3.0, 3.0], what will softmax output?

Hint: What happens when all inputs are equal?

Pro Tips
  • Numerical stability: subtract max(logits) from every logit before exponentiating; the output is unchanged and overflow is avoided
  • Equal logits → equal probabilities (uniform distribution)
  • Adding a constant to all logits doesn't change the softmax output
  • Log-softmax is more numerically stable for loss computation (both tricks are sketched below)
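
A sketch of the stability tricks above (assuming NumPy; the helper names are illustrative):

    import numpy as np

    def stable_softmax(logits):
        shifted = logits - np.max(logits)  # shifting by a constant leaves the output unchanged
        exps = np.exp(shifted)             # no overflow, even for very large logits
        return exps / exps.sum()

    def log_softmax(logits):
        shifted = logits - np.max(logits)
        return shifted - np.log(np.exp(shifted).sum())  # log of softmax, computed stably

    big = np.array([1000.0, 999.0, 998.0])
    print(stable_softmax(big))        # works; a naive np.exp(big) would overflow to inf
    print(stable_softmax(big + 5.0))  # identical output: adding a constant changes nothing
    print(log_softmax(big))           # stable log-probabilities for loss computation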

Related Terms