Softmax
A function that turns a vector of real numbers into a probability distribution — exponentiate each, divide by the sum.
In one line
softmax(x)_i = exp(x_i) / sum_j exp(x_j) — turn raw scores into probabilities that sum to 1.
What it actually means
Softmax takes a vector of “logits” (unnormalized scores) and produces a probability distribution over the same indices. It’s the standard output layer for multi-class classification and the normalization step inside attention (softmax(Q K^T / sqrt(d))). Because exp grows fast, softmax is strongly peaked around the largest logit — a small gap between the top two logits becomes a large gap in probabilities. Dividing logits by a temperature T before softmax controls the sharpness: T < 1 sharpens, T > 1 flattens.
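The definition and the temperature knob can be sketched from scratch in a few lines of PyTorch. This is an illustrative implementation, not the library one; the input values are made up.

```python
import torch

def softmax(x: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Divide by temperature first: T < 1 sharpens, T > 1 flattens.
    z = x / temperature
    # Subtract the max before exp so large logits don't overflow.
    z = z - z.max(dim=-1, keepdim=True).values
    e = torch.exp(z)
    return e / e.sum(dim=-1, keepdim=True)

logits = torch.tensor([2.0, 1.0, 0.5])
print(softmax(logits))       # moderately peaked, sums to 1
print(softmax(logits, 0.5))  # sharper: more mass on the largest logit
print(softmax(logits, 5.0))  # flatter: closer to uniform
```

Note how a 1-point gap between the top two logits already gives the winner most of the probability mass at T = 1.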
Why it matters
Anywhere a model needs to output “which of these” — classification heads, attention weights, token sampling in LLMs — softmax is there. Numerical stability matters: compute softmax by subtracting the max logit before exp, otherwise exp of a large logit overflows to inf and the division yields NaN. Log-softmax exists for the same reason: computing log(softmax(x)) naively underflows small probabilities to 0 before the log, while a fused log-softmax stays stable.
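The overflow failure mode is easy to reproduce; the logit values below are arbitrary, chosen only to be large enough to overflow float32.

```python
import torch

big = torch.tensor([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows to inf, and inf / inf = nan.
naive = torch.exp(big) / torch.exp(big).sum()
print(naive)  # tensor([nan, nan, nan])

# Max-subtracted softmax: same mathematical result, finite arithmetic.
shifted = big - big.max()
stable = torch.exp(shifted) / torch.exp(shifted).sum()
print(stable)  # finite, sums to 1
```

Subtracting a constant from every logit cancels out in the ratio, so the stable version computes exactly the same distribution.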
Example
import torch.nn.functional as F
# logits: raw scores, shape (..., num_classes); normalize along the last dim
probs = F.softmax(logits / temperature, dim=-1)
logp = F.log_softmax(logits, dim=-1)  # pair with F.nll_loss; stabler than log(softmax(x))
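One sanity check worth knowing: PyTorch's F.cross_entropy fuses log-softmax and NLL loss into a single numerically stable call, so the two-step version above should match it. The shapes and random inputs here are made up for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 5)            # batch of 8, 5 classes (arbitrary shapes)
targets = torch.randint(0, 5, (8,))

# cross_entropy == nll_loss applied to log_softmax
a = F.cross_entropy(logits, targets)
b = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
assert torch.allclose(a, b)
```

In practice you pass raw logits to F.cross_entropy and never apply softmax yourself in the loss path.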
You’ll hear it when
- Writing a classifier head.
- Implementing attention from scratch.
- Discussing temperature in LLM sampling.
- Debugging NaNs in a loss curve.