Softmax

Math

A function that turns a vector of real numbers into a probability distribution — exponentiate each, divide by the sum.


In one line

softmax(x)_i = exp(x_i) / sum_j exp(x_j) — turn raw scores into probabilities that sum to 1.

What it actually means

Softmax takes a vector of “logits” (unnormalized scores) and produces a probability distribution over the same indices. It’s the standard output layer for multi-class classification and the normalization step inside attention (softmax(Q K^T / sqrt(d))). Because exp grows fast, softmax is strongly peaked around the largest logit — a small gap between the top two logits becomes a large gap in probabilities. Dividing logits by a temperature T before softmax controls the sharpness: T < 1 sharpens, T > 1 flattens.
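The temperature behavior described above is easy to see numerically. A minimal plain-Python sketch (the helper name `softmax_with_temperature` is mine, not a library function):

```python
import math

def softmax_with_temperature(logits, T):
    # divide logits by T, then exponentiate and normalize;
    # subtracting the max first keeps exp() from overflowing
    scaled = [x / T for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0]  # a gap of 1.0 between the top two scores
for T in (0.5, 1.0, 2.0):
    print(T, softmax_with_temperature(logits, T))
# as T shrinks, the top score's probability grows toward 1;
# as T grows, the distribution flattens toward uniform
```

With a logit gap of only 1.0, T = 0.5 already pushes the top probability near 0.9, while T = 2.0 pulls it back toward 0.6 — the "small gap in logits, large gap in probabilities" effect.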

Why it matters

Anywhere a model needs to output “which of these” — classification heads, attention weights, token sampling in LLMs — softmax is there. Numerical stability matters: compute softmax by subtracting the max logit before exp, otherwise you’ll hit overflow. Log-softmax exists for the same reason.
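The max-subtraction trick mentioned above can be sketched in a few lines of plain Python (the function name `stable_softmax` is mine):

```python
import math

def stable_softmax(xs):
    # subtract the max logit before exponentiating: exp(x - max) is at
    # most exp(0) = 1, so it cannot overflow, and dividing by the sum
    # gives exactly the same probabilities as the naive formula
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# naive math.exp(1000.0) raises OverflowError; the shifted version is fine
print(stable_softmax([1000.0, 1001.0, 1002.0]))
```

Library implementations such as `torch.nn.functional.softmax` apply this shift internally, which is one reason to prefer them over hand-rolled exp-and-divide code.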

Example

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])  # unnormalized scores for 3 classes
temperature = 0.7  # T < 1 sharpens, T > 1 flattens

probs = F.softmax(logits / temperature, dim=-1)  # probabilities, sum to 1
logp  = F.log_softmax(logits, dim=-1)  # use this with NLL loss

You’ll hear it when

  • Writing a classifier head.
  • Implementing attention from scratch.
  • Discussing temperature in LLM sampling.
  • Debugging NaNs in a loss curve.

Related terms