Glossary

Softmax

The Softmax function converts a vector of real-valued scores into a probability distribution. For an input vector $\mathbf{z} = (z_1, \ldots, z_K)$, the softmax output is:

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$

Every output is positive, and the outputs sum to 1. The largest input receives the largest probability, and the ratios between probabilities grow exponentially with the differences between inputs. As the scale of the inputs increases, the softmax becomes increasingly "peaked" around its maximum; as the scale decreases, it approaches a uniform distribution. A temperature parameter $T$ controls this: $\text{softmax}(\mathbf{z}/T)$ is peakier for small $T$ and flatter for large $T$.
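The definition and temperature behavior above can be sketched in a few lines of plain Python (a minimal illustration, not a library implementation; the function name and the max-subtraction trick for numerical stability are choices made here, not part of the glossary entry):

```python
import math

def softmax(z, T=1.0):
    """Softmax with temperature T over a list of real scores."""
    scaled = [x / T for x in z]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])       # positive values summing to 1
cold  = softmax([2.0, 1.0, 0.1], T=0.5)  # lower T: peakier around the max
hot   = softmax([2.0, 1.0, 0.1], T=5.0)  # higher T: closer to uniform
```

Subtracting the maximum leaves the output unchanged (numerator and denominator are both scaled by the same factor) but keeps the exponentials in a safe range.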

Softmax is the standard output activation for multi-class classification, combined with cross-entropy loss. It appears throughout modern deep learning: in the output of language models predicting the next token, in the attention mechanism of transformers (softmax over query-key scores), in reinforcement learning policies (softmax over action logits), and in distillation (softmax with temperature smooths the teacher's distribution). Its differentiability makes it ideal for gradient-based learning, and its probabilistic interpretation aligns naturally with maximum likelihood estimation.

Related terms: Cross-Entropy, Logistic Regression, Attention Mechanism

Also defined in: Textbook of AI