Softmax, Glossary, Textbook of AI

The softmax function maps a vector of $K$ real-valued logits $z = (z_1, \ldots, z_K) \in \mathbb{R}^K$ to a probability distribution over $K$ classes:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$

The output satisfies $\sum_i \mathrm{softmax}(z)_i = 1$ and each component lies in $(0, 1)$.

Softmax is the standard final layer of a multi-class classifier and the heart of the attention mechanism $\mathrm{softmax}(QK^\top / \sqrt{d_k}) V$ at the core of the Transformer. It is the natural generalisation of the sigmoid: for $K=2$ classes, softmax with logits $(z_1, z_2)$ assigns the first class the same probability as the sigmoid applied to $z_1 - z_2$.
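The two-class equivalence can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `softmax` and `sigmoid` and the logit values are illustrative choices, not taken from any particular framework.

```python
import math

def softmax(z):
    """Naive softmax: exponentiate each logit, then normalise."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative logits for a two-class problem.
z1, z2 = 2.0, -0.5
p = softmax([z1, z2])

# The first softmax component and sigmoid(z1 - z2) agree up to rounding,
# and the two components sum to one.
print(p[0], sigmoid(z1 - z2))
print(sum(p))
```

The agreement follows from dividing the numerator and denominator of $e^{z_1} / (e^{z_1} + e^{z_2})$ by $e^{z_1}$.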

The Jacobian is

$$\frac{\partial \mathrm{softmax}(z)_i}{\partial z_j} = \mathrm{softmax}(z)_i (\delta_{ij} - \mathrm{softmax}(z)_j)$$

where $\delta_{ij}$ is the Kronecker delta. Combined with cross-entropy loss, this gives the clean gradient $\partial L / \partial z_i = p_i - y_i$, where $p$ is the predicted distribution and $y$ the one-hot target. This simplicity is what makes softmax with cross-entropy the canonical classification objective.
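One way to gain confidence in the $p_i - y_i$ gradient is to compare it with a finite-difference estimate. This is a sketch under stated assumptions: the helper names (`cross_entropy`, `numerical_grad`), the step size `eps`, and the example logits are all hypothetical choices made for illustration.

```python
import math

def softmax(z):
    m = max(z)  # shift by the maximum logit for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """Cross-entropy -sum_i y_i log p_i with p = softmax(z)."""
    p = softmax(z)
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

def numerical_grad(z, y, eps=1e-6):
    """Central-difference estimate of dL/dz_i for each logit."""
    grad = []
    for i in range(len(z)):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[i] += eps
        z_minus[i] -= eps
        grad.append((cross_entropy(z_plus, y) - cross_entropy(z_minus, y)) / (2 * eps))
    return grad

z = [1.0, -2.0, 0.5]   # illustrative logits
y = [0.0, 0.0, 1.0]    # one-hot target for the third class

analytic = [pi - yi for pi, yi in zip(softmax(z), y)]
numeric = numerical_grad(z, y)
print(analytic)
print(numeric)
```

The two lists should agree to several decimal places; the small residual comes from the finite step size and floating-point rounding.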

Numerical stability requires the log-sum-exp trick. Rather than computing $e^{z_i}$ directly, which overflows for large $z_i$, subtract the maximum logit from every entry first:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_j e^{z_j - \max_k z_k}}$$
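The effect of the shift can be seen on logits that are too large to exponentiate directly. The sketch below uses Python's `math.exp`, which raises `OverflowError` on overflow; the function names and the example logits are illustrative assumptions.

```python
import math

def softmax_naive(z):
    """Direct evaluation of the definition; overflows for large logits."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    """Shifted evaluation: subtract the maximum logit before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # every exponent is <= 0
    total = sum(exps)                    # >= 1, because the largest logit contributes exp(0)
    return [e / total for e in exps]

big = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(big)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

stable = softmax_stable(big)
reference = softmax_stable([0.0, 1.0, 2.0])  # same logits shifted by -1000

print(naive_overflowed)
print(stable)
print(reference)
```

Because the shifted exponents never exceed zero, the largest term is exactly $e^0 = 1$, so the denominator cannot overflow or vanish.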

Modern implementations use this shifted form throughout. Temperature scaling $\mathrm{softmax}(z/T)$ with $T > 0$ controls the sharpness: as $T \to 0^+$ the output approaches the one-hot argmax (when the maximum logit is unique); as $T \to \infty$ it approaches the uniform distribution. Temperature is the standard sampling control in language-model decoding.
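The two temperature limits can be illustrated with a small experiment. This is a minimal sketch; the function name `softmax_with_temperature`, the logits, and the temperature values are hypothetical choices for illustration.

```python
import math

def softmax_with_temperature(z, T):
    """Compute softmax(z / T) using the max-subtraction form."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # illustrative logits with a unique maximum

sharp = softmax_with_temperature(logits, 0.01)   # low temperature: close to one-hot
plain = softmax_with_temperature(logits, 1.0)    # ordinary softmax
flat = softmax_with_temperature(logits, 100.0)   # high temperature: close to uniform

print(sharp)
print(plain)
print(flat)
```

In sampling-based decoding, lowering $T$ concentrates probability on the highest-scoring tokens, while raising it spreads probability more evenly.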

Mathematics

For logits $z \in \mathbb{R}^K$, softmax produces a distribution

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}.$$

The Jacobian is

$$\frac{\partial \mathrm{softmax}(z)_i}{\partial z_j} = \mathrm{softmax}(z)_i \, (\delta_{ij} - \mathrm{softmax}(z)_j).$$
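As a sanity check, the Jacobian formula can be assembled as a $K \times K$ matrix and compared against central differences. The sketch below is illustrative only; `softmax_jacobian`, `numerical_jacobian`, and the step size are assumed names and values.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    K = len(p)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(K)]
            for i in range(K)]

def numerical_jacobian(z, eps=1e-6):
    """Central-difference estimate of d softmax(z)_i / d z_j."""
    K = len(z)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        p_plus, p_minus = softmax(z_plus), softmax(z_minus)
        for i in range(K):
            J[i][j] = (p_plus[i] - p_minus[i]) / (2 * eps)
    return J

z = [0.3, -1.2, 2.0]  # illustrative logits
J = softmax_jacobian(z)
J_num = numerical_jacobian(z)

max_diff = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
row_sums = [sum(row) for row in J]
print(max_diff)
print(row_sums)
```

Each row of the Jacobian sums to zero, which reflects the fact that adding the same constant to every logit leaves the output unchanged.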

Combined with cross-entropy loss $\mathcal{L} = -\sum_i y_i \log p_i$ where $p = \mathrm{softmax}(z)$ and $y$ is a one-hot target, the gradient simplifies dramatically:

$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i.$$

This is why softmax + cross-entropy is the canonical classification objective: the gradient is simply the prediction error, with no need for explicit Jacobian computation.
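For readers who want the intermediate steps, the gradient follows from the chain rule together with the softmax Jacobian, written here as $\partial p_k / \partial z_i = p_k (\delta_{ki} - p_i)$, and the fact that a one-hot target satisfies $\sum_k y_k = 1$:

$$\frac{\partial \mathcal{L}}{\partial z_i} = -\sum_k \frac{y_k}{p_k} \frac{\partial p_k}{\partial z_i} = -\sum_k \frac{y_k}{p_k} \, p_k (\delta_{ki} - p_i) = -\sum_k y_k \delta_{ki} + p_i \sum_k y_k = p_i - y_i.$$

The same argument holds for any target $y$ whose entries sum to one, not only one-hot vectors.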

Numerical stability is obtained via the log-sum-exp trick: subtract $\max_k z_k$ inside the exponential to keep values in a stable range, exploiting the invariance $\mathrm{softmax}(z + c) = \mathrm{softmax}(z)$ for any constant $c$ added to every logit:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_j e^{z_j - \max_k z_k}}.$$
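The shift invariance behind this form can be verified in one line, since the common factor $e^{c}$ cancels between numerator and denominator:

$$\mathrm{softmax}(z + c)_i = \frac{e^{z_i + c}}{\sum_j e^{z_j + c}} = \frac{e^{c} \, e^{z_i}}{e^{c} \sum_j e^{z_j}} = \mathrm{softmax}(z)_i.$$

Choosing $c = -\max_k z_k$ gives the stable expression, in which every exponent is at most zero.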

Temperature $T > 0$ controls sharpness: $\mathrm{softmax}(z/T)$ approaches the one-hot $\arg\max$ as $T \to 0^+$ (when the maximum logit is unique) and approaches the uniform distribution as $T \to \infty$. Temperature is the standard sampling control in language-model decoding.

Interactive

Self-attention as Q–K–V dot products. Query, key and value vectors produce an attention matrix over four tokens.


Related terms: Sigmoid Function, Cross-Entropy Loss, Attention Mechanism

Discussed in:

Chapter 9: Neural Networks, Activation Functions
