The softmax function maps a vector of $K$ real-valued logits $z = (z_1, \ldots, z_K) \in \mathbb{R}^K$ to a probability distribution over $K$ classes:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$$
The output satisfies $\sum_i \mathrm{softmax}(z)_i = 1$ and each component lies in $(0, 1)$.
Softmax is the standard final layer of a multi-class classifier and the heart of the attention mechanism $\mathrm{softmax}(QK^\top / \sqrt{d_k}) V$ used in Transformers. It is the natural generalisation of sigmoid: for $K=2$ classes, softmax with logits $(z_1, z_2)$ assigns class 1 the same probability as sigmoid applied to $z_1 - z_2$.
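The two-class equivalence can be checked numerically. The following is a minimal sketch in plain Python; the helper names `softmax` and `sigmoid` and the example logits are illustrative choices, not part of any particular library.

```python
import math

def softmax(z):
    """Softmax of a list of floats (shifted by the max so exp cannot overflow)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

z1, z2 = 1.3, -0.4  # arbitrary illustrative logits
p = softmax([z1, z2])

# The class-1 probability matches sigmoid of the logit difference,
# and the class-2 probability matches sigmoid of the negated difference.
assert math.isclose(p[0], sigmoid(z1 - z2))
assert math.isclose(p[1], sigmoid(z2 - z1))
```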
The Jacobian is
$$\frac{\partial \mathrm{softmax}(z)_i}{\partial z_j} = \mathrm{softmax}(z)_i (\delta_{ij} - \mathrm{softmax}(z)_j)$$
where $\delta_{ij}$ is the Kronecker delta. Combined with cross-entropy loss, this gives the clean gradient $\partial L / \partial z_i = p_i - y_i$, where $p$ is the predicted distribution and $y$ the one-hot target. This simplicity is what makes softmax + cross-entropy the canonical classification objective.
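One way to sanity-check the Jacobian formula is to compare it against central finite differences. The sketch below assumes plain Python lists; the function names `softmax_jacobian` and `numerical_jacobian`, the step size, and the test logits are hypothetical choices made for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_jacobian(z):
    """Analytic Jacobian: J[i][j] = p_i * (delta_ij - p_j)."""
    p = softmax(z)
    K = len(z)
    return [[p[i] * ((1.0 if i == j else 0.0) - p[j]) for j in range(K)]
            for i in range(K)]

def numerical_jacobian(z, eps=1e-6):
    """Central finite differences, one input coordinate j at a time."""
    K = len(z)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        z_plus = list(z)
        z_plus[j] += eps
        z_minus = list(z)
        z_minus[j] -= eps
        p_plus, p_minus = softmax(z_plus), softmax(z_minus)
        for i in range(K):
            J[i][j] = (p_plus[i] - p_minus[i]) / (2 * eps)
    return J

z = [0.5, -1.0, 2.0]  # arbitrary illustrative logits
J = softmax_jacobian(z)
J_num = numerical_jacobian(z)
max_err = max(abs(J[i][j] - J_num[i][j]) for i in range(3) for j in range(3))
assert max_err < 1e-8
```

Because the outputs always sum to one, each column of the Jacobian sums to zero, which gives a second quick check.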
Numerical stability requires the log-sum-exp trick: rather than computing $e^{z_i}$ directly, which overflows for large $z_i$, subtract the maximum logit first:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_j e^{z_j - \max_k z_k}}$$
Standard implementations apply this shift routinely. Temperature scaling $\mathrm{softmax}(z/T)$ controls the sharpness: as $T \to 0^+$ the output approaches the one-hot argmax, and as $T \to \infty$ it approaches the uniform distribution. Temperature is the standard sampling control in language-model decoding.
Mathematics
For logits $z \in \mathbb{R}^K$, softmax produces a distribution
$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}.$$
The Jacobian is
$$\frac{\partial \mathrm{softmax}(z)_i}{\partial z_j} = \mathrm{softmax}(z)_i \, (\delta_{ij} - \mathrm{softmax}(z)_j).$$
Combined with cross-entropy loss $\mathcal{L} = -\sum_i y_i \log p_i$ where $p = \mathrm{softmax}(z)$ and $y$ is a one-hot target, the gradient simplifies dramatically:
$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i.$$
This is why softmax + cross-entropy is the canonical classification objective: the gradient is simply the prediction error, with no need for explicit Jacobian computation.
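The intermediate step can be written out with the chain rule, using the Jacobian given earlier and the fact that a one-hot target satisfies $\sum_k y_k = 1$:
$$\frac{\partial \mathcal{L}}{\partial z_i} = -\sum_k \frac{y_k}{p_k} \frac{\partial p_k}{\partial z_i} = -\sum_k \frac{y_k}{p_k} \, p_k \, (\delta_{ki} - p_i) = -y_i + p_i \sum_k y_k = p_i - y_i.$$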
Numerical stability via the log-sum-exp trick: subtract $\max_k z_k$ inside the exponential to keep values in a stable range, exploiting the invariance $\mathrm{softmax}(z + c) = \mathrm{softmax}(z)$ for any constant $c$:
$$\mathrm{softmax}(z)_i = \frac{e^{z_i - \max_k z_k}}{\sum_j e^{z_j - \max_k z_k}}.$$
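The effect of the shift can be seen in a small sketch. It assumes plain Python, where `math.exp` raises `OverflowError` on very large inputs (array libraries typically return `inf` and then `nan` instead); the names `softmax_naive` and `softmax_stable` are illustrative.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]  # math.exp raises OverflowError for large v
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # every exponent is <= 0
    total = sum(exps)  # >= 1, because the max term contributes exp(0) = 1
    return [e / total for e in exps]

z = [1000.0, 999.0, 998.0]  # logits large enough to overflow exp

try:
    softmax_naive(z)
    naive_ok = True
except OverflowError:
    naive_ok = False

p = softmax_stable(z)
# Shift invariance: the result equals softmax of the small logits [2, 1, 0].
q = softmax_stable([2.0, 1.0, 0.0])

assert not naive_ok
assert all(math.isclose(a, b) for a, b in zip(p, q))
```

The stable version cannot overflow, and its denominator is at least 1, so the division is always well defined.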
Temperature $T > 0$ controls sharpness: $\mathrm{softmax}(z/T)$ approaches the one-hot $\arg\max$ as $T \to 0$ and approaches the uniform distribution as $T \to \infty$. Temperature is the standard sampling control in language-model decoding.
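Both limits are easy to observe numerically. The sketch below is illustrative only: the function name `softmax_with_temperature`, the example logits, and the two temperatures are assumptions chosen to make the effect visible.

```python
import math

def softmax_with_temperature(z, T=1.0):
    """Softmax of z / T, with the usual max shift for stability."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / T for v in z]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # arbitrary illustrative logits

sharp = softmax_with_temperature(logits, T=0.05)   # low T: close to one-hot
flat = softmax_with_temperature(logits, T=100.0)   # high T: close to uniform

assert sharp[0] > 0.999
assert max(flat) - min(flat) < 0.01
```

In sampling-based decoding, the next token would be drawn from the resulting distribution, so a lower temperature concentrates probability on the highest-scoring tokens and a higher one spreads it out.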
Related terms: Sigmoid Function, Cross-Entropy Loss, Attention Mechanism
Discussed in:
- Chapter 9: Neural Networks, Activation Functions