Also known as: log loss, negative log-likelihood, NLL
Cross-entropy is the standard loss function for classification models. For a true distribution $p$ over $K$ classes and a predicted distribution $q$ produced by the model, it is
$$H(p, q) = -\sum_{i=1}^K p_i \log q_i.$$
In supervised classification, the true distribution is usually one-hot ($p_y = 1$ for the correct class $y$ and zero elsewhere), so the loss reduces to $-\log q_y$, the negative log-likelihood of the correct class.
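As a quick numerical check of this reduction, the following minimal sketch evaluates the full sum and the single-term form on one example. The function name `cross_entropy` and the probabilities in `q` are illustrative assumptions, not taken from any library.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i), in nats.

    Terms with p_i == 0 are skipped, following the convention 0 * log(q) = 0.
    """
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Hypothetical model prediction over K = 3 classes (values chosen for illustration).
q = [0.7, 0.2, 0.1]
y = 1  # index of the correct class

# One-hot true distribution: 1 at the correct class, 0 elsewhere.
p = [1.0 if i == y else 0.0 for i in range(len(q))]

loss_full = cross_entropy(p, q)   # full sum over all K classes
loss_one_hot = -math.log(q[y])    # reduced form, -log q_y
```

Both values equal $-\log 0.2 \approx 1.609$ nats, since every term except the one for class $y$ is multiplied by zero.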
Cross-entropy is the natural loss for classification because minimizing it is equivalent to maximum-likelihood estimation under a categorical model of the labels. For training data $\{(x_n, y_n)\}_{n=1}^N$ and model $q_\theta(\cdot \mid x)$,
$$\theta^* = \arg\max_\theta \prod_n q_\theta(y_n \mid x_n) = \arg\min_\theta \sum_n -\log q_\theta(y_n \mid x_n).$$
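The equivalence can be illustrated on a toy problem. The sketch below makes a simplifying assumption that is not in the text above: the model ignores the input $x$ and is a single categorical distribution, so the maximum-likelihood solution is just the empirical class frequencies. The dataset `labels` and the comparison distribution `q_other` are made up for illustration.

```python
import math

def nll(labels, q):
    """Summed negative log-likelihood of the labels under distribution q."""
    return sum(-math.log(q[y]) for y in labels)

# Hypothetical dataset of class labels over K = 3 classes.
labels = [0, 0, 1, 2, 0, 1]
K = 3

# For an input-independent categorical model, the MLE is the empirical frequency.
counts = [labels.count(k) for k in range(K)]
q_mle = [c / len(labels) for c in counts]

# The product of likelihoods and the sum of negative logs describe the same objective.
likelihood = math.prod(q_mle[y] for y in labels)
nll_from_product = -math.log(likelihood)
nll_from_sum = nll(labels, q_mle)

# Any other distribution should give a strictly larger summed NLL.
q_other = [0.4, 0.4, 0.2]
nll_other = nll(labels, q_other)
```

Here `nll_from_product` and `nll_from_sum` agree up to floating-point error, and `nll_other` is larger than both, as expected when `q_mle` maximizes the likelihood.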
Combined with softmax outputs, cross-entropy yields the clean gradient
$$\frac{\partial L}{\partial z_i} = q_i - p_i$$
where $z$ are the pre-softmax logits: the gradient is exactly the (signed) error in predicted probability. This is the gradient with respect to the logits that the softmax cross-entropy losses in PyTorch, TensorFlow and JAX produce, up to the scaling introduced by batch reduction.
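The gradient identity can be checked numerically. The sketch below compares $q_i - p_i$ against a central finite-difference estimate of the loss gradient. The helper names `softmax` and `loss`, the logits `z`, and the step size `eps` are assumptions chosen for illustration.

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted first to avoid overflow in exp."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, p):
    """Cross-entropy between target p and softmax(z), in nats."""
    q = softmax(z)
    return -sum(p_i * math.log(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

z = [2.0, -1.0, 0.5]   # hypothetical logits
p = [0.0, 0.0, 1.0]    # one-hot target, correct class is index 2

# Analytic gradient: dL/dz_i = q_i - p_i
analytic = [q_i - p_i for q_i, p_i in zip(softmax(z), p)]

# Central finite differences as an independent estimate of the same gradient.
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[i] += eps
    z_minus[i] -= eps
    numeric.append((loss(z_plus, p) - loss(z_minus, p)) / (2 * eps))
```

The two lists agree to within finite-difference error. The components of `analytic` also sum to zero, because both $q$ and $p$ sum to one.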
Binary cross-entropy for a single label $y \in \{0, 1\}$ with predicted probability $q = \sigma(z)$ is
$$L = -y \log q - (1-y) \log(1-q)$$
which is what nn.BCEWithLogitsLoss computes from the logit $z$ (with the sigmoid fused into the loss for numerical stability via the log-sum-exp trick).
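To show why fusing the sigmoid helps, here is a minimal sketch of one common stable formulation, $L = \max(z, 0) - zy + \log(1 + e^{-|z|})$, which is algebraically equal to the binary cross-entropy evaluated at $\sigma(z)$. It is an illustration of the idea and not necessarily the exact code path PyTorch uses. The function names `bce_with_logits` and `bce_naive` are hypothetical.

```python
import math

def bce_with_logits(z, y):
    """Stable binary cross-entropy computed directly from the logit z.

    Uses max(z, 0) - z*y + log(1 + exp(-|z|)); the exponent is never
    positive, so exp cannot overflow.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def bce_naive(z, y):
    """Direct formula: apply the sigmoid, then take logs.

    Only safe for moderate |z|; for large |z| the sigmoid rounds to
    exactly 0.0 or 1.0 and the logarithm is undefined.
    """
    s = 1.0 / (1.0 + math.exp(-z))
    return -y * math.log(s) - (1 - y) * math.log(1 - s)

# For a moderate logit the two formulations agree.
moderate_stable = bce_with_logits(0.3, 1)
moderate_naive = bce_naive(0.3, 1)

# For an extreme logit the stable form still returns a finite loss.
extreme_stable = bce_with_logits(-1000.0, 1)
```

At `z = -1000.0` with `y = 1` the stable form returns a loss of 1000, whereas the naive form cannot be evaluated there in double precision.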
For language models, the cross-entropy on a held-out test set is the average negative log-likelihood per token, $L$. With $L$ measured in nats, it defines perplexity as $\exp(L)$, a standard evaluation metric for LMs.
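A small worked example makes the relationship concrete. The sketch below assumes we already have the probability the model assigned to each actual next token; the values in `token_probs` are invented for illustration.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability per token (natural log)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical probabilities assigned to the correct next token at 4 positions.
token_probs = [0.25, 0.5, 0.125, 0.5]
ppl = perplexity(token_probs)

# Sanity check: a model that is uniform over 4 options has perplexity 4.
ppl_uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

Here the product of the probabilities is $2^{-7}$, so `ppl` equals $2^{7/4} \approx 3.36$. Perplexity can be read as the effective number of equally likely choices the model is deciding between at each token.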
Related terms: Softmax, KL Divergence, Shannon Entropy, Maximum Likelihood Estimation, Perplexity
Discussed in:
- Chapter 7: Supervised Learning, Loss Functions