Cross-Entropy measures the average number of bits needed to encode samples from a true distribution $p$ using a code optimised for a model distribution $q$: $H(p, q) = -\sum_x p(x) \log q(x)$. It decomposes as $H(p, q) = H(p) + D_{KL}(p \parallel q)$, so minimising cross-entropy is equivalent to minimising KL divergence when the true distribution is fixed.
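The decomposition can be checked numerically with a small sketch (distributions `p` and `q` here are illustrative; logs in base 2 give bits):

```python
import math

def cross_entropy(p, q):
    """H(p, q): average bits to encode samples from p with a code optimised for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) is cross-entropy of p with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D_KL(p || q), computed directly from its definition."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (example values)
q = [0.4, 0.4, 0.2]     # model distribution (example values)

# H(p, q) = H(p) + D_KL(p || q), so H(p, q) >= H(p) with equality iff p = q.
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```

Since $D_{KL}(p \parallel q) \ge 0$, cross-entropy is always at least the entropy of $p$, with equality exactly when $q = p$.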
In machine learning, the cross-entropy loss is the standard objective for classification. Given a dataset of examples with true labels, setting $p$ to the empirical distribution (one-hot for each example) and $q$ to the model's predicted probabilities yields:
$$L = -\frac{1}{N}\sum_i \sum_k y_{i,k} \log \hat{p}_{i,k}$$
For binary classification, this reduces to the familiar binary cross-entropy: $-y \log \hat{p} - (1 - y) \log(1 - \hat{p})$. Minimising cross-entropy is equivalent to maximising the log-likelihood of the data under the model, establishing a direct bridge between information theory and maximum likelihood estimation.
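Both losses can be sketched directly from the formulas above; the function names and example probabilities are illustrative, not from any particular library:

```python
import math

def cross_entropy_loss(y_onehot, p_hat):
    """L = -(1/N) sum_i sum_k y_ik log p_ik, with one-hot true labels."""
    n = len(y_onehot)
    return -sum(
        y * math.log(p)
        for yi, pi in zip(y_onehot, p_hat)
        for y, p in zip(yi, pi)
        if y > 0           # only the true-class term contributes for one-hot y
    ) / n

def binary_cross_entropy(y, p_hat):
    """Binary special case: -y log p - (1 - y) log(1 - p)."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

# Two examples over three classes: the loss is the mean negative
# log-probability the model assigns to each true class.
y = [[1, 0, 0], [0, 1, 0]]
p = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy_loss(y, p)   # = -(log 0.7 + log 0.8) / 2
```

Because only the true-class term survives for one-hot labels, the loss is simply the average negative log-likelihood, which is why minimising it is maximum likelihood estimation.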
Cross-entropy has several advantages over squared-error loss for classification: it gives larger gradients when predictions are badly wrong (accelerating learning), it is a proper scoring rule (minimised in expectation only by reporting the true class probabilities), and it works naturally with the softmax output layer. Nearly every modern neural network classifier—from image classifiers to language models predicting the next token—is trained by minimising cross-entropy.
Related terms: Entropy, KL Divergence, Loss Function, Maximum Likelihood Estimation, Softmax
Discussed in:
- Chapter 4: Probability — Information Theory
Also defined in: Textbook of AI