Cross-Entropy measures the average number of bits needed to encode samples from a true distribution $p$ using a code optimised for a model distribution $q$: $H(p, q) = -\sum_x p(x) \log q(x)$. It decomposes as $H(p, q) = H(p) + D_{KL}(p \parallel q)$, so minimising cross-entropy is equivalent to minimising KL divergence when the true distribution is fixed.
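The decomposition can be checked numerically with a small sketch (distributions `p` and `q` here are illustrative; logs in base 2 give bits):

```python
import math

def cross_entropy(p, q):
    """H(p, q): average bits to encode samples from p with a code optimised for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) is cross-entropy of p with itself."""
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D_KL(p || q), computed directly from its definition."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # "true" distribution (example values)
q = [0.4, 0.4, 0.2]     # model distribution (example values)

# H(p, q) = H(p) + D_KL(p || q), so H(p, q) >= H(p) with equality iff p = q.
assert abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12
```

Since $D_{KL}(p \parallel q) \ge 0$, cross-entropy is always at least the entropy of $p$, with equality exactly when $q = p$.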
In machine learning, the cross-entropy loss is the standard objective for classification. Given a dataset of examples with true labels, setting $p$ to the empirical distribution (one-hot for each example) and $q$ to the model's predicted probabilities yields:
$$L = -\frac{1}{N}\sum_i \sum_k y_{i,k} \log \hat{p}_{i,k}$$
For binary classification, this reduces to the familiar binary cross-entropy: $-y \log \hat{p} - (1 - y) \log(1 - \hat{p})$. Minimising cross-entropy is equivalent to maximising the log-likelihood of the data under the model, establishing a direct bridge between information theory and maximum likelihood estimation.
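Both losses can be sketched directly from the formulas above; the function names and example probabilities are illustrative, not from any particular library:

```python
import math

def cross_entropy_loss(y_onehot, p_hat):
    """L = -(1/N) sum_i sum_k y_ik log p_ik, with one-hot true labels."""
    n = len(y_onehot)
    return -sum(
        y * math.log(p)
        for yi, pi in zip(y_onehot, p_hat)
        for y, p in zip(yi, pi)
        if y > 0           # only the true-class term contributes for one-hot y
    ) / n

def binary_cross_entropy(y, p_hat):
    """Binary special case: -y log p - (1 - y) log(1 - p)."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

# Two examples over three classes: the loss is the mean negative
# log-probability the model assigns to each true class.
y = [[1, 0, 0], [0, 1, 0]]
p = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = cross_entropy_loss(y, p)   # = -(log 0.7 + log 0.8) / 2
```

Because only the true-class term survives for one-hot labels, the loss is simply the average negative log-likelihood, which is why minimising it is maximum likelihood estimation.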
Cross-entropy has several advantages over squared-error loss for classification: it gives larger gradients when predictions are badly wrong (accelerating learning), it is a proper scoring rule (minimised in expectation only by reporting the true class probabilities), and it works naturally with the softmax output layer. Nearly every modern neural network classifier—from image classifiers to language models predicting the next token—is trained by minimising cross-entropy.
Related terms: Entropy, KL Divergence, Loss Function, Maximum Likelihood Estimation, Softmax
Discussed in:
- Chapter 4: Probability — Information Theory
Also defined in: Textbook of AI