9.9 Loss functions
A loss function is the single most important number in the life of a neural network. It collapses an entire prediction, whether a forecast house price, a probability of pneumonia, or a distribution over a thousand image classes, into one non-negative real number that quantifies how wrong the network was on a particular training example. The whole machinery of training, every gradient descent step, every backpropagated derivative, exists for one purpose: to nudge the parameters of the network in the direction that makes this number smaller. If the loss is the wrong shape for the problem at hand, no amount of clever optimisation will rescue the model. A network trained with squared error on a classification task will, in a literal sense, learn the wrong thing, because the meaning of "wrong" depends entirely on whether the output is a real number, a probability, or a class label.
Section 9.6 introduced backpropagation using a generic symbol $\mathcal{L}$ for the loss, deferring its specific form. This section fills that gap. The choice is not arbitrary. Each canonical loss arises naturally from a probabilistic assumption about the data, and each pairs cleanly with a specific output activation function. Sigmoid output paired with binary cross-entropy is, for all practical purposes, a single unit; softmax output paired with categorical cross-entropy is another. The pairings exist because they yield gradients of the form $\hat{\mathbf{p}} - \mathbf{y}$, a clean, bounded difference between what the network predicted and what the data demanded. Fight the pairing and you fight the gradient itself, slowing or stalling training.
What a loss function is, in plain words
A loss function is a rule that takes a target and a prediction and returns a single non-negative number. The number is zero when the prediction is exactly right and grows as the prediction worsens. Formally, $\mathcal{L} : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$, where $\mathcal{Y}$ is the space of possible outputs. Training a neural network means finding the parameters $\theta$ that minimise the expected loss over the data distribution. We never see the data distribution directly; we approximate the expectation by averaging the loss over a finite training set, and we minimise this empirical average using gradient descent.
In practice we work in mini-batches. The per-batch loss is the average of the per-example losses across the batch. Suppose a batch of four training examples produces per-example losses of $0.2, 0.5, 0.1, 0.3$. The batch loss is
$$\frac{0.2 + 0.5 + 0.1 + 0.3}{4} = \frac{1.1}{4} = 0.275.$$
The optimiser computes the gradient of this $0.275$ with respect to every parameter in the network and steps the parameters in the direction that reduces it. If the next batch yields a loss of $0.260$, training is making progress; if the loss plateaus or rises, something needs investigation: wrong learning rate, wrong loss, or wrong architecture for the problem.
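The arithmetic is trivial, but worth seeing once in code; a minimal sketch in plain Python using the four per-example losses above:

```python
# A minimal sketch of per-batch loss averaging, using the four
# per-example losses from the text above.
per_example_losses = [0.2, 0.5, 0.1, 0.3]

batch_loss = sum(per_example_losses) / len(per_example_losses)
print(batch_loss)  # 0.275 -- the single number the optimiser differentiates
```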
The choice of loss matters because it defines what the network is rewarded and penalised for. A regression loss says "make the predicted real number as close as possible to the target real number". A classification loss says "make the predicted probability of the true class as close to one as possible". These look superficially similar but produce dramatically different gradients and dramatically different trained models. Two further constraints sharpen the choice. First, the loss should be differentiable almost everywhere, so that backpropagation can compute gradients. Second, the loss should cooperate with the output activation: pairings such as sigmoid + binary cross-entropy and softmax + categorical cross-entropy are not historical accidents but mathematical conveniences that produce especially well-behaved gradients.
Maximum likelihood: where losses come from
Where do the standard losses actually come from? The answer is that almost every loss in common use is the negative log-likelihood of a probabilistic model of the target. This is the single deepest idea in this section, and it unifies what would otherwise look like a zoo of unrelated formulae.
Suppose we believe that the targets in our dataset are samples from a conditional distribution $p(y \mid \mathbf{x}; \theta)$, where $\theta$ collects the parameters of the network. Maximum likelihood seeks the parameters that make the observed data most probable:
$$\hat\theta = \arg\max_\theta \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \theta).$$
Products of many small numbers underflow on a computer, so we take the logarithm, which is monotonic and therefore leaves the maximiser unchanged, and flip the sign so we have a quantity to minimise:
$$\hat\theta = \arg\min_\theta \; -\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i; \theta).$$
This is the negative log-likelihood. Different assumptions about the form of $p(y \mid \mathbf{x}; \theta)$ produce different losses:
- If $y$ is a real number and we assume $y \mid \mathbf{x} \sim \mathcal{N}(\hat y, \sigma^2)$, a Gaussian centred on the network's prediction with fixed variance, then $\log p(y \mid \mathbf{x}; \theta) = -\frac{(y - \hat y)^2}{2 \sigma^2} + \text{const}$. The negative log-likelihood, dropping the constant and absorbing $\sigma^2$ into the learning rate, is the squared-error loss.
- If $y \in \{0, 1\}$ and we assume $y \sim \text{Bernoulli}(\hat p)$ where $\hat p = \sigma(z)$ is a sigmoid of the network's logit, then $\log p(y) = y \log \hat p + (1-y) \log (1 - \hat p)$. The negative is binary cross-entropy.
- If $y$ is one of $K$ classes and we assume $y \sim \text{Categorical}(\hat{\mathbf{p}})$ with $\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{z})$, then $\log p(y = c) = \log \hat p_c$. The negative is categorical cross-entropy.
Three different distributions on $y$, three different losses, but one common origin. The maximum-likelihood lens has two practical consequences. First, any loss can be reverse-engineered to expose the probabilistic assumption it implies, which is a useful sanity check: if your problem has heavy-tailed errors, a Gaussian assumption (and therefore squared error) will be misleading. Second, the lens suggests how to invent a loss for a new situation: write down what you believe about the distribution of the target, take the negative log-likelihood, and you have a loss that is consistent with your beliefs. The same machinery extends to entire generative models: variational autoencoders, normalising flows, and diffusion models all train by minimising particular negative log-likelihoods (or bounds thereon).
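The Gaussian bullet is easy to verify numerically. The sketch below, with illustrative residuals that reappear in the MSE worked example, shows that the negative log-likelihood and the half squared error differ only by a constant, which the gradient never sees:

```python
import math

# A numerical check that Gaussian NLL and half squared error differ
# only by an additive constant (sigma fixed at 1).
def gaussian_nll(y, y_hat, sigma=1.0):
    return (y - y_hat) ** 2 / (2 * sigma**2) + 0.5 * math.log(2 * math.pi * sigma**2)

for y, y_hat in [(2.0, 2.1), (1.0, 0.5), (0.0, -0.3)]:
    nll = gaussian_nll(y, y_hat)
    half_se = 0.5 * (y - y_hat) ** 2
    print(f"nll={nll:.4f}  half_se={half_se:.4f}  diff={nll - half_se:.4f}")
    # diff is always 0.9189... = 0.5 * log(2*pi): invisible to the gradient
```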
Mean squared error (regression)
Mean squared error, MSE, is the workhorse loss for regression, the task of predicting a continuous number. For a single example with target $y$ and prediction $\hat y$,
$$\mathcal{L}_{\text{MSE}}(y, \hat y) = (y - \hat y)^2.$$
For a batch of $N$ examples, the mean squared error is the average of the per-example squared errors,
$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat y_i)^2.$$
A common variant uses a factor of one half, $\tfrac{1}{2}(y - \hat y)^2$. The half is conventional and slightly cosmetic: differentiating $\tfrac{1}{2}(y - \hat y)^2$ with respect to $\hat y$ gives the clean residual $\hat y - y$ rather than $2(\hat y - y)$. PyTorch's nn.MSELoss omits the half; the discrepancy is absorbed into the learning rate and changes nothing essential.
The probabilistic justification: if we model $y \mid \mathbf{x} \sim \mathcal{N}(\hat y, \sigma^2)$ for fixed $\sigma$, the log-likelihood of one observation is
$$\log p(y \mid \mathbf{x}) = -\frac{(y - \hat y)^2}{2\sigma^2} - \frac{1}{2}\log(2\pi\sigma^2).$$
The second term does not depend on $\theta$, so minimising negative log-likelihood is exactly equivalent to minimising $(y - \hat y)^2$. MSE silently assumes Gaussian residuals.
Worked example. Suppose a network predicts house prices in millions of pounds. For three test examples,
| $i$ | target $y_i$ | prediction $\hat y_i$ | residual $e_i$ | $e_i^2$ |
|---|---|---|---|---|
| 1 | 2.0 | 2.1 | $-0.1$ | 0.01 |
| 2 | 1.0 | 0.5 | $\phantom{-}0.5$ | 0.25 |
| 3 | 0.0 | $-0.3$ | $\phantom{-}0.3$ | 0.09 |
The mean squared error over this batch is
$$\mathcal{L}_{\text{MSE}} = \frac{0.01 + 0.25 + 0.09}{3} = \frac{0.35}{3} \approx 0.1167.$$
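The same number falls out of PyTorch directly; a minimal check with nn.MSELoss, which, as noted above, omits the half and averages over the batch:

```python
import torch
import torch.nn as nn

# The worked example via nn.MSELoss (no half-factor, mean over the batch).
targets     = torch.tensor([2.0, 1.0, 0.0])
predictions = torch.tensor([2.1, 0.5, -0.3])

mse = nn.MSELoss()  # reduction='mean' is the default
print(mse(predictions, targets).item())  # ~0.1167 = 0.35 / 3
```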
Properties. MSE is smooth, convex in $\hat y$, and trivially differentiable. Its gradient with respect to $\hat y$ is $2(\hat y - y)$, or simply $\hat y - y$ in the half-version. These are the cleanest gradients in machine learning, which is why MSE remains popular even where its Gaussian assumption is dubious. Its principal weakness is sensitivity to outliers: because the loss grows quadratically, a single example with residual $100$ contributes $10{,}000$ to the sum, whereas an example with residual $1$ contributes only $1$. One bad data point can dominate the entire training signal. If your data is contaminated by occasional gross errors (instrument failures, data-entry mistakes, fat-tailed phenomena), MSE will chase those outliers at the expense of fitting the bulk of the data.
Mean absolute error (regression, robust)
Mean absolute error, MAE, replaces the square with an absolute value:
$$\mathcal{L}_{\text{MAE}}(y, \hat y) = |y - \hat y|, \qquad \mathcal{L}_{\text{MAE}} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat y_i|.$$
The probabilistic interpretation is that $y - \hat y$ follows a Laplace distribution rather than a Gaussian; the Laplace has heavier tails, so the model "expects" occasional large residuals and is not surprised by them. MAE's gradient is $\pm 1$ regardless of the magnitude of the residual, so an outlier with residual $100$ contributes the same gradient as one with residual $1$. The loss is therefore robust: outliers cannot hijack training.
The cost of robustness is a non-differentiable point at $\hat y = y$. In practice, libraries use a sub-gradient of zero at that point, and modern optimisers handle it without trouble.
Worked example. Using the same predictions and targets as for MSE,
$$\mathcal{L}_{\text{MAE}} = \frac{|{-0.1}| + |0.5| + |0.3|}{3} = \frac{0.1 + 0.5 + 0.3}{3} = \frac{0.9}{3} = 0.3.$$
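Again a one-line check in PyTorch, where MAE goes by the name nn.L1Loss:

```python
import torch
import torch.nn as nn

# The same batch under MAE, which PyTorch calls nn.L1Loss.
targets     = torch.tensor([2.0, 1.0, 0.0])
predictions = torch.tensor([2.1, 0.5, -0.3])

mae = nn.L1Loss()  # reduction='mean' is the default
print(mae(predictions, targets).item())  # 0.3 = 0.9 / 3
```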
Compare the two losses on the same data. MSE penalised the residual of $0.5$ by $0.25$, twenty-five times more than the residual of $0.1$, which contributed $0.01$. MAE penalised them at $0.5$ and $0.1$ respectively, only five times more. MAE distributes the loss more evenly across examples and resists outlier dominance. Use MAE when your data has heavy-tailed errors or anomalies, or when the conditional median is a more meaningful summary than the conditional mean: minimising MAE drives predictions toward the median of the targets, while minimising MSE drives them toward the mean.
Huber loss (regression, hybrid)
Huber loss is a deliberate compromise between MSE and MAE. Given a residual $e = y - \hat y$ and a threshold $\delta > 0$,
$$\mathcal{L}_{\delta}(e) = \begin{cases} \tfrac{1}{2} e^2 & |e| \le \delta \\ \delta\bigl(|e| - \tfrac{1}{2}\delta\bigr) & |e| > \delta. \end{cases}$$
The two branches meet smoothly at $|e| = \delta$ in both value and first derivative, so the loss is everywhere differentiable. Near zero it behaves like MSE, quadratic, smooth, with vanishing gradient at the optimum. Far from zero it behaves like MAE, linear, with bounded gradient $\pm \delta$, so outliers cannot dominate. The threshold $\delta$ is a hyperparameter; common choices are $1.0$ for normalised data or a value chosen on validation.
Worked example. Suppose $\delta = 1$ and we have residuals $e_1 = 0.2$, $e_2 = 0.8$, $e_3 = 5.0$. The first two satisfy $|e| \le 1$ and use the quadratic branch:
$$\mathcal{L}_\delta(e_1) = \tfrac{1}{2}(0.2)^2 = 0.02, \qquad \mathcal{L}_\delta(e_2) = \tfrac{1}{2}(0.8)^2 = 0.32.$$
The third residual exceeds the threshold and uses the linear branch:
$$\mathcal{L}_\delta(e_3) = 1 \cdot (5.0 - 0.5) = 4.5.$$
Under pure MSE, $e_3$ would have contributed $\tfrac{1}{2}(5)^2 = 12.5$, nearly three times more. Huber loss is the default for value-function regression in deep reinforcement learning, where occasional very large temporal-difference errors would otherwise destabilise training.
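A quick check of the branch values, assuming a PyTorch version recent enough to have nn.HuberLoss (nn.SmoothL1Loss with its default settings coincides with Huber at $\delta = 1$):

```python
import torch
import torch.nn as nn

# The worked example via nn.HuberLoss; reduction='none' exposes the
# per-example values, making the quadratic and linear branches visible.
# Targets of zero make the residuals e = y - y_hat equal 0.2, 0.8, 5.0.
targets     = torch.tensor([0.0, 0.0, 0.0])
predictions = torch.tensor([-0.2, -0.8, -5.0])

huber = nn.HuberLoss(reduction='none', delta=1.0)
print(huber(predictions, targets))  # tensor([0.0200, 0.3200, 4.5000])
```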
Binary cross-entropy (binary classification)
Binary cross-entropy is the canonical loss for problems where each example belongs to one of two classes, disease present or absent, spam or not, clicked or not. The target is $y \in \{0, 1\}$ and the prediction is a probability $\hat p = \sigma(z) \in (0, 1)$, where $z$ is a real-valued logit produced by the final linear layer and $\sigma$ is the sigmoid. The loss is
$$\mathcal{L}_{\text{BCE}}(y, \hat p) = -\bigl[y \log \hat p + (1 - y) \log(1 - \hat p)\bigr].$$
Only one of the two terms survives for any given example. If $y = 1$, the loss reduces to $-\log \hat p$, which is small when $\hat p$ is near one and grows without bound as $\hat p$ approaches zero. If $y = 0$, the loss reduces to $-\log(1 - \hat p)$, the mirror image.
The probabilistic derivation is immediate. Model $y \sim \text{Bernoulli}(\hat p)$, so $p(y) = \hat p^y (1 - \hat p)^{1 - y}$. Taking the log gives $y \log \hat p + (1-y) \log(1 - \hat p)$, and negating gives the loss above.
Worked example. Suppose the true label is $y = 1$ and the network predicts $\hat p = 0.9$. The loss is
$$\mathcal{L}_{\text{BCE}} = -\log(0.9) \approx 0.1054.$$
A confident, correct prediction earns a small loss. Now suppose the network predicts $\hat p = 0.1$ for the same true label $y = 1$. The loss is
$$\mathcal{L}_{\text{BCE}} = -\log(0.1) \approx 2.3026.$$
A confident, wrong prediction earns a much larger loss, roughly twenty-two times larger. As $\hat p \to 0$ with $y = 1$, the loss diverges to $+\infty$. The network is heavily, asymptotically penalised for being confidently wrong, which is precisely what we want: confidence should be earned by being right.
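Both worked cases can be checked at once with F.binary_cross_entropy:

```python
import torch
import torch.nn.functional as F

# Both worked cases: y = 1 with p_hat = 0.9 and with p_hat = 0.1.
y     = torch.tensor([1.0, 1.0])
p_hat = torch.tensor([0.9, 0.1])

print(F.binary_cross_entropy(p_hat, y, reduction='none'))
# tensor([0.1054, 2.3026])
```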
Pairing with sigmoid output. The reason sigmoid + BCE is taught as a unit is the gradient. With $\hat p = \sigma(z) = 1 / (1 + e^{-z})$,
$$\frac{\partial \mathcal{L}_{\text{BCE}}}{\partial z} = \hat p - y.$$
The sigmoid derivative $\hat p (1 - \hat p)$, which would otherwise vanish for confidently saturated logits, cancels exactly with the $1/\hat p$ term from differentiating $\log \hat p$. The result is a clean residual: predicted probability minus target. There is no awkward $\sigma'(z)$ factor that could shrink to zero and stall learning. If we paired sigmoid with squared error instead, the gradient would carry that vanishing factor and training would crawl when the logit saturated. The pairing is mathematical convenience hardened into convention.
For numerical stability, libraries compute BCE directly from the logit $z$ rather than first forming $\hat p$:
$$\mathcal{L}_{\text{BCE}}(y, z) = \max(z, 0) - z y + \log\bigl(1 + e^{-|z|}\bigr).$$
This identity, used by F.binary_cross_entropy_with_logits in PyTorch, avoids the overflow that would arise from computing $e^{z}$ when $|z|$ is large. Always prefer the "with logits" function over computing the sigmoid first and feeding it to a separate loss.
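A minimal sketch of the failure the fused function avoids. In float32 the sigmoid of a large logit rounds to exactly one, the naive route then takes $\log(0)$, which PyTorch's BCE clamps to $-100$, and the reported loss is wrong; the fused form recovers the exact value:

```python
import torch
import torch.nn.functional as F

# sigmoid(30) rounds to exactly 1.0 in float32, so the naive route
# computes log(1 - 1) = log(0), which PyTorch clamps to -100.
z = torch.tensor([30.0])  # a very confident logit...
y = torch.tensor([0.0])   # ...for the wrong class; the true loss is ~30

naive = F.binary_cross_entropy(torch.sigmoid(z), y)
fused = F.binary_cross_entropy_with_logits(z, y)
print(naive.item(), fused.item())  # 100.0 (clamped, wrong) vs 30.0 (exact)
```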
Categorical cross-entropy (multi-class classification)
Categorical cross-entropy generalises BCE to $K$-way classification, distinguishing cats from dogs from horses, or one of a thousand ImageNet classes. The target is a one-hot vector $\mathbf{y} \in \{0, 1\}^K$ with $\sum_k y_k = 1$, and the prediction is a probability vector $\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{z}) \in \Delta^{K-1}$, where $\mathbf{z}$ is the vector of logits and the softmax is
$$\hat p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}.$$
The loss is
$$\mathcal{L}_{\text{CE}}(\mathbf{y}, \hat{\mathbf{p}}) = -\sum_{k=1}^{K} y_k \log \hat p_k.$$
Because the target is one-hot, only one term in the sum survives. If the true class is $c$, then $y_c = 1$ and all other $y_k = 0$, so the loss collapses to
$$\mathcal{L}_{\text{CE}} = -\log \hat p_c.$$
Categorical cross-entropy is binary cross-entropy's generalisation: it cares only about the predicted probability of the correct class. The probabilistic derivation: model $y \sim \text{Categorical}(\hat{\mathbf{p}})$, so $p(y = c) = \hat p_c$, and the negative log-likelihood is $-\log \hat p_c$, the formula above.
Worked example. Three classes, true class $c = 1$, so $\mathbf{y} = (1, 0, 0)$. Suppose the predicted distribution is $\hat{\mathbf{p}} = (0.8, 0.15, 0.05)$. The loss is
$$\mathcal{L}_{\text{CE}} = -\log(0.8) \approx 0.2231.$$
Now suppose the network is uncertain and assigns $\hat{\mathbf{p}} = (0.1, 0.45, 0.45)$ for the same true label. The loss jumps to
$$\mathcal{L}_{\text{CE}} = -\log(0.1) \approx 2.3026,$$
ten times larger. The penalty depends only on the probability mass placed on the true class; how the remaining mass is distributed across the wrong classes is irrelevant. This is sometimes counter-intuitive (being slightly wrong about which wrong class is most likely makes no difference), but it is the correct behaviour given the maximum-likelihood derivation.
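Both cases can be checked with nn.CrossEntropyLoss. The module expects raw logits, so the sketch feeds it $\log \hat{\mathbf{p}}$, which is one valid choice of logits because the softmax of $\log \hat{\mathbf{p}}$ returns $\hat{\mathbf{p}}$ whenever $\hat{\mathbf{p}}$ sums to one:

```python
import torch
import torch.nn as nn

# nn.CrossEntropyLoss expects raw logits; log(p_hat) is one valid set of
# logits here because softmax(log(p)) = p whenever p sums to one.
p_hat = torch.tensor([[0.80, 0.15, 0.05],
                      [0.10, 0.45, 0.45]])
logits = p_hat.log()
true_class = torch.tensor([0, 0])  # the text's class c = 1 is index 0 here

ce = nn.CrossEntropyLoss(reduction='none')
print(ce(logits, true_class))  # tensor([0.2231, 2.3026])
```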
Pairing with softmax output. As with sigmoid + BCE, the elegance of softmax + CE is in the gradient. With $\hat{\mathbf{p}} = \mathrm{softmax}(\mathbf{z})$, one can show
$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z_i} = \hat p_i - y_i.$$
The full vector form is $\nabla_{\mathbf{z}} \mathcal{L}_{\text{CE}} = \hat{\mathbf{p}} - \mathbf{y}$. The derivation, sketched: with $\partial \hat p_k / \partial z_i = \hat p_k (\delta_{ki} - \hat p_i)$ where $\delta_{ki}$ is the Kronecker delta,
$$\frac{\partial \mathcal{L}_{\text{CE}}}{\partial z_i} = -\sum_k \frac{y_k}{\hat p_k} \cdot \hat p_k (\delta_{ki} - \hat p_i) = -y_i + \hat p_i \sum_k y_k = \hat p_i - y_i,$$
using $\sum_k y_k = 1$. The same cancellation as the binary case: the softmax derivative cancels exactly with the $1/\hat p_k$ from differentiating the log. The gradient is a clean, bounded vector pointing from the predicted distribution toward the one-hot target. As with BCE, modern libraries combine softmax and CE into a single numerically stable function: nn.CrossEntropyLoss in PyTorch takes raw logits, not softmaxed probabilities, to avoid overflow when logits are large. The combination softmax + CE is the universal default for multi-class single-label classification.
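The identity is easy to confirm with autograd; a minimal sketch with illustrative logits and a one-hot target:

```python
import torch

# Autograd check of grad_z L = p_hat - y for softmax + CE.
z = torch.tensor([1.0, -0.5, 0.2], requires_grad=True)
y = torch.tensor([1.0, 0.0, 0.0])  # one-hot target, true class first

p_hat = torch.softmax(z, dim=0)
loss = -(y * p_hat.log()).sum()
loss.backward()

print(z.grad)                # gradient found by backpropagation
print((p_hat - y).detach())  # the closed form; the two match
```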
Hinge loss (SVM-style classification)
Hinge loss originates in support vector machines. With a binary target $y \in \{-1, +1\}$ (note the sign convention: $\{-1, +1\}$, not $\{0, 1\}$) and a real-valued score $\hat y$,
$$\mathcal{L}_{\text{hinge}}(y, \hat y) = \max(0, 1 - y \hat y).$$
If $y \hat y \ge 1$, the example is classified correctly with a margin of at least one and the loss is zero: the optimiser ignores it. If $y \hat y < 1$, the loss is $1 - y \hat y$ and grows linearly. Hinge loss is therefore a margin loss: it pushes correct predictions to be confident, but stops pushing once they are confident enough.
Worked example. With $y = +1$ and $\hat y = 1.5$, $y \hat y = 1.5 \ge 1$, so $\mathcal{L} = 0$. With $y = +1$ and $\hat y = 0.3$, $y \hat y = 0.3$, so $\mathcal{L} = 1 - 0.3 = 0.7$. With $y = +1$ and $\hat y = -0.5$, the prediction is on the wrong side of the boundary and $\mathcal{L} = 1 - (-0.5) = 1.5$.
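The three cases, vectorised in PyTorch:

```python
import torch

# The three worked cases at once: loss = max(0, 1 - y * y_hat).
y     = torch.tensor([1.0, 1.0, 1.0])
y_hat = torch.tensor([1.5, 0.3, -0.5])

print(torch.clamp(1 - y * y_hat, min=0))  # tensor([0.0000, 0.7000, 1.5000])
```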
The (sub)gradient is $-y$ when $y \hat y < 1$ and $0$ otherwise, so only currently-misclassified or low-margin examples contribute to weight updates, a sparseness property prized in classical SVMs. Hinge loss is rarely the default for deep nets but appears in structured prediction, certain ranking problems, and as a regulariser. It is not a probabilistic loss: there is no maximum-likelihood derivation, which means it does not produce calibrated probabilities.
Other losses you'll encounter
KL divergence. $D_{\text{KL}}(\mathbf{q} \,\|\, \hat{\mathbf{p}}) = \sum_k q_k \log(q_k / \hat p_k)$ measures how one distribution differs from another. It equals cross-entropy minus the entropy of $\mathbf{q}$, and since the entropy term does not depend on $\theta$, minimising KL is equivalent to minimising cross-entropy. KL appears explicitly in distillation (matching a student to a teacher distribution), label smoothing, and the variational lower bound of VAEs.
InfoNCE / contrastive loss. $\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(s(\mathbf{u}, \mathbf{v}^+) / \tau)}{\sum_j \exp(s(\mathbf{u}, \mathbf{v}_j) / \tau)}$, where $\mathbf{v}^+$ is the positive pair, the sum runs over the positive plus a batch of negatives, $s$ is a similarity (cosine or dot product), and $\tau$ is a temperature. It is just cross-entropy with a softmax denominator built from negative samples; it pulls the anchor toward its positive and pushes away from its negatives. Used in self-supervised learning (SimCLR, CLIP).
Triplet loss. $\mathcal{L} = \max(0, \|\mathbf{a} - \mathbf{p}\|^2 - \|\mathbf{a} - \mathbf{n}\|^2 + \alpha)$ for an anchor $\mathbf{a}$, positive $\mathbf{p}$, and negative $\mathbf{n}$, with margin $\alpha$. Used in face-recognition embeddings such as FaceNet.
Focal loss. $-(1 - \hat p_y)^\gamma \log \hat p_y$ where $\hat p_y$ is the predicted probability of the true class and $\gamma > 0$. The $(1 - \hat p_y)^\gamma$ factor downweights easy examples (where $\hat p_y$ is already near one) so the model focuses on hard examples; central to dense object detection (RetinaNet) where most candidate boxes are background (a minimal sketch follows these entries).
Wasserstein / earth-mover. Used in WGANs as a substitute for the Jensen-Shannon divergence implicit in the original GAN formulation; produces meaningful gradients even when generator and target distributions have non-overlapping supports.
Custom multi-task losses. A weighted sum $\sum_t w_t \mathcal{L}_t$ over per-task losses, with weights $w_t$ tuned manually or learned (e.g. by uncertainty weighting). Multi-task models routinely combine cross-entropy for classification, MSE for regression, and reconstruction terms for self-supervision in a single composite loss.
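Of these, focal loss is the one most often written by hand. A minimal sketch for the multi-class case, assuming logits and integer class targets; focal_loss here is an illustrative helper, not a library function:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Illustrative multi-class focal loss: CE downweighted by (1 - p_y)^gamma.

    logits: (N, K) raw scores; targets: (N,) integer class indices.
    A sketch, not a library function.
    """
    log_p = F.log_softmax(logits, dim=-1)
    log_p_y = log_p.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    p_y = log_p_y.exp()  # predicted probability of the true class
    return (-((1 - p_y) ** gamma) * log_p_y).mean()

# An easy example contributes far less than under plain cross-entropy.
logits = torch.tensor([[4.0, 0.0, 0.0]])
print(focal_loss(logits, torch.tensor([0])).item())
```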
Choosing a loss
The decision is largely determined by the data type of the target and the assumptions you are willing to make about the noise:
- Regression with continuous target, roughly Gaussian errors. Use mean squared error. Cleanest gradients, simplest analysis, near-universal default.
- Regression with outliers or heavy-tailed errors. Use mean absolute error or Huber loss. Huber is usually the better engineering compromise: differentiable and outlier-robust.
- Binary classification. Use binary cross-entropy with a sigmoid output, computed via the "with logits" variant for numerical stability.
- Multi-class single-label classification. Use categorical cross-entropy with a softmax output, again via the combined logits-aware function. This is the universal default for ImageNet-style problems.
- Multi-class multi-label classification (each class is an independent yes/no, e.g. tagging an image with multiple objects). Use a sum of per-class binary cross-entropies with a sigmoid output per class. Do not use softmax: it forces the probabilities to sum to one, which is wrong when multiple classes can be simultaneously present (see the sketch after this list).
- Class imbalance. Use focal loss, or class-weighted cross-entropy, or oversample the minority class. Plain cross-entropy on a 1:1000 imbalance will often collapse to the constant majority prediction.
- Self-supervised or metric learning. Use a contrastive loss (InfoNCE) or a triplet loss with a margin.
- Generative modelling. The loss is dictated by the model: KL plus reconstruction for VAEs, exact log-likelihood for autoregressive models and normalising flows, denoising score-matching for diffusion, Wasserstein or Jensen-Shannon for GAN variants.
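To make the multi-label bullet concrete, the sketch promised above uses nn.BCEWithLogitsLoss, which applies the per-class sigmoid internally; shapes and values are illustrative:

```python
import torch
import torch.nn as nn

# Multi-label: targets are multi-hot, one independent yes/no per class.
logits  = torch.tensor([[2.0, -1.0, 0.5]])  # one example, three classes
targets = torch.tensor([[1.0,  0.0, 1.0]])  # classes 0 and 2 both present

loss = nn.BCEWithLogitsLoss()  # per-class sigmoid + BCE, averaged
print(loss(logits, targets).item())
```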
If in doubt, start from the maximum-likelihood principle. Write down what you believe about the distribution of the target conditional on the input, take the negative log-likelihood, and the loss falls out. This is more reliable than picking a loss from a list.
What you should take away
- A loss function is a single non-negative number that summarises how wrong a prediction is on one example or batch; training is the act of nudging the parameters to make this number smaller.
- Almost every standard loss is the negative log-likelihood of a probabilistic model, Gaussian gives squared error, Bernoulli gives binary cross-entropy, categorical gives cross-entropy.
- Output activation and loss are paired for clean gradients. Sigmoid + binary cross-entropy and softmax + categorical cross-entropy both yield the same simple form $\hat{\mathbf{p}} - \mathbf{y}$, with no vanishing activation derivative to stall learning.
- Mean squared error is sensitive to outliers because the penalty grows quadratically; mean absolute error and Huber loss are robust alternatives with bounded gradients ($\pm 1$ for MAE, $\pm\delta$ for Huber).
- Multi-class single-label classification universally uses softmax + cross-entropy on logits; multi-label classification uses per-class sigmoid + binary cross-entropy. Mixing these up produces silently wrong models.
- The shape of the loss landscape (flat regions, sharp valleys, saddles) determines what training feels like in practice. Loss choice, output activation, parameterisation, and initialisation all conspire to make that landscape easier or harder to traverse.