A loss function (also called a cost function, objective function, or error function) quantifies the discrepancy between a model's predictions and the desired outputs. Training a machine learning model is almost always framed as minimising a loss function with respect to the model's parameters. The choice of loss function encodes what we mean by "good" predictions and has profound consequences for the learned model: it shapes the geometry of the optimisation landscape, determines which kinds of errors are penalised most heavily, and connects the practical task of fitting a model to the statistical task of inferring parameters from data.
Mathematical formulation
Given a model $f_\theta$ with parameters $\theta$, training data $\{(x_i, y_i)\}_{i=1}^n$, and a per-example loss $\ell$, the empirical risk is
$$\mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(f_\theta(x_i), y_i),$$
and training proceeds by finding $\hat\theta = \arg\min_\theta \mathcal{L}(\theta)$, typically using stochastic gradient descent or a variant such as Adam.
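As a minimal sketch of this procedure, the following pure-Python example minimises the empirical risk of a one-dimensional linear model with plain stochastic gradient descent. The linear model, the toy data, and the names `empirical_risk`, `sgd`, `lr`, and `epochs` are illustrative assumptions rather than part of any particular library.

```python
import random

def squared_error(y_hat, y):
    """Per-example loss l(y_hat, y) = (y - y_hat)^2."""
    return (y - y_hat) ** 2

def empirical_risk(theta, data, loss=squared_error):
    """Mean per-example loss of the linear model f_theta(x) = w * x + b."""
    w, b = theta
    return sum(loss(w * x + b, y) for x, y in data) / len(data)

def sgd(data, lr=0.05, epochs=200, seed=0):
    """Plain stochastic gradient descent, one example per update."""
    rng = random.Random(seed)
    examples = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            residual = (w * x + b) - y
            # Gradient of (y - (w*x + b))^2 with respect to w and b.
            w -= lr * 2.0 * residual * x
            b -= lr * 2.0 * residual
    return w, b

# Noise-free toy data generated from y = 3x + 1 (an assumption for illustration).
data = [(k / 10, 3.0 * (k / 10) + 1.0) for k in range(-10, 11)]
theta_hat = sgd(data)
print(theta_hat, empirical_risk(theta_hat, data))
```

Because the toy data are noise-free, the recovered parameters should approach the generating values and the empirical risk should approach zero; with noisy data the minimum risk would be strictly positive.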
Common losses for regression
For regression tasks, the squared error $\ell(\hat y, y) = (y - \hat y)^2$, averaged over the data to give the mean squared error (MSE), is the workhorse. It penalises large mistakes quadratically and corresponds to maximum likelihood under the assumption that observations are corrupted by zero-mean Gaussian noise of constant variance; its minimiser estimates the mean of the conditional distribution. The absolute error $\ell(\hat y, y) = |y - \hat y|$, whose average is the mean absolute error (MAE), is more robust to outliers and corresponds to maximum likelihood under a Laplace noise model; its minimiser estimates the median rather than the mean of the conditional distribution. The Huber loss combines the two (quadratic near zero, linear in the tails) and is widely used in robust regression and reinforcement learning, where it limits the influence of very large TD errors.
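The mean-versus-median behaviour can be checked numerically. The sketch below searches a grid of constant predictions for the one that minimises each loss on a small sample containing one outlier. The sample, the grid, the helper `best_constant`, and the Huber threshold name `delta` are assumptions chosen for illustration; the Huber loss uses the common convention with a factor of 1/2 on the quadratic branch.

```python
def squared_error(y_hat, y):
    return (y - y_hat) ** 2

def absolute_error(y_hat, y):
    return abs(y - y_hat)

def huber(y_hat, y, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def best_constant(targets, loss, candidates):
    """Constant prediction with the smallest total loss over the targets."""
    return min(candidates, key=lambda c: sum(loss(c, y) for y in targets))

targets = [1.0, 2.0, 3.0, 4.0, 100.0]          # one large outlier
candidates = [k / 10 for k in range(0, 1001)]  # grid 0.0, 0.1, ..., 100.0

best_sq = best_constant(targets, squared_error, candidates)
best_abs = best_constant(targets, absolute_error, candidates)
print(best_sq)   # 22.0, the sample mean, pulled towards the outlier
print(best_abs)  # 3.0, the sample median, unaffected by the outlier
```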
Common losses for classification
For classification, cross-entropy is the loss of choice:
$$\ell(\hat p, y) = -\sum_k y_k \log \hat p_k,$$
where $y$ is the one-hot true label and $\hat p$ the predicted distribution (typically the output of a softmax). Cross-entropy corresponds to maximum likelihood under a Bernoulli or categorical output model and is equivalent to minimising the Kullback–Leibler divergence between the empirical and predicted distributions, since the two differ only by the entropy of the empirical distribution, which does not depend on the parameters. Hinge loss, $\ell(\hat y, y) = \max(0, 1 - y \hat y)$ with labels $y \in \{-1, +1\}$ and $\hat y$ the real-valued classifier score, is used by support vector machines and encourages large-margin classifiers. Focal loss, introduced by Lin et al. (2017) for dense object detection, down-weights well-classified examples and addresses extreme class imbalance.
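A minimal pure-Python sketch of these three losses follows. The function names are illustrative, the label is passed as a class index rather than a one-hot vector, and the focal loss omits the optional class-balancing weight; the default `gamma=2.0` is an assumed, commonly used setting. No clipping is applied, so a predicted probability of exactly zero for the true class would raise an error.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p_hat, label):
    """-sum_k y_k log p_hat_k for a one-hot y with y[label] = 1."""
    return -math.log(p_hat[label])

def hinge(score, y):
    """max(0, 1 - y * score) with y in {-1, +1} and a real-valued score."""
    return max(0.0, 1.0 - y * score)

def focal(p_hat, label, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which shrinks easy examples."""
    p_t = p_hat[label]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

p_easy = softmax([4.0, 0.0, 0.0])   # confident and correct for class 0
p_hard = softmax([0.0, 1.0, 0.0])   # true class 0 receives little mass
print(cross_entropy(p_easy, 0), focal(p_easy, 0))
print(cross_entropy(p_hard, 0), focal(p_hard, 0))
print(hinge(2.0, +1), hinge(0.5, -1))
```

With `gamma=0` the focal loss reduces to cross-entropy; larger values shrink the contribution of examples the model already classifies confidently.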
Specialised losses
Modern deep learning has introduced a zoo of task-specific losses: triplet loss and contrastive loss for metric learning; CTC loss for sequence transduction without explicit alignment; perceptual loss comparing deep-network features rather than pixels; adversarial loss in GANs, where the discriminator's output supplies the gradient; and InfoNCE for self-supervised representation learning.
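Two of these losses are simple enough to sketch in a few lines. The example below assumes Euclidean distance for the triplet loss and precomputed similarity scores for InfoNCE; the function names, the `margin` default, and the `temperature` default are illustrative assumptions, and published variants differ in such details (for example, some use squared distances).

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Push the negative at least `margin` further from the anchor than the positive."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """Cross-entropy of picking the positive among the positive and the negatives."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)
    # Stable log-sum-exp over all candidates.
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[0]

print(triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 4.0]))  # 0.0: negative is far enough
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 0.0]))  # 6.0: positive is further than negative
print(info_nce(1.0, [0.0, 0.0]))
```

Written this way, InfoNCE is a softmax cross-entropy in which the "classes" are the candidate samples and the positive is the correct class.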
Loss versus metric
A subtle but important distinction separates the loss the optimiser minimises from the evaluation metric the practitioner cares about. A model might be trained with cross-entropy but evaluated with F1, AUC, or accuracy. When these diverge, particularly under class imbalance, the choice of training loss may need adjustment (class weighting, focal loss, or surrogate losses tuned to the metric) to produce a model that performs well on the metric that matters. Many metrics (accuracy, F1, BLEU) are non-differentiable and cannot be optimised directly, which is why differentiable surrogates such as cross-entropy dominate practice.
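The gap between loss and metric can be seen on a toy binary problem. In the sketch below, two sets of predicted probabilities achieve identical accuracy but different cross-entropy, and a simple positive-class weight changes the loss without touching the metric. The data, the 0.5 decision threshold, and the parameter name `pos_weight` are assumptions made for illustration.

```python
import math

def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)

def binary_cross_entropy(probs, labels, pos_weight=1.0):
    """Mean binary cross-entropy; pos_weight scales the positive-class term."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 1, 0, 0]
confident = [0.9, 0.8, 0.2, 0.1]
hesitant = [0.6, 0.55, 0.45, 0.4]

# Same accuracy, different loss: accuracy only sees which side of the
# threshold each probability falls on.
print(accuracy(confident, labels), binary_cross_entropy(confident, labels))
print(accuracy(hesitant, labels), binary_cross_entropy(hesitant, labels))

# Up-weighting positives raises the loss on positive examples only.
print(binary_cross_entropy(hesitant, labels, pos_weight=2.0))
```

Small changes to the probabilities alter the cross-entropy smoothly while accuracy stays fixed until a prediction crosses the threshold, which is why the loss provides a usable gradient and the metric does not.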
Related terms: Cross-Entropy Loss, Mean Squared Error, Gradient Descent, Regularisation, Overfitting
Discussed in:
- Chapter 6: ML Fundamentals, Loss Functions