Also known as: L2 loss, squared loss, quadratic loss
Mean squared error (MSE) is the standard regression loss:
$$L_{\mathrm{MSE}} = \frac{1}{N} \sum_{n=1}^N (y_n - \hat y_n)^2$$
where $y_n$ is the target and $\hat y_n$ the prediction. The factor of $\tfrac{1}{N}$ averages over the dataset; some formulations use $\tfrac{1}{2N}$ instead, so that the factor of 2 cancels in the gradient.
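As a minimal sketch, the formula translates directly to NumPy (the function name and toy data here are illustrative):

```python
import numpy as np

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error, averaged over the dataset."""
    return float(np.mean((y_true - y_pred) ** 2))

y = np.array([3.0, -0.5, 2.0, 7.0])     # targets
y_hat = np.array([2.5, 0.0, 2.0, 8.0])  # predictions
print(mse(y, y_hat))                    # 0.375
```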
MSE is the maximum likelihood loss for Gaussian noise: if $y_n = f(x_n; \theta) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, then maximum likelihood estimation of $\theta$ is equivalent to minimising MSE. This justifies MSE as a principled choice whenever the noise is approximately Gaussian.
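Concretely, the negative log-likelihood of the dataset under this model is
$$-\log p(\mathbf{y} \mid \theta) = \frac{1}{2\sigma^2} \sum_{n=1}^N \bigl(y_n - f(x_n; \theta)\bigr)^2 + \frac{N}{2} \log(2\pi\sigma^2),$$
and since $\sigma$ does not depend on $\theta$, minimising this over $\theta$ is exactly minimising the sum of squared errors.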
The gradient with respect to predictions is
$$\frac{\partial L_{\mathrm{MSE}}}{\partial \hat y_n} = \frac{2}{N} (\hat y_n - y_n)$$
i.e. the prediction error itself, scaled by $\tfrac{2}{N}$. This simple gradient is one of the reasons MSE is so widely used.
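One way to sanity-check this formula is against central finite differences; a small sketch (toy values assumed):

```python
import numpy as np

y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
N = len(y)

analytic = 2.0 / N * (y_hat - y)  # the gradient formula above

# central finite differences on the loss as a numerical check
eps = 1e-6
numeric = np.zeros(N)
for n in range(N):
    up, down = y_hat.copy(), y_hat.copy()
    up[n] += eps
    down[n] -= eps
    numeric[n] = (np.mean((y - up) ** 2) - np.mean((y - down) ** 2)) / (2 * eps)

print(np.allclose(analytic, numeric))  # True
```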
MSE has well-known weaknesses:
- Sensitivity to outliers: squaring magnifies large errors, so a few outliers can dominate the gradient. Mean absolute error (MAE / L1), $\frac{1}{N} \sum_n |y_n - \hat y_n|$, is more robust but is not differentiable at zero. Huber loss combines MSE near zero with MAE for large errors and is the standard robust regression loss; see the sketch after this list.
- Equal weighting across magnitudes: for targets spanning many orders of magnitude (e.g. drug concentrations, prices, populations), MSE is dominated by the largest values. Mean squared log error, or modelling $\log y$ rather than $y$, are common remedies.
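A sketch of how the three losses respond to a single outlier (the data and $\delta$ are illustrative):

```python
import numpy as np

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond."""
    r = y_true - y_pred
    return np.mean(np.where(np.abs(r) <= delta,
                            0.5 * r ** 2,
                            delta * (np.abs(r) - 0.5 * delta)))

y = np.array([1.0, 2.0, 3.0, 100.0])    # last target is an outlier
y_hat = np.array([1.1, 2.1, 2.9, 3.0])

print(np.mean((y - y_hat) ** 2))  # MSE: dominated by the squared outlier
print(mae(y, y_hat))              # MAE: grows only linearly in the outlier
print(huber(y, y_hat))            # Huber: quadratic near zero, linear in the tail
```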
Variants:
- Root mean squared error (RMSE) is $\sqrt{\mathrm{MSE}}$; the square root restores the units of $y$, making it directly interpretable, and since it is monotonic, RMSE ranks models identically to MSE.
- R² (coefficient of determination) compares MSE to the variance of $y$, $R^2 = 1 - \mathrm{MSE}/\mathrm{Var}(y)$, giving a normalised goodness of fit.
- Quantile (pinball) loss trains models to predict specific quantiles rather than the mean; see the sketch after this list.
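Minimal sketches of these variants, assuming NumPy arrays (the quantile $q$ is an illustrative parameter):

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def quantile_loss(y_true, y_pred, q=0.9):
    """Pinball loss: asymmetric penalty targeting the q-th quantile."""
    r = y_true - y_pred
    return float(np.mean(np.maximum(q * r, (q - 1) * r)))
```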
Related terms: Cross-Entropy Loss, Maximum Likelihood Estimation
Discussed in:
- Chapter 7: Supervised Learning, Loss Functions