The hinge loss for binary classification with $y \in \{-1, +1\}$ and prediction $\hat y \in \mathbb{R}$ is
$$L_\mathrm{hinge}(y, \hat y) = \max(0, 1 - y \hat y)$$
The loss is zero when $y \hat y \geq 1$ (correct classification with margin $\geq 1$); otherwise it grows linearly with the margin violation $1 - y \hat y$.
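As a quick illustration, a vectorised NumPy version of this formula (a minimal sketch; the function name is ours, not from any library):

```python
import numpy as np

def hinge_loss(y, y_hat):
    # Elementwise hinge loss: y in {-1, +1}, y_hat a real-valued score.
    return np.maximum(0.0, 1.0 - y * y_hat)

# hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.3, -0.5]))
# -> array([0. , 0.7, 0.5])   # zero with margin; linear in the violation
```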
Hinge loss is the classification loss for the support vector machine. The SVM objective combines hinge loss on training examples with an L2 regulariser:
$$\min_w \frac{1}{N} \sum_n \max(0, 1 - y_n (w^\top x_n + b)) + \frac{\lambda}{2} \|w\|^2$$
This is a convex (but non-smooth) optimisation problem; via the Lagrangian it is equivalent to the soft-margin SVM dual.
Properties:
- Convex, global optimisation tractable.
- Non-differentiable at the kink $y \hat y = 1$. Sub-gradient methods (or smoothed approximations) handle this; see the training sketch after this list.
- Zero gradient on examples classified correctly with margin; only margin-violating examples contribute. This gradient sparsity is the source of "support vectors".
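To make the sub-gradient point concrete, here is a minimal sketch of one sub-gradient step on the regularised objective above (the function name and the `lam`, `lr` defaults are assumptions, not values from the text):

```python
import numpy as np

def svm_subgradient_step(w, b, X, y, lam=0.01, lr=0.1):
    # One subgradient step on (1/N) sum_n max(0, 1 - y_n (w.x_n + b)) + (lam/2) ||w||^2.
    margins = y * (X @ w + b)
    v = margins < 1                          # only margin-violating examples contribute
    grad_w = -(y[v][:, None] * X[v]).sum(axis=0) / len(y) + lam * w
    grad_b = -y[v].sum() / len(y)
    return w - lr * grad_w, b - lr * grad_b
```

At the kink $y \hat y = 1$ this picks the zero sub-gradient: the example is not counted as violating.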
Multiclass hinge loss: for true class $y$ among $K$ classes with scores $\hat y_1, \ldots, \hat y_K$, there are two standard variants. The Weston-Watkins form sums over all violating classes,
$$L = \sum_{k \neq y} \max(0, 1 - (\hat y_y - \hat y_k)),$$
while the Crammer-Singer (2001) form penalises only the worst violation,
$$L = \max(0, 1 - \hat y_y + \max_{k \neq y} \hat y_k).$$
The two are not equal in general; they coincide only when at most one class violates the margin.
Used in some structured prediction tasks and as an alternative to softmax cross-entropy.
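The two variants are easy to compare side by side; a sketch for a single example (function name ours):

```python
import numpy as np

def multiclass_hinge(scores, y):
    # scores: (K,) array of class scores; y: index of the true class.
    m = np.maximum(0.0, 1.0 - (scores[y] - scores))
    m[y] = 0.0                        # the true class never counts as a violation
    return m.sum(), m.max()           # (Weston-Watkins, Crammer-Singer)
```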
Squared hinge loss $\max(0, 1 - y \hat y)^2$ is differentiable everywhere and gives a smoother optimisation landscape; it is sometimes preferred over the standard hinge.
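The gradient of the squared hinge is continuous at the kink, which is the point of the smoothing; a sketch:

```python
import numpy as np

def squared_hinge(y, y_hat):
    # Returns the loss and its gradient w.r.t. y_hat; the gradient
    # -2*y*h goes to zero continuously as the margin approaches 1.
    h = np.maximum(0.0, 1.0 - y * y_hat)
    return h**2, -2.0 * y * h
```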
Comparison to logistic loss:
- Logistic loss is smooth and yields calibrated probabilities, but its gradient never reaches exactly zero, even on confidently correct examples.
- Hinge has zero gradient on confident correct predictions, leading to sparse support-vector solutions; the sketch below compares the two gradients.
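The gradient contrast is easy to see numerically (a sketch, written in terms of the margin $m = y \hat y$):

```python
import numpy as np

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0, 5.0])    # m = y * y_hat
hinge_grad = np.where(margins < 1, -1.0, 0.0)          # d/dm of max(0, 1 - m)
logistic_grad = -1.0 / (1.0 + np.exp(margins))         # d/dm of log(1 + e^{-m})
# hinge_grad is exactly zero for m >= 1; logistic_grad only decays toward zero
```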
Hinge loss survives in modern AI in:
- SVMs for tabular and text classification (still widely used).
- Triplet loss (a generalisation): $\max(0, m + d(a, p) - d(a, n))$ for anchor-positive-negative triplets in metric learning; a minimal sketch follows this list.
- Margin-based knowledge graph embeddings (TransE, RotatE).
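A minimal sketch of the triplet loss with Euclidean distance as $d$ (the margin default `m=1.0` is an assumption, not from the text):

```python
import numpy as np

def triplet_loss(a, p, n, m=1.0):
    # a, p, n: anchor, positive, negative embedding vectors.
    d_ap = np.linalg.norm(a - p)      # anchor-positive distance
    d_an = np.linalg.norm(a - n)      # anchor-negative distance
    return max(0.0, m + d_ap - d_an)
```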
Related terms: Support Vector Machine, SVM (mathematical detail), Cross-Entropy Loss, Triplet Loss
Discussed in:
- Chapter 7: Supervised Learning, Loss Functions