The hinge loss for binary classification with $y \in \{-1, +1\}$ and prediction $\hat y \in \mathbb{R}$ is
$$L_\mathrm{hinge}(y, \hat y) = \max(0, 1 - y \hat y)$$
The loss is zero when $y \hat y \geq 1$ (correct classification with margin $\geq 1$); otherwise it grows linearly with the margin violation $1 - y \hat y$.
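As a quick illustration, a vectorised NumPy version of this formula (a minimal sketch; the function name is ours, not from any library):

```python
import numpy as np

def hinge_loss(y, y_hat):
    # Elementwise hinge loss: y in {-1, +1}, y_hat a real-valued score.
    return np.maximum(0.0, 1.0 - y * y_hat)

# hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.3, -0.5]))
# -> array([0. , 0.7, 0.5])   # zero with margin; linear in the violation
```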
Hinge loss is the classification loss for the support vector machine. The SVM objective combines hinge loss on training examples with an L2 regulariser:
$$\min_w \frac{1}{N} \sum_n \max(0, 1 - y_n (w^\top x_n + b)) + \frac{\lambda}{2} \|w\|^2$$
This is a convex (but non-smooth) optimisation problem; via the Lagrangian it is equivalent to the soft-margin SVM dual.
Properties:
- Convex, global optimisation tractable.
- Non-differentiable at the kink $y \hat y = 1$. Sub-gradient methods (or smoothed approximations) handle this; see the training sketch after this list.
- Zero gradient on examples classified correctly with margin; only margin-violating examples contribute. This gradient sparsity is the source of "support vectors".
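To make the sub-gradient point concrete, here is a minimal sketch of one sub-gradient step on the regularised objective above (the function name and the `lam`, `lr` defaults are assumptions, not values from the text):

```python
import numpy as np

def svm_subgradient_step(w, b, X, y, lam=0.01, lr=0.1):
    # One subgradient step on (1/N) sum_n max(0, 1 - y_n (w.x_n + b)) + (lam/2) ||w||^2.
    margins = y * (X @ w + b)
    v = margins < 1                          # only margin-violating examples contribute
    grad_w = -(y[v][:, None] * X[v]).sum(axis=0) / len(y) + lam * w
    grad_b = -y[v].sum() / len(y)
    return w - lr * grad_w, b - lr * grad_b
```

At the kink $y \hat y = 1$ this picks the zero sub-gradient: the example is not counted as violating.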
Multiclass hinge loss: for true class $y$ among $K$ classes with scores $\hat y_1, \ldots, \hat y_K$, there are two standard variants. The Weston-Watkins form sums over all violating classes,
$$L = \sum_{k \neq y} \max(0, 1 - (\hat y_y - \hat y_k)),$$
while the Crammer-Singer (2001) form penalises only the worst violation,
$$L = \max(0, 1 - \hat y_y + \max_{k \neq y} \hat y_k).$$
The two are not equal in general; they coincide only when at most one class violates the margin.
Used in some structured prediction tasks and as an alternative to softmax cross-entropy.
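The two variants are easy to compare side by side; a sketch for a single example (function name ours):

```python
import numpy as np

def multiclass_hinge(scores, y):
    # scores: (K,) array of class scores; y: index of the true class.
    m = np.maximum(0.0, 1.0 - (scores[y] - scores))
    m[y] = 0.0                        # the true class never counts as a violation
    return m.sum(), m.max()           # (Weston-Watkins, Crammer-Singer)
```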
Squared hinge loss $\max(0, 1 - y \hat y)^2$ is differentiable everywhere and gives a smoother optimisation landscape; it is sometimes preferred over the standard hinge.
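The gradient of the squared hinge is continuous at the kink, which is the point of the smoothing; a sketch:

```python
import numpy as np

def squared_hinge(y, y_hat):
    # Returns the loss and its gradient w.r.t. y_hat; the gradient
    # -2*y*h goes to zero continuously as the margin approaches 1.
    h = np.maximum(0.0, 1.0 - y * y_hat)
    return h**2, -2.0 * y * h
```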
Comparison to logistic loss:
- Logistic loss is smooth and yields calibrated probabilities, but its gradient never reaches exactly zero, even on confidently correct examples.
- Hinge has zero gradient on confident correct predictions, leading to sparse support-vector solutions; the sketch below compares the two gradients.
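The gradient contrast is easy to see numerically (a sketch, written in terms of the margin $m = y \hat y$):

```python
import numpy as np

margins = np.array([-1.0, 0.0, 0.5, 1.0, 2.0, 5.0])    # m = y * y_hat
hinge_grad = np.where(margins < 1, -1.0, 0.0)          # d/dm of max(0, 1 - m)
logistic_grad = -1.0 / (1.0 + np.exp(margins))         # d/dm of log(1 + e^{-m})
# hinge_grad is exactly zero for m >= 1; logistic_grad only decays toward zero
```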
Hinge loss survives in modern AI in:
- SVMs for tabular and text classification (still widely used).
- Triplet loss (a generalisation): $\max(0, m + d(a, p) - d(a, n))$ for anchor-positive-negative triplets in metric learning; a minimal sketch follows this list.
- Margin-based knowledge graph embeddings (TransE, RotatE).
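A minimal sketch of the triplet loss with Euclidean distance as $d$ (the margin default `m=1.0` is an assumption, not from the text):

```python
import numpy as np

def triplet_loss(a, p, n, m=1.0):
    # a, p, n: anchor, positive, negative embedding vectors.
    d_ap = np.linalg.norm(a - p)      # anchor-positive distance
    d_an = np.linalg.norm(a - n)      # anchor-negative distance
    return max(0.0, m + d_ap - d_an)
```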
Related terms: Support Vector Machine, SVM (mathematical detail), Cross-Entropy Loss, Triplet Loss
Discussed in:
- Chapter 7: Supervised Learning, Loss Functions