Glossary

SVM (mathematical detail)

The hard-margin SVM for linearly separable data $\{(x_n, y_n)\}_{n=1}^N$ with $y_n \in \{\pm 1\}$ solves

$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_n(w^\top x_n + b) \geq 1 \ \forall n$$

The constraints require every point to be correctly classified and to lie at distance at least $1/\|w\|$ from the decision boundary; that is, the geometric margin is $1/\|w\|$. Maximising the margin is therefore equivalent to minimising $\|w\|^2$.
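
To make the QP concrete, here is a minimal sketch solving the primal with cvxpy; the toy two-cluster data and all variable names are illustrative, not from the source:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: two clusters in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: minimise (1/2)||w||^2 s.t. y_n (w^T x_n + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value, "margin =", 1 / np.linalg.norm(w.value))
```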

Introducing Lagrange multipliers $\alpha_n \geq 0$ for the constraints gives the Lagrangian:

$$\mathcal{L}(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_n \alpha_n [y_n (w^\top x_n + b) - 1]$$

Setting $\partial \mathcal{L}/\partial w = 0$ gives $w = \sum_n \alpha_n y_n x_n$, and $\partial \mathcal{L}/\partial b = 0$ gives $\sum_n \alpha_n y_n = 0$. Substituting back, the $b$ term drops out by the second condition and $\frac{1}{2}\|w\|^2 - \sum_n \alpha_n y_n w^\top x_n = -\frac{1}{2}\|w\|^2$, yielding the dual problem:

$$\max_\alpha \sum_n \alpha_n - \frac{1}{2} \sum_{n, m} \alpha_n \alpha_m y_n y_m \, x_n^\top x_m$$

$$\text{s.t.} \quad \alpha_n \geq 0, \quad \sum_n \alpha_n y_n = 0$$

At the optimum the KKT conditions hold (stationarity, feasibility, complementary slackness); the one that matters here is complementary slackness:

  • $\alpha_n [y_n (w^\top x_n + b) - 1] = 0 \ \forall n$

Points with $\alpha_n > 0$ are exactly those on the margin, the support vectors. All other points have $\alpha_n = 0$ and don't affect the decision boundary.
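
The dual is equally easy to hand to a solver: for the linear kernel the quadratic term is just $\|\sum_n \alpha_n y_n x_n\|^2$, which keeps the cvxpy formulation simple. A sketch under the same assumptions (toy data and names are illustrative):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

a = cp.Variable(len(y), nonneg=True)          # alpha_n >= 0
u = X.T @ cp.multiply(a, y)                   # sum_n alpha_n y_n x_n
cp.Problem(cp.Maximize(cp.sum(a) - 0.5 * cp.sum_squares(u)),
           [y @ a == 0]).solve()

alpha = a.value
sv = alpha > 1e-6                             # support vectors: alpha_n > 0
w = X.T @ (alpha * y)                         # stationarity: w = sum alpha y x
b = y[sv][0] - X[sv][0] @ w                   # from y_n (w^T x_n + b) = 1
print("support vectors:", sv.sum(), "w =", w, "b =", b)
```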

Soft-margin SVM (Cortes & Vapnik 1995) introduces slack variables $\xi_n \geq 0$ for non-separable data:

$$\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_n \xi_n \quad \text{s.t.} \quad y_n(w^\top x_n + b) \geq 1 - \xi_n, \ \xi_n \geq 0$$

Hyperparameter $C > 0$ controls the trade-off between margin maximisation and slack penalisation. The dual problem is identical except $0 \leq \alpha_n \leq C$.
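
In practice this is the C parameter exposed by scikit-learn's SVC. A minimal sketch showing how the support-vector count responds to $C$ (the data and the particular C values are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
y[:10] *= -1                  # flip a few labels so the data is non-separable

# Small C tolerates slack (wide margin); large C penalises violations hard.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```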

Kernel trick. The dual depends on the data only through inner products $x_n^\top x_m$. Replace these with a kernel $K(x_n, x_m) = \phi(x_n)^\top \phi(x_m)$, an inner product under some feature map $\phi$:

$$\max_\alpha \sum_n \alpha_n - \frac{1}{2} \sum_{n, m} \alpha_n \alpha_m y_n y_m K(x_n, x_m)$$

Standard kernels (each computed in the sketch after the list):

  • Linear: $K(x, y) = x^\top y$
  • Polynomial: $K(x, y) = (x^\top y + c)^d$
  • RBF (Gaussian): $K(x, y) = \exp(-\gamma \|x - y\|^2)$
  • Sigmoid: $K(x, y) = \tanh(\kappa \, x^\top y + c)$ (written with $\kappa$ to avoid a clash with the multipliers $\alpha_n$; not positive semi-definite for all parameter choices)
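
A minimal NumPy sketch of these four Gram matrices (the helper names, parameter defaults, and test data are illustrative):

```python
import numpy as np

def linear(X, Y):
    return X @ Y.T

def polynomial(X, Y, c=1.0, d=3):
    return (X @ Y.T + c) ** d

def rbf(X, Y, gamma=0.5):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x^T y, for all pairs at once
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def sigmoid(X, Y, kappa=0.1, c=0.0):
    return np.tanh(kappa * (X @ Y.T) + c)

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf(X, X)
# A valid kernel matrix is symmetric PSD; the RBF kernel always is.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-10)
```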

The decision function is

$$f(x) = \mathrm{sign}\!\left(\sum_n \alpha_n y_n K(x_n, x) + b\right)$$

with the sum running over support vectors only.
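
A fitted scikit-learn SVC stores exactly these pieces: dual_coef_ holds $\alpha_n y_n$ for the support vectors, support_vectors_ the corresponding $x_n$, and intercept_ the bias $b$. A sketch reproducing decision_function by hand for the RBF kernel (toy data and the gamma value are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# f(x) = sum_n alpha_n y_n K(x_n, x) + b, over support vectors only.
x_new = rng.normal(size=(5, 2))
sq = ((clf.support_vectors_[:, None, :] - x_new[None, :, :]) ** 2).sum(-1)
f = clf.dual_coef_ @ np.exp(-0.5 * sq) + clf.intercept_

print(np.allclose(f.ravel(), clf.decision_function(x_new)))  # True
```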

Solver: standard libraries (libsvm, scikit-learn) use sequential minimal optimization (SMO), which handles the QP efficiently by solving two-variable subproblems analytically. Training time scales between $O(N^2)$ and $O(N^3)$ depending on the data, with kernel evaluations dominating the per-iteration cost.
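
For reference, the analytic step (Platt, 1998): pick two multipliers $\alpha_1, \alpha_2$; with prediction errors $E_i = f(x_i) - y_i$ and curvature $\eta = K(x_1, x_1) + K(x_2, x_2) - 2K(x_1, x_2)$, the unclipped update is

$$\alpha_2^{\text{new}} = \alpha_2 + \frac{y_2 (E_1 - E_2)}{\eta}$$

clipped to the box $[L, H]$ implied by $0 \leq \alpha_n \leq C$ and $\sum_n \alpha_n y_n = 0$, after which $\alpha_1^{\text{new}} = \alpha_1 + y_1 y_2 (\alpha_2 - \alpha_2^{\text{new}})$ keeps the equality constraint satisfied.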

Primal vs dual: solving the primal form is faster when $d < N$ (more samples than features), since the primal has $d + 1$ variables against the dual's $N$; the dual is preferred for kernelised SVMs and high-dimensional features. LibLinear specifically handles the linear-SVM primal at scale.
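
In scikit-learn this split surfaces as SVC(kernel="linear"), which always solves the kernelised dual, versus LinearSVC, which is LibLinear-backed and can solve the primal via dual=False. A sketch (dataset shape and C are illustrative):

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))          # N >> d: primal territory
y = np.where(X @ rng.normal(size=10) > 0, 1, -1)

kernelised = SVC(kernel="linear", C=1.0).fit(X, y)    # dual QP via SMO
linear = LinearSVC(C=1.0, dual=False).fit(X, y)       # primal via LibLinear

print(kernelised.score(X, y), linear.score(X, y))
```

Note that LinearSVC uses squared hinge loss by default, so its solution differs slightly from SVC's hinge-loss solution.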

Related terms: Support Vector Machine, Vladimir Vapnik, Kernel Trick, Lagrangian and KKT
