7.8 Support vector machines

Support vector machines, introduced in their modern form by Cortes and Vapnik in 1995, were the dominant classification method through the late 1990s and 2000s. The idea is geometric and easy to picture: among all hyperplanes that separate two classes, choose the one that sits as far as possible from the nearest training point on either side. That distance is the margin, and the points that touch it are the support vectors. Everything else in the training set could be deleted without changing the decision boundary, which is why SVMs are sometimes called sparse classifiers.

Two ideas pushed SVMs from a tidy geometric trick into a workhorse algorithm. The first was the soft-margin formulation, which allows a controlled amount of misclassification so that the method copes with overlapping classes. The second was the kernel trick, which lets the same machinery fit highly non-linear boundaries simply by replacing inner products with a kernel function, no explicit feature engineering, no extra parameters to learn. Together with strong theoretical guarantees from Vapnik–Chervonenkis (VC) theory, this made SVMs a default choice for tabular and text classification long before deep learning matured.

In 2026 the SVM is no longer the first thing you reach for on perceptual data. Convolutional networks and transformers eat images, audio and language for breakfast, and gradient-boosted trees own most tabular leaderboards. But the SVM has not disappeared. It still performs well on small, high-dimensional problems where deep learning overfits, and it remains one of the cleanest demonstrations of convex optimisation, duality and reproducing-kernel Hilbert spaces in machine learning. Every working machine-learning practitioner should be able to read an SVM and know what each symbol does.

This section closes the chapter's algorithmic core. We have seen linear and logistic regression in §7.2 and §7.3, generalised linear models in §7.4, $k$-nearest neighbours in §7.5, decision trees in §7.6, and naive Bayes in §7.7. The SVM rounds out the classical methods; ensembles and gradient boosting follow in §7.10.

Symbols used here

  • $\mathbf{x}$: input feature vector
  • $y \in \{-1, +1\}$: class label (note the $\pm 1$ convention, not $\{0, 1\}$)
  • $\mathbf{w}$: normal vector to the separating hyperplane
  • $b$: bias / offset
  • $\xi_i$: slack variable for example $i$
  • $C$: regularisation constant trading margin width against violations
  • $\alpha_i$: Lagrange multiplier for example $i$
  • $K(\mathbf{x}, \mathbf{x}')$: kernel function (an inner product in some feature space)

7.8.1 The geometry of margins

Imagine a scatter of points in two dimensions, half red and half blue, that can be cleanly separated by a straight line. Infinitely many lines will do it. Some pass within a hair's breadth of a red point and far from any blue; others split the gap evenly. Which is best?

The SVM answer is: the line that is as far as possible from the nearest point of either class. Geometrically, draw the widest slab you can between the two classes, then place the boundary down its centre. The half-width of this slab is the margin, and the points that sit on its edges are the support vectors. The intuition behind maximising it is robustness: a boundary that hugs a single training point will flip if that point shifts even slightly, whereas a boundary in the middle of a wide gap tolerates noise.

Formally, for a hyperplane $\mathbf{w}^\top\mathbf{x} + b = 0$, the signed distance from a point $\mathbf{x}$ to the hyperplane is $(\mathbf{w}^\top\mathbf{x} + b) / \|\mathbf{w}\|$. With labels $y_i \in \{-1, +1\}$, correct classification means $y_i(\mathbf{w}^\top\mathbf{x}_i + b) > 0$. The geometric margin of the classifier is the smallest such distance:

$$\gamma = \min_i \frac{y_i(\mathbf{w}^\top\mathbf{x}_i + b)}{\|\mathbf{w}\|}.$$

The hyperplane is unchanged if we multiply $(\mathbf{w}, b)$ by a positive constant, so we are free to rescale. The conventional choice is to fix the functional margin of the closest points to one: $|\mathbf{w}^\top\mathbf{x}_i + b| = 1$ for those points. With this normalisation, $\gamma = 1/\|\mathbf{w}\|$, and maximising the margin is equivalent to minimising $\tfrac{1}{2}\|\mathbf{w}\|^2$; the factor of one-half is conventional and makes the gradient cleaner. The two parallel hyperplanes $\mathbf{w}^\top\mathbf{x} + b = +1$ and $\mathbf{w}^\top\mathbf{x} + b = -1$ are the boundaries of the slab; the decision surface is exactly halfway between.

This rescaling step trips up many students. The hyperplane has a direction (set by $\mathbf{w}/\|\mathbf{w}\|$) and a position (set by $b/\|\mathbf{w}\|$). Both are unaltered by the rescaling. What changes is the length of $\mathbf{w}$, and the SVM uses that freedom to fix a convenient unit on the functional margin.
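
To make the distance formula concrete, here is a minimal numpy sketch that computes the signed distances and the geometric margin for a fixed, illustrative hyperplane (the four points are the same ones used in the worked example of §7.8.6):

import numpy as np

# Toy data: two points per class, labels in {-1, +1}.
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1, -1, +1, +1])

# A candidate hyperplane w^T x + b = 0 (not necessarily the optimal one).
w = np.array([1.0, 1.0])
b = -3.0

# gamma = min_i  y_i (w^T x_i + b) / ||w||
signed_dist = y * (X @ w + b) / np.linalg.norm(w)
print("signed distances:", signed_dist)
print("geometric margin:", signed_dist.min())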

7.8.2 The hard-margin primal

When the classes are linearly separable, the hard-margin SVM is

$$\min_{\mathbf{w}, b} \tfrac{1}{2}\|\mathbf{w}\|^2 \quad\text{subject to}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 \quad \forall i.$$

The objective is a convex quadratic (strictly convex in $\mathbf{w}$), the constraints are linear, and the feasible set is non-empty by assumption. This is a textbook convex quadratic programme, and its solution $(\mathbf{w}, b)$ is unique. Standard QP solvers find it in polynomial time.
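
The same problem can be handed to a general-purpose convex solver. The sketch below is illustrative only and assumes the cvxpy package is available; production SVM libraries use the specialised solvers described in §7.8.7.

import numpy as np
import cvxpy as cp

# Linearly separable toy data (the four points of the worked example in §7.8.6).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

w = cp.Variable(2)
b = cp.Variable()

# Hard-margin primal: minimise (1/2)||w||^2 subject to y_i (w^T x_i + b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)   # expect roughly w = (1, 1), b = -3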

Each constraint $y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1$ says the point $\mathbf{x}_i$ must lie on the correct side of the slab, with at least the margin width to spare. Points strictly outside the slab satisfy the constraint with strict inequality and play no role in determining the optimal $(\mathbf{w}, b)$; only points on the slab boundary, where the inequality binds with equality, do.

Two practical caveats. First, the hard-margin SVM is infeasible when the classes overlap (as they almost always do in real data): there is no separating hyperplane, so the feasible set is empty and the optimisation has no solution. Second, even when the classes are separable, a single noisy outlier can pull the boundary dramatically. Both problems motivate the soft-margin extension below.

7.8.3 Soft margin: real-world SVMs

Real datasets overlap. Even when most of the data is separable, you almost always want to allow a few violations to keep the boundary sensible. The soft-margin SVM introduces non-negative slack variables $\xi_i$ measuring how badly each point breaks the margin, and adds them to the objective with a weight $C$:

$$\min_{\mathbf{w}, b, \boldsymbol\xi} \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad\text{s.t.}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \geq 1 - \xi_i, \ \xi_i \geq 0.$$

Reading the constraints: $\xi_i = 0$ for points on or beyond the margin on the correct side; $0 < \xi_i \leq 1$ for points inside the slab but still on the correct side of the decision boundary; $\xi_i > 1$ for points actually misclassified. Since every misclassified point has $\xi_i > 1$, the sum $\sum_i \xi_i$ is an upper bound on the number of training errors.

The hyperparameter $C$ is the SVM's regularisation knob. Large $C$ heavily penalises any violation: the boundary contorts to fit the training data and the margin narrows. Small $C$ tolerates violations cheaply: the boundary smooths out and the margin widens. In the limit $C \to \infty$ we recover the hard-margin problem (for separable data); in the limit $C \to 0$ the data are ignored and only $\|\mathbf{w}\|^2$ matters. In practice $C$ is tuned by cross-validation, typically over a logarithmic grid such as $\{0.01, 0.1, 1, 10, 100\}$.
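
A quick way to see the trade-off is to fit a linear soft-margin SVM at several values of $C$ and watch the support-vector count and margin width respond. A minimal scikit-learn sketch on synthetic, overlapping data (the dataset parameters are illustrative):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping Gaussian blobs: not linearly separable.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in [0.01, 0.1, 1, 10, 100]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_)   # half-width of the slab
    print(f"C={C:<6} support vectors={len(clf.support_):3d} margin={margin:.3f}")

Smaller $C$ should give a wider margin and more support vectors; larger $C$ the reverse.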

A useful re-write makes the SVM look like every other regularised loss minimiser in this chapter. Eliminate $\xi_i$ using $\xi_i = \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$, the smallest non-negative value satisfying the constraint, to obtain

$$\min_{\mathbf{w}, b} \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max\big(0,\ 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b)\big).$$

The second term is the hinge loss, zero on points correctly classified by at least the margin and growing linearly inside or beyond it. The SVM is therefore an empirical risk minimiser with a hinge loss and an $\ell_2$ penalty, the same shape as logistic regression with the log-loss replaced by the hinge. The hinge's flat region is what produces the support-vector sparsity: most points contribute zero gradient.
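
This view also suggests fitting the same model with a generic gradient-based solver rather than a QP. A minimal sketch using scikit-learn's SGDClassifier, which minimises a hinge-plus-$\ell_2$ objective of this form (its alpha parameter plays roughly the role of $1/(Cn)$; the precise correspondence depends on the library's scaling conventions):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Hinge loss + l2 penalty: a soft-margin linear SVM fitted by stochastic gradient descent.
clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, max_iter=1000,
                    random_state=0).fit(X, y)

# Hinge loss of the fitted model, computed directly from the decision values.
y_pm = np.where(y == 1, 1, -1)                 # map {0, 1} labels to {-1, +1}
scores = clf.decision_function(X)
hinge = np.maximum(0.0, 1.0 - y_pm * scores)
print("train accuracy:", clf.score(X, y), " mean hinge loss:", hinge.mean())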

7.8.4 The Lagrangian dual

A great deal of SVM elegance lives in the Lagrangian dual. Form the Lagrangian for the soft-margin primal with multipliers $\alpha_i \geq 0$ on the margin constraints and $\mu_i \geq 0$ on the slack non-negativity:

$$\mathcal{L} = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big[y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i\big] - \sum_i \mu_i \xi_i.$$

Stationarity in $\mathbf{w}$, $b$ and $\xi_i$ gives, in turn,

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i, \qquad \sum_i \alpha_i y_i = 0, \qquad \alpha_i + \mu_i = C.$$

Substituting the first into $\mathcal{L}$ and using the others to eliminate $b$ and $\boldsymbol\xi$ produces the dual problem:

$$\max_{\boldsymbol\alpha} \quad \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j (\mathbf{x}_i^\top\mathbf{x}_j) \quad\text{s.t.}\quad 0 \leq \alpha_i \leq C,\ \sum_i \alpha_i y_i = 0.$$

This is again a convex QP, but in the $\alpha_i$ rather than in $(\mathbf{w}, b)$. Three features of it deserve attention.

First, the dual depends on the data only through the inner products $\mathbf{x}_i^\top\mathbf{x}_j$. The original feature vectors disappear. This sets up the kernel trick (next subsection): replace the inner product with any valid kernel, and the entire optimisation runs unchanged on a new, possibly infinite-dimensional, feature space.

Second, the Karush–Kuhn–Tucker (KKT) conditions classify each training point. Complementary slackness gives $\alpha_i [y_i(\mathbf{w}^\top\mathbf{x}_i + b) - 1 + \xi_i] = 0$ and $\mu_i \xi_i = 0$. Combining with $\alpha_i + \mu_i = C$ leads to three regimes: $\alpha_i = 0$ for points strictly outside the margin (irrelevant); $0 < \alpha_i < C$ for points exactly on the margin with $\xi_i = 0$ (the margin support vectors, used to recover $b$); $\alpha_i = C$ for points inside the margin or misclassified (the bound support vectors, with $\xi_i > 0$). Only support vectors carry non-zero $\alpha$, and only support vectors influence the boundary.

Third, predictions take a particularly clean form. Substituting $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ into $\hat f(\mathbf{x}) = \text{sign}(\mathbf{w}^\top\mathbf{x} + b)$ gives

$$\hat f(\mathbf{x}) = \text{sign}\!\left( \sum_{i \in SV} \alpha_i y_i (\mathbf{x}_i^\top\mathbf{x}) + b \right),$$

a sum over support vectors only. The bias $b$ is recovered from any margin support vector $\mathbf{x}_s$ via $b = y_s - \mathbf{w}^\top\mathbf{x}_s$, or, more robustly, by averaging this expression over all margin support vectors.
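
scikit-learn exposes the fitted dual quantities, so the prediction formula can be verified by hand. A minimal sketch (in sklearn's SVC, dual_coef_ stores the products $\alpha_i y_i$ for the support vectors and intercept_ stores $b$):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Rebuild the decision function from support vectors only:
# f(x) = sum_{i in SV} alpha_i y_i <x_i, x> + b.
alpha_y = clf.dual_coef_[0]        # alpha_i * y_i, one entry per support vector
sv = clf.support_vectors_          # the support vectors themselves
b = clf.intercept_[0]

K = sv @ X.T                       # inner products <x_i, x> for every training x
f_manual = alpha_y @ K + b
print(np.allclose(f_manual, clf.decision_function(X)))   # True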

7.8.5 The kernel trick

The dual depends on the data through inner products alone. So if we map the inputs through some non-linear feature map $\phi: \mathbb{R}^d \to \mathcal{H}$ into a richer space $\mathcal{H}$, the only change to the optimisation is replacing $\mathbf{x}_i^\top\mathbf{x}_j$ with $\phi(\mathbf{x}_i)^\top\phi(\mathbf{x}_j)$. Suppose we never compute $\phi(\mathbf{x})$ explicitly but instead supply a function $K(\mathbf{x}, \mathbf{x}')$ that equals an inner product in some feature space. Then the entire SVM machinery, dual optimisation, support vectors, predictions, works in $\mathcal{H}$ without our ever touching it. This is the kernel trick, and it is a startling computational gift. We can fit boundaries in infinite-dimensional spaces using only an $n \times n$ Gram matrix of kernel values.

What makes a function a valid kernel? Mercer's theorem says: $K$ is an inner product in some feature space if and only if it is symmetric positive semi-definite, that is, the Gram matrix $\mathbf{K}_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is symmetric PSD for every finite sample. This abstract characterisation lets us build kernels by combining simpler ones (sums, products, scalings) without ever writing down the underlying $\phi$.

The four kernels you will see most often are:

  • Linear: $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$. Recovers the original linear SVM. The right default for high-dimensional sparse data such as bag-of-words text, where the classes are often close to linearly separable in the raw feature space.
  • Polynomial: $K(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top\mathbf{x}' + c)^d$. The implicit feature map contains all monomials of total degree up to $d$. Useful when interaction terms matter and the input dimension is small.
  • Radial basis function (Gaussian): $K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$. The implicit feature space is infinite-dimensional: Taylor-expanding the exponential gives monomials of every degree at once. The most general-purpose default, controlled by two hyperparameters, $C$ and $\gamma$. Small $\gamma$ produces a smooth boundary; large $\gamma$ produces a wiggly one (overfitting risk).
  • Sigmoid: $K(\mathbf{x}, \mathbf{x}') = \tanh(\kappa\, \mathbf{x}^\top\mathbf{x}' + \theta)$. Of historical interest as the nominal connection to neural networks, but not always positive semi-definite, and largely abandoned in practice.

There are also string kernels for biological sequences and graph kernels for molecular structures, both built on the same Mercer-condition recipe.
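
To see the Mercer condition and the Gram matrix in action, the sketch below builds an RBF Gram matrix, checks numerically that it is (up to rounding) positive semi-definite, and feeds it to an SVM with a precomputed kernel; the result should match fitting the RBF kernel directly.

import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

gamma = 0.5
K = rbf_kernel(X, X, gamma=gamma)             # the n x n Gram matrix

# Mercer check: a valid kernel gives a symmetric PSD Gram matrix.
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding

# The same SVM fitted two ways: from the Gram matrix, and from the raw data.
clf_pre = SVC(kernel="precomputed", C=1.0).fit(K, y)
clf_rbf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)
print(clf_pre.score(K, y), clf_rbf.score(X, y))               # should match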

The grid search over $(C, \gamma)$ for an RBF SVM is the canonical hyperparameter-tuning exercise in machine-learning textbooks. Pick a logarithmic grid for each, run cross-validation, look at the heatmap, take the maximum.

7.8.6 Worked example: a linear SVM on four points

Let us run the algorithm on a tiny dataset to make the abstract concrete. Four two-dimensional points: class $+1$ at $(2, 2)$ and $(3, 3)$, class $-1$ at $(0, 0)$ and $(1, 1)$. The data are linearly separable along the diagonal $x_1 + x_2 = $ constant.

By symmetry the optimal hyperplane is perpendicular to the $(1, 1)$ direction. So $\mathbf{w} = (w, w)$ for some $w > 0$, and the boundary is $w(x_1 + x_2) + b = 0$, i.e. $x_1 + x_2 = -b/w$. The closest points to this line on either side are $(1, 1)$ from class $-1$ and $(2, 2)$ from class $+1$; these are the support vectors. The boundary should sit halfway between them, at $x_1 + x_2 = 3$.

To enforce the canonical normalisation $|\mathbf{w}^\top\mathbf{x} + b| = 1$ at the support vectors, set $w \cdot 1 + w \cdot 1 + b = -1$ at $(1, 1)$ and $w \cdot 2 + w \cdot 2 + b = +1$ at $(2, 2)$. Solving: $2w + b = -1$ and $4w + b = +1$, giving $w = 1$ and $b = -3$. So $\mathbf{w} = (1, 1)$, $\|\mathbf{w}\| = \sqrt{2}$, and the geometric margin is $1/\|\mathbf{w}\| = 1/\sqrt{2}$ on each side, or $\sqrt{2}$ in total slab width. The two non-support vectors $(0, 0)$ and $(3, 3)$ have $|\mathbf{w}^\top\mathbf{x} + b| = 3$, three times the canonical margin, and so contribute $\alpha_i = 0$ in the dual.
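
The hand calculation is easy to check numerically. A minimal sketch with scikit-learn's linear-kernel SVC, using a large $C$ so the soft-margin solution coincides with the hard-margin one on this separable data:

import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 1.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w =", clf.coef_[0])                          # expect (1, 1)
print("b =", clf.intercept_[0])                     # expect -3
print("support vectors:\n", clf.support_vectors_)   # (1, 1) and (2, 2)
print("margin =", 1.0 / np.linalg.norm(clf.coef_))  # expect about 0.707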

This little example is small enough to do entirely by hand, yet it exhibits every feature of the algorithm: the geometric optimum, the role of support vectors, the margin width, the irrelevance of non-support vectors, and the canonical $\pm 1$ normalisation that fixes the scale.

7.8.7 Solving the QP: SMO

The dual is a convex QP in $n$ variables. A general QP solver costs $O(n^3)$, prohibitive past a few thousand examples. John Platt's Sequential Minimal Optimization (1998) decomposes the problem into the smallest possible subproblems: pairs $(\alpha_i, \alpha_j)$, each solved analytically while the other multipliers are held fixed. The equality constraint $\sum_i \alpha_i y_i = 0$ couples the $\alpha$s, which is why SMO must update two at a time rather than one. SMO is the algorithm inside LIBSVM, the reference SVM implementation, and is what you are running whenever you call sklearn.svm.SVC. Even with SMO, kernel SVMs scale roughly as $O(n^2)$ to $O(n^3)$ in training time and need $O(n_{SV})$ kernel evaluations per prediction. Past about $10^5$ examples kernel SVMs become impractical, which is part of why deep learning supplanted them on large datasets in the 2010s.

7.8.8 Where SVMs still earn their keep in 2026

Three settings where SVMs remain genuinely useful. First, small-to-medium datasets in high dimensions: text classification with bag-of-words features (hundreds of thousands of dimensions, a few thousand examples), bioinformatics with gene-expression vectors, hyperspectral imaging. The max-margin objective is built-in regularisation, and the kernel trick allows non-linear boundaries without an explosion of parameters. Second, theoretical and educational work: the SVM is the cleanest practical demonstration of convex duality, KKT conditions and reproducing kernel Hilbert spaces in machine learning, and remains a fixture of every serious ML course. Third, anomaly detection via the one-class SVM, which fits a tight boundary around the bulk of the training data and flags points outside it as anomalies.
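
For the anomaly-detection use, scikit-learn's OneClassSVM is the standard entry point. A minimal sketch on made-up data (nu roughly caps the fraction of training points treated as outliers):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))            # "normal" behaviour
X_test = np.array([[0.1, -0.2], [4.0, 4.0]])   # one typical point, one outlier

oc = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)
print(oc.predict(X_test))                      # expect [ 1 -1 ]: +1 inlier, -1 anomaly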

For multi-class problems, decompose into one-vs-rest or one-vs-one binaries. For probability outputs, fit Platt scaling: a sigmoid on the SVM decision values, calibrated on a held-out set. For very large datasets, switch to a linear SVM trained by stochastic methods (LIBLINEAR), which scales to millions of examples but loses the kernel trick. The snippet below puts the standard recipe together: an RBF SVM tuned over $(C, \gamma)$ by cross-validated grid search on a small synthetic dataset.

from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split

# Small non-linearly-separable dataset with a held-out test split.
X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.25, random_state=0)

# Cross-validated grid search over the RBF SVM's two knobs, C and gamma.
grid = GridSearchCV(SVC(),
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [0.1, 1, 10]},
    cv=5, n_jobs=-1).fit(Xtr, ytr)
print("Best params:", grid.best_params_, " test acc:", grid.score(Xte, yte))

7.8.9 What you should take away

  1. The SVM finds the maximum-margin separating hyperplane. Among all linear separators, it picks the one that sits as far as possible from the nearest training point on either side; the margin equals $1/\|\mathbf{w}\|$ under canonical normalisation.
  2. Soft margin makes it practical. Slack variables and the regularisation constant $C$ trade off margin width against margin violations; the primal can be rewritten as hinge-loss plus $\ell_2$ penalty, placing the SVM in the family of regularised empirical-risk minimisers.
  3. The dual exposes support vectors and inner products. Only a sparse subset of training points (those with $\alpha_i > 0$) determines the boundary, and the optimisation depends on the data only through inner products $\mathbf{x}_i^\top\mathbf{x}_j$.
  4. The kernel trick fits non-linear boundaries cheaply. Replace the inner product with any positive semi-definite kernel (RBF is the default) and the SVM operates in a richer (possibly infinite-dimensional) feature space without ever computing the feature map; tune $(C, \gamma)$ by cross-validation.
  5. Use SVMs on small, high-dimensional problems. Deep learning has won perceptual tasks and gradient boosting owns most tabular data, but the SVM still shines on text, bioinformatics and similar small-$n$, large-$d$ settings, and remains the canonical worked example for convex duality and kernel methods.
