7.5 $k$-Nearest Neighbours
Of all the algorithms that sit under the banner of supervised learning, $k$-nearest neighbours is the one that an outsider can grasp without any background in mathematics. The idea is direct: if you want to know the label of a new example, look at the training examples that resemble it most closely and ask them to vote. There is no curve to fit, no parameter vector to estimate, no loss to minimise. The training set itself is the model. To classify a fresh patient based on age, blood pressure and resting heart rate, you scan your records, locate the handful of past patients who look most similar on those three dimensions, and predict whatever outcome was most common among them. For regression you average rather than vote, but the spirit is identical. This is sometimes called lazy learning because all of the work is deferred until inference time. The "training" phase consists of nothing more than copying the dataset into memory.
That conceptual cleanness is what earns $k$-nearest neighbours its place at the start of any honest tour of classical machine learning. It separates the question what does it mean to learn from data? from the question how do we squeeze a learning rule into a small parameter vector? You can teach $k$-NN to a clinician in a single sentence, and once the clinician understands it, every other supervised algorithm in this chapter can be framed as an attempt to recover its strengths while shedding its weaknesses. Decision trees abandon distances in favour of axis-aligned questions; logistic regression replaces local voting with a globally fitted boundary; neural networks learn the embedding space in which "nearest" finally means what we want it to mean. None of those algorithms displaces $k$-NN so much as repackages parts of it. As we shall see at the end of this section, the same vote-among-neighbours procedure, lifted into the embedding space of a modern transformer, is the engine inside every retrieval-augmented generation system in production today.
The algorithm
- Choose $k$ and a distance.
- To predict for $\mathbf{x}$: find $N_k(\mathbf{x})$, the set of $k$ training points in $\mathcal{D}$ nearest to $\mathbf{x}$.
- Classification: majority vote. Regression: average.
That's it.
The brevity of the recipe is not a sign that something has been left out. It is the entire procedure. Strip away the implementation tricks and what remains is a sort of empirical Bayes rule applied locally: the posterior probability of class $c$ at the query point is approximated by the frequency of class $c$ among the $k$ training points that surround it. When $k$ is small, that estimate is jittery but exquisitely sensitive to local structure. When $k$ is large, the estimate is smoother but increasingly tells you about the global class balance rather than anything specific to $\mathbf{x}$. Choosing $k$ is therefore not a matter of taste but of bias and variance, and we will return to it shortly.
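The recipe translates almost line for line into code. Below is a minimal sketch in plain NumPy, assuming a small dataset held entirely in memory, Euclidean distance and uniform voting; it is meant to make the procedure concrete rather than to compete with a library implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest neighbours."""
    # No training step: compute the distance from the query to every stored point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]

def knn_regress(X_train, y_train, x_query, k=3):
    """Predict a continuous target as the mean of the k nearest neighbours' targets."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return y_train[nearest].mean()
```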
Two finer points of the recipe deserve emphasis before we proceed. First, the vote does not have to be uniform. A common refinement weights each neighbour by $1/d_i$ or by $\exp(-d_i^2/\sigma^2)$, so that close neighbours speak more loudly than the ones at the edge of the neighbourhood. This softens the algorithm's behaviour near class boundaries and reduces the impact of choosing $k$ slightly too large. Second, although the algorithm has no training cost in the conventional sense, it does have a preparation cost: you must decide what features to use, how to scale them, and which distance to compute. Those choices are no less consequential than the choice of architecture in a neural network. They are, in effect, where the learning happens, done by the analyst rather than by the optimiser.
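A possible distance-weighted variant of the same sketch, using the $1/d_i$ weights mentioned above; the small `eps` constant, which guards against a query that coincides exactly with a training point, is an implementation choice rather than part of the algorithm.

```python
import numpy as np

def knn_predict_weighted(X_train, y_train, x_query, k=3, eps=1e-9):
    """Classify by a vote in which closer neighbours count for more (weight 1/d)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = {}
    for i in nearest:
        label = y_train[i]
        weights[label] = weights.get(label, 0.0) + 1.0 / (dists[i] + eps)
    return max(weights, key=weights.get)     # class with the largest total weight
```

scikit-learn exposes the same refinement as `KNeighborsClassifier(weights="distance")`.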
Worked example
Training: 3 red points $(1,1), (2,1), (1,2)$ and 3 blue points $(5,5), (6,5), (5,6)$. Query $\mathbf{x} = (3, 3)$. Euclidean distances: red points at $\sqrt{8}, \sqrt{5}, \sqrt{5} \approx 2.83, 2.24, 2.24$. Blue points at $\sqrt{8}, \sqrt{13}, \sqrt{13} \approx 2.83, 3.61, 3.61$. With $k = 3$: the majority of the 3 nearest are red. Predict red.
Trace the arithmetic by hand once and the algorithm becomes muscle memory. The query lies in the gap between two clearly separated clusters, slightly closer to the red cluster centred near $(1.3, 1.3)$ than to the blue cluster centred near $(5.3, 5.3)$. Computing each distance requires nothing more elaborate than Pythagoras. The red point $(2,1)$ is at distance $\sqrt{(3-2)^2 + (3-1)^2} = \sqrt{1 + 4} = \sqrt{5}$; the red point $(1,2)$ is symmetrically at $\sqrt{4+1} = \sqrt{5}$; and the corner $(1,1)$ is at $\sqrt{4+4} = \sqrt{8}$. The closest blue point, $(5,5)$, is also at $\sqrt{8}$, tied with the red corner, while every other blue point is further still. The third neighbour is therefore a toss-up between a red and a blue, but the verdict does not depend on how the tie is broken: the two undisputed nearest points are both red, so the vote is at least 2-to-1 and the prediction is red either way.
It is instructive to ask what happens as we vary $k$. With $k = 1$, the prediction is again red, because the single nearest training point (one of the two reds tied at $\sqrt{5}$) is red. With $k = 4$, we admit the closest blue point $(5,5)$ alongside the three reds, and a 3-versus-1 vote still favours red. Growing $k$ further only dilutes the verdict: at $k = 5$ the vote is 3 to 2, and at $k = 6$ it is split evenly between three reds and three blues, at which point a tie-breaking rule (smallest summed distance, lowest class label, random) is needed.
Now move the query to $(4, 4)$ and the picture changes. The distances to red are $\sqrt{18}, \sqrt{13}, \sqrt{13}$ and to blue they are $\sqrt{2}, \sqrt{5}, \sqrt{5}$, so all three nearest neighbours are blue and the prediction flips. Somewhere between $(3,3)$ and $(4,4)$ lies a decision boundary, the locus of query points at which the vote is exactly tied. With Euclidean distance and uniform weights, the $k$-NN decision boundary is piecewise linear: for $k=1$ it follows the edges of the Voronoi diagram of the training points (keeping only the edges that separate differently labelled cells), and for larger $k$ it is a smoothed version of the same construction. Sketching it on graph paper for half a dozen training points is the single best exercise for understanding why $k$-NN behaves the way it does, and why its boundary becomes ragged when $k$ is small and smooth when $k$ is large.
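The example is small enough to verify in a few lines of code. The snippet below prints the sorted distances and the vote for several values of $k$, for both query points; note that the tie at $\sqrt{8}$ falls to the red corner only because it appears earlier in the array, which is exactly the kind of arbitrary tie-break discussed above.

```python
import numpy as np

X = np.array([[1, 1], [2, 1], [1, 2],     # red cluster
              [5, 5], [6, 5], [5, 6]])    # blue cluster
y = np.array(['red', 'red', 'red', 'blue', 'blue', 'blue'])

for query in ([3, 3], [4, 4]):
    d = np.linalg.norm(X - np.array(query), axis=1)
    order = np.argsort(d, kind='stable')   # stable sort: ties resolved by index order
    print(query, [f"{lbl} {dist:.2f}" for lbl, dist in zip(y[order], d[order])])
    for k in (1, 3, 5):
        labels, counts = np.unique(y[order[:k]], return_counts=True)
        print(f"  k={k}:", dict(zip(labels, counts)))
```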
Distance metrics
- Euclidean (L2): $\sqrt{\sum (x_i - x'_i)^2}$. Standard for continuous features.
- Manhattan (L1): $\sum |x_i - x'_i|$. Robust to outliers.
- Cosine: $1 - \mathbf{x} \cdot \mathbf{x}'/(\|\mathbf{x}\|\|\mathbf{x}'\|)$. For high-dim sparse data (text).
- Hamming: number of differing positions. For binary data.
- Mahalanobis: $\sqrt{(\mathbf{x} - \mathbf{x}')^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \mathbf{x}')}$. Accounts for feature covariance.
The Euclidean metric is the default because we are taught it as the literal distance between points in a plane, but it has two well-known weaknesses. It assumes that every feature is equally informative, and it treats the feature axes as uncorrelated and measured on comparable scales. Both assumptions fail almost immediately in real data. If one feature is age in years (typical range 0–100) and another is annual income in pounds (typical range 0–100,000), then Euclidean distance is essentially income distance: age contributes nothing. The standard fix is to standardise every feature to zero mean and unit variance before computing distances, or to scale to the unit interval. Without one of these steps, $k$-NN silently devolves into "nearest neighbour in whichever feature happens to have the largest numeric range". This is the single most common pitfall when applying the algorithm in practice, and it can make a working system look broken until you spot it.
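A minimal sketch of the fix with scikit-learn, on toy data where one feature is age in years and the other is income in pounds; putting the scaler and the classifier in a single pipeline guarantees that the standardisation learned from the training data is applied identically to every query.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Toy data: feature 0 is age (years), feature 1 is income (pounds).
X_train = np.array([[25, 20_000], [30, 22_000], [60, 95_000], [65, 99_000]])
y_train = np.array([0, 0, 1, 1])

# Standardise each feature to zero mean and unit variance, then classify.
# Without the scaler, income would dominate the distance and age would contribute nothing.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X_train, y_train)            # "training" = storing the scaled data
print(model.predict([[28, 90_000]]))   # young but high-income query
```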
Manhattan distance, summing absolute differences along each axis, is less affected by extreme values in any one coordinate and is natural for grid-like data such as city blocks or pixel locations. Cosine similarity, by contrast, ignores magnitude entirely and measures only the angle between two vectors. This is exactly what you want for sparse text representations, where two documents may have very different word counts but cover the same topics; the angle between their term-frequency vectors captures topical similarity in a way that Euclidean distance, dominated by document length, cannot. Hamming distance, counting positions at which two binary vectors disagree, is the right choice for genetic data, hashed fingerprints, or any setting in which features are categorical with no ordering. Mahalanobis distance generalises Euclidean by introducing an inverse covariance matrix $\boldsymbol{\Sigma}^{-1}$ that rescales each direction according to how much the data varies along it; if some features are correlated or have wildly different variances, Mahalanobis effectively hands you back the Euclidean distance you would have got after a sensible whitening transformation. Choosing the right distance is half the battle in $k$-NN, and unfortunately the choice has to come from understanding the data, not from a formula.
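For reference, all five distances in the list above are available off the shelf in SciPy; a short sketch follows. Note that SciPy's Hamming distance returns the fraction of differing positions rather than the raw count, and that the Mahalanobis call expects the inverse covariance matrix, here estimated from a small random sample.

```python
import numpy as np
from scipy.spatial import distance

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])

print(distance.euclidean(x, y))     # L2
print(distance.cityblock(x, y))     # L1 / Manhattan
print(distance.cosine(x, y))        # 1 - cos(angle); 0 here because y = 2x
print(distance.hamming([0, 1, 1, 0], [0, 1, 0, 1]))  # fraction of differing positions

# Mahalanobis needs an inverse covariance matrix, estimated from data.
sample = np.random.default_rng(0).normal(size=(200, 3))
VI = np.linalg.inv(np.cov(sample, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```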
Choosing k
Larger $k$: smoother decision boundary, more bias, less variance. Smaller $k$: jagged boundary, low bias, high variance. $k = 1$: zero training error, because the boundary bends around every training point.
Rule of thumb: $k = \sqrt{n}$. Tune via cross-validation.
The bias–variance trade-off is rarely as transparent as it is here. With $k = 1$, the decision boundary threads its way around every individual training point: training error is zero by construction, but a single mislabelled or atypical example can carve out an entire pocket of incorrect predictions in its vicinity. The model has no resilience. As $k$ grows, each prediction is averaged over a larger neighbourhood and the influence of any single training point shrinks. This stabilises the predictions, variance falls, but it also blurs sharp features of the underlying class structure, introducing bias. At the other extreme, $k = n$ predicts the global majority class everywhere, which is the maximally biased and minimally variable model possible. Somewhere between $k = 1$ and $k = n$ lies a sweet spot, and the position of that sweet spot depends on the dataset.
The square-root heuristic, $k \approx \sqrt{n}$, has a respectable pedigree but no firm theoretical justification beyond "scale $k$ slowly with $n$". For $n = 100$ it suggests $k = 10$; for $n = 10{,}000$ it suggests $k = 100$. In practice, the only reliable way to choose $k$ is cross-validation: split the training data into folds, sweep $k$ over a grid such as $\{1, 3, 5, 7, 11, 15, 21, 31, 51\}$, evaluate held-out accuracy for each, and pick the $k$ that does best. Two refinements are worth knowing. First, prefer odd $k$ for binary classification to avoid ties. Second, the cross-validated curve of accuracy against $k$ is typically flat across a wide region; do not chase a fractional improvement at one specific $k$ without checking that it is reproducible across resamples of the data.
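A sketch of the cross-validated sweep with scikit-learn, on a synthetic dataset; the grid and the fold count are the ones suggested above, and everything else is an arbitrary illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

for k in (1, 3, 5, 7, 11, 15, 21, 31, 51):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold accuracy by default
    print(f"k={k:3d}  accuracy={scores.mean():.3f} ± {scores.std():.3f}")
```

Swapping `scoring='balanced_accuracy'` or `scoring='f1'` into `cross_val_score` tunes $k$ to a different metric, which matters for the reason discussed next.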
A subtle but practical consideration is that the optimal $k$ is sensitive to how you measure error. If false positives and false negatives carry different costs (a missed cancer is not the same as a false alarm), then a $k$ tuned to maximise overall accuracy may be wrong for the application. Tune to the metric that matters: balanced accuracy, $F_1$, area under the ROC curve, or expected clinical utility. The choice of $k$ is, like everything else in supervised learning, only as good as the loss function it serves.
Computational cost
Naive nearest-neighbour search: $O(nd)$ per query. For $n = 10^6$, $d = 100$: $10^8$ operations per query, on the order of tens of milliseconds on a single CPU core. Tractable but slow.
Acceleration: KD-trees ($d$ small), ball trees, locality-sensitive hashing (LSH), HNSW (hierarchical navigable small world graphs). Modern vector databases (FAISS, Pinecone, Milvus) use ANN methods to query billions of vectors in milliseconds.
The naive approach scales linearly with both the number of training points and the dimensionality. That is fine when you have ten thousand records of patient demographics, and ruinous when you have a hundred million sentence embeddings. KD-trees partition space along axis-aligned planes, halving the search region at each level; they reduce query time to $O(\log n)$ in low dimensions but degrade towards $O(n)$ once $d$ exceeds about twenty, because in high dimensions the bounding boxes overlap so much that pruning fails. Ball trees swap axis-aligned boxes for nested hyperspheres and behave a little better in moderate dimensions, but the same fundamental obstacle eventually catches them. Locality-sensitive hashing trades exactness for speed: random projections are arranged so that nearby points are likely to land in the same bucket, allowing sub-linear approximate retrieval at the cost of occasionally missing a true neighbour. HNSW builds a multi-layer graph in which long edges sit at the top and short edges at the bottom, then performs a greedy walk that descends through the layers; it is the dominant algorithm in modern vector databases and the one you should reach for when serving billions of embeddings. The same handful of techniques (KD-trees, ball trees, LSH, HNSW) powers FAISS, Pinecone, Milvus, Weaviate and every other production retrieval system, regardless of the marketing copy.
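To make the contrast concrete, here is a sketch of exact and approximate search with FAISS, assuming the library is installed; the database size and the HNSW connectivity parameter are illustrative values rather than recommendations.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                                # vector dimensionality
xb = np.random.rand(100_000, d).astype('float32')      # database vectors
xq = np.random.rand(5, d).astype('float32')            # query vectors

# Exact search: a full O(nd) scan of the database for every query.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
dist_exact, idx_exact = flat.search(xq, 10)            # top-10 neighbours per query

# Approximate search: an HNSW graph, sub-linear queries at a small recall cost.
hnsw = faiss.IndexHNSWFlat(d, 32)                      # 32 = graph connectivity (M)
hnsw.add(xb)
dist_approx, idx_approx = hnsw.search(xq, 10)
```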
Where k-NN works well
- Small data, simple problems: very fast to set up, no training.
- Local structure: when the function changes smoothly with neighbours.
- Ranking / retrieval: the retrieval step of every RAG pipeline is $k$-NN over embedding vectors.
- Anomaly detection: distance-to-k-th-nearest as outlier score.
Where it fails: high-dimensional data (curse of dimensionality kills distance meaningfulness), large datasets without ANN structures, problems where the relevant features have wildly different scales.
The strongest case for $k$-NN is the case in which you have a few thousand examples, a handful of well-chosen features, and a real need to ship something quickly. There is no model to debug, no hyperparameter sweep beyond $k$, and the predictions are immediately interpretable: the nearest neighbours can be displayed alongside the answer, giving the user a kind of evidence base. In medical decision support this transparency is genuinely valuable: clinicians trust an algorithm more when it can point to three previous similar patients than when it announces a probability from an opaque network. Anomaly detection is another natural fit: the distance to the $k$-th nearest training point is a principled outlier score, with no distributional assumptions baked in. And in any retrieval setting where the similarity of two items is what we ultimately care about (recommending a film, finding the closest paper, retrieving relevant context for a language model), $k$-NN is not a baseline so much as the actual deployed system.
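The anomaly-detection use mentioned above needs nothing beyond a neighbour query: score each point by its distance to its $k$-th nearest neighbour and flag the largest scores. A sketch with scikit-learn, in which the top-1% threshold is an arbitrary illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(500, 2)),      # bulk of the data
               rng.uniform(-6, 6, size=(5, 2))])     # a few scattered outliers

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own nearest neighbour
dists, _ = nn.kneighbors(X)
scores = dists[:, -1]                              # distance to the k-th true neighbour

threshold = np.quantile(scores, 0.99)
print("flagged as outliers:", np.where(scores > threshold)[0])
```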
The failure modes are the mirror image. In genuinely high-dimensional raw feature spaces, the curse of dimensionality flattens distances: random points in a thousand-dimensional cube are all roughly equidistant from each other, so "nearest" loses its meaning. Without scaled features, the wrong axis dominates. Without an ANN index, querying a million-row dataset becomes too slow for an interactive application. And without representative training data, $k$-NN cannot extrapolate at all: it is a memoriser, by design.
Modern descendants
In 2026, k-NN survives in the form of:
- Vector retrieval: every RAG system, every embedding-based search.
- Memory-augmented neural networks: external memory queried by attention.
- Few-shot learning via retrieval: kNN-LM (Khandelwal et al., 2020), retrieval-augmented language models.
The same algorithm, lifted by learned embeddings, becomes the backbone of retrieval in modern AI.
The reason $k$-NN still earns its keep half a century after Cover and Hart's analysis is that it has been hybridised with neural networks rather than displaced by them. A modern retrieval pipeline encodes every document into a 768- or 1024-dimensional vector using a transformer trained for semantic similarity; at query time, the user's question is encoded by the same model and the top-$k$ nearest documents in that learned space are pulled back as context. The HNSW graph that performs the lookup is doing exactly what $k$-NN has always done. Memory-augmented architectures such as Memorising Transformers and Retro extend the same idea inside the model itself, allowing a network to consult an external store of past activations through what is, structurally, a $k$-NN read. Khandelwal and colleagues' kNN-LM showed that interpolating a standard language model with a $k$-NN over its own training set's hidden states improved perplexity without any additional fine-tuning. The pattern is now ubiquitous: learn an embedding with deep networks, then run $k$-NN in the embedding space.
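Stripped of infrastructure, the retrieval step of such a pipeline fits in a dozen lines. A sketch using the sentence-transformers library; the model name is a commonly used example rather than a recommendation, and in production the plain dot-product scan below would be replaced by an ANN index such as HNSW.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

docs = [
    "Beta blockers reduce heart rate and blood pressure.",
    "The Voronoi diagram partitions the plane by nearest site.",
    "HNSW builds a layered graph for approximate nearest-neighbour search.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed example model
doc_vecs = model.encode(docs, normalize_embeddings=True)

query = "How do vector databases find similar items quickly?"
q_vec = model.encode([query], normalize_embeddings=True)[0]

# With normalised vectors, cosine similarity is a dot product; top-k is plain k-NN.
sims = doc_vecs @ q_vec
top_k = np.argsort(-sims)[:2]
for i in top_k:
    print(f"{sims[i]:.3f}  {docs[i]}")
```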
What you should take away
- The algorithm is just a vote among neighbours. Choose a distance, find the $k$ closest training points to the query, and let them decide. Classification by majority, regression by mean.
- The choice of $k$ is the bias–variance dial. Small $k$ overfits, large $k$ underfits, and cross-validation is the only reliable way to pick the right value for your data and your loss function.
- The choice of distance and feature scaling matters more than $k$. Standardise features before computing Euclidean distances; pick cosine for sparse text; pick Mahalanobis when features have correlated scales. Get this wrong and the algorithm is silently broken.
- Naive search is $O(nd)$ per query, but ANN structures rescue it. KD-trees, ball trees, LSH and HNSW reduce the cost from a full scan to a logarithmic or sub-linear lookup, and underpin every production vector database.
- $k$-NN is alive in modern AI as retrieval. Lifted into a learned embedding space by a transformer, the same vote-among-neighbours procedure becomes the engine of RAG, memory-augmented models and kNN-LM. The simplest classifier in the book is also the most widely deployed.