8.6 Density-based clustering: DBSCAN and HDBSCAN
The clustering methods we have seen so far carry awkward assumptions. K-means insists on spherical, roughly equal-sized clusters and demands the user specify $K$ in advance. Hierarchical clustering builds a dendrogram but still requires a cut. Gaussian mixture models, examined in §8.5, soften the assignment but keep the parametric shape. Real data does not always cooperate. Geographic survey points trace winding rivers, not tidy circles. Stars in a galactic plane sit in elongated arms. Customer purchase patterns concentrate around a few popular hubs while a long tail of unusual buyers floats in between.
DBSCAN, Density-Based Spatial Clustering of Applications with Noise, was introduced by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 (Ester et al., 1996). It reframes the problem from the ground up. Rather than partitioning every point, DBSCAN searches for dense regions of the feature space and treats whatever falls outside those regions as noise. The number of clusters $K$ is discovered, not declared. Cluster shapes can be arbitrary (concentric rings, crescents, spirals) because the method asks only one question of each point: are its neighbours close enough and numerous enough? Outliers are not awkward exceptions to be quietly absorbed; they are first-class outputs.
Where §8.5 treated clusters as probability mass under fitted Gaussians, §8.6 treats them as connected regions of high density. The two ideas are complementary, but the density view scales gracefully to shapes no Gaussian can describe. Today DBSCAN appears in geospatial analytics, anomaly detection, image segmentation, bioinformatics, and as a standard clustering tool in scikit-learn. Its hierarchical successor, HDBSCAN, removes DBSCAN's single most fragile parameter and has displaced it in many production pipelines.
8.6.1 The algorithm
DBSCAN takes two parameters. The radius $\varepsilon > 0$ controls the size of each point's neighbourhood, and $\mathrm{minPts} \in \mathbb{N}$ sets the density threshold a region must clear to count as a cluster. From these two numbers a small vocabulary follows.
The $\varepsilon$-neighbourhood of a point $\mathbf{x}$ collects the other points of the dataset $\mathcal{D}$ lying within distance $\varepsilon$ of it,
$$N_\varepsilon(\mathbf{x}) = \{\mathbf{y}\in\mathcal{D}\setminus\{\mathbf{x}\}: \lVert\mathbf{x}-\mathbf{y}\rVert\le\varepsilon\}.$$
A point is classified into one of three categories.
- Core points have at least $\mathrm{minPts}$ neighbours within $\varepsilon$, that is, $\lvert N_\varepsilon(\mathbf{x}) \rvert \ge \mathrm{minPts}$. They sit firmly inside a dense region. (Some texts and implementations, including scikit-learn's min_samples, count the point itself as one of its neighbours, which shifts the threshold by one.)
- Border points lie inside the $\varepsilon$-neighbourhood of a core point but are not themselves core. They cling to the edge of a dense region.
- Noise is whatever is left, points that are neither core nor border.
A point $\mathbf{y}$ is directly density-reachable from a core point $\mathbf{x}$ if $\mathbf{y} \in N_\varepsilon(\mathbf{x})$. The full reachability relation chains direct steps together: $\mathbf{y}$ is density-reachable from $\mathbf{x}$ if there is a sequence of points $\mathbf{x} = \mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_k = \mathbf{y}$ in which each is directly density-reachable from its predecessor (so $\mathbf{p}_1, \ldots, \mathbf{p}_{k-1}$ must all be core). Two points are density-connected if some core point reaches them both. Reachability is not symmetric (a border point cannot reach back), but density-connectedness is, and a cluster is a maximal set of mutually density-connected points: the core points that chain together, dragging their border points along with them.
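These categories are mechanical to compute. The short sketch below (illustrative only; the function name and the brute-force distance matrix are our choices, not a library's) labels each row of a small data matrix as core, border, or noise under the convention above, in which a point's neighbours do not include the point itself.

import numpy as np

def classify_points(X, eps, min_pts):
    """Return a list marking each row of X as 'core', 'border', or 'noise'."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    neighbours = (dists <= eps) & ~np.eye(n, dtype=bool)            # exclude the point itself
    is_core = neighbours.sum(axis=1) >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif neighbours[i, is_core].any():   # within eps of at least one core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels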
The procedure that turns these definitions into code is short. Mark every point unvisited. Pick an unvisited point $p$. Compute its neighbourhood. If the neighbourhood is too small, label $p$ noise, provisionally, because it might later turn out to be a border to some other core. Otherwise open a new cluster, seed it with $N_\varepsilon(p)$, and grow the cluster outward: for every seed $q$, add $q$ to the cluster, and if $q$ is itself core, fold its neighbours into the seed set. When the seed set empties, the cluster is finished. Move on to the next unvisited point.
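A direct transcription of that procedure, written here as a brute-force sketch for clarity (a real implementation would answer the neighbourhood queries with a spatial index, and all names below are ours rather than a library's), looks like this:

import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns -1 for noise and 0, 1, ... for cluster labels."""
    n = len(X)
    labels = np.full(n, -2)              # -2 = unvisited, -1 = noise
    cluster = -1

    def region_query(i):                 # neighbours of point i, excluding i itself
        d = np.linalg.norm(X - X[i], axis=1)
        idx = np.where(d <= eps)[0]
        return idx[idx != i]

    for p in range(n):
        if labels[p] != -2:
            continue
        neighbours = region_query(p)
        if len(neighbours) < min_pts:
            labels[p] = -1               # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[p] = cluster
        seeds = list(neighbours)
        while seeds:                     # grow the cluster outward
            q = seeds.pop()
            if labels[q] == -1:          # previously noise: absorb it as a border point
                labels[q] = cluster
            if labels[q] != -2:
                continue
            labels[q] = cluster
            q_neighbours = region_query(q)
            if len(q_neighbours) >= min_pts:   # q is core, so fold its neighbours into the seeds
                seeds.extend(q_neighbours)
    return labels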
The cost is dominated by neighbourhood queries. Naively each query costs $O(n)$ and the algorithm is $O(n^2)$. With a spatial index (KD-tree, Ball-tree, or R*-tree), each query drops to roughly $O(\log n)$ on average, and the algorithm runs in $O(n \log n)$. This is what makes DBSCAN practical on millions of low-to-moderate-dimensional points such as GPS coordinates, image pixels, or two-dimensional embeddings.
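In practice the queries are delegated to a library. Scikit-learn's NearestNeighbors, for example, can answer every $\varepsilon$-neighbourhood query from a KD-tree in one call; the data, radius and core threshold in the sketch below are placeholders.

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(10_000, 2)   # stand-in for real low-dimensional data
nn = NearestNeighbors(radius=0.05, algorithm="kd_tree").fit(X)

# For every point, the indices of all neighbours within the radius (the point itself included).
neighbourhoods = nn.radius_neighbors(X, return_distance=False)
is_core = np.array([len(idx) - 1 >= 5 for idx in neighbourhoods])   # minus one drops the point itself
print("core points:", is_core.sum(), "of", len(X))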
8.6.2 Worked example
Imagine ten points on a plane: six clustered tightly in the bottom-left around the origin, three loosely spaced in the top-right, and one isolated point off to the side. We set $\varepsilon = 1.0$ and $\mathrm{minPts} = 3$.
The tight cluster has points at $(0, 0), (0.3, 0.2), (-0.2, 0.4), (0.4, -0.3), (0.1, 0.1)$ and $(-0.3, -0.2)$. Each one's $\varepsilon = 1.0$ neighbourhood contains the other five; they are all comfortably within unit distance of one another. Every point therefore has five neighbours, well above $\mathrm{minPts} = 3$. All six are core. Their neighbourhoods overlap, so they are mutually density-connected and the algorithm fuses them into a single cluster.
The loose cluster has points at $(5, 5), (5.6, 5.0)$ and $(5.0, 5.7)$. Each one finds two neighbours within radius $1.0$, the other two of the trio. With $\mathrm{minPts} = 3$ the count falls one short of the threshold. None of them is core. None can therefore seed a cluster. They are eligible only as borders to some core elsewhere, and there is no core point within $\varepsilon$. So all three are labelled noise, and what looked like a cluster to the human eye dissolves entirely.
The lonely outlier sits at $(10, 10)$ with no neighbours within radius $1.0$. It is trivially noise.
The lesson is sharp. With these settings DBSCAN finds exactly one cluster, the tight bottom-left group, and labels four points (the loose trio plus the lonely outlier) as noise. Compare this to k-means with $K = 2$, which would happily split the data into the six tight points on one side and a second "cluster" of the loose trio plus the outlier on the other, ignoring the structural difference between dense and sparse. Compare it also to k-means with $K = 3$, which would invent a cluster around the outlier rather than admit it is anomalous.
If we relax to $\mathrm{minPts} = 2$, the loose cluster becomes a cluster: each of its three points has two neighbours, meeting the threshold. We would then have two clusters and one noise point. Raising $\varepsilon$ instead, while keeping $\mathrm{minPts} = 3$, does not rescue the trio: they already see one another at $\varepsilon = 1.0$, and no further points come within range until $\varepsilon$ grows to roughly the gap separating the groups, at which point everything collapses into a single blob. The exercise illustrates the central tension that the next subsection addresses: tweaking the parameters rescues some clusters, dissolves others, and eventually merges everything; there is no single globally correct setting.
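The example is small enough to verify by machine. The sketch below runs scikit-learn's DBSCAN on the ten points; because scikit-learn's min_samples counts the point itself, $\mathrm{minPts} = 3$ in our convention corresponds to min_samples=4 and $\mathrm{minPts} = 2$ to min_samples=3.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.3, 0.2], [-0.2, 0.4], [0.4, -0.3], [0.1, 0.1], [-0.3, -0.2],  # tight cluster
    [5.0, 5.0], [5.6, 5.0], [5.0, 5.7],                                          # loose trio
    [10.0, 10.0],                                                                # lone outlier
])

strict = DBSCAN(eps=1.0, min_samples=4).fit(X)    # minPts = 3 in the text's convention
print(strict.labels_)    # expect one cluster of six and four noise labels (-1)

relaxed = DBSCAN(eps=1.0, min_samples=3).fit(X)   # minPts = 2
print(relaxed.labels_)   # expect two clusters and a single -1 for the outlier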
8.6.3 Strengths and weaknesses
The strengths follow directly from the algorithm. There is no need to specify $K$ in advance: clusters appear wherever the density permits. Cluster shapes are unconstrained: spirals, rings, and elongated bands all emerge naturally. Outliers are not forced into the nearest cluster; they are flagged as noise. With a spatial index the running time is $O(n \log n)$, which scales to tens of millions of points in two or three dimensions. The two parameters have intuitive geometric meanings: the radius of a "neighbourhood" and the minimum number of points it should hold. Many real datasets (taxi pickup locations, supermarket shoppers' postcodes, lightning strikes) have exactly the kind of irregular, density-varying structure DBSCAN was built to find.
The weaknesses are equally direct. DBSCAN is sensitive to $\varepsilon$. Too small and dense clusters fragment into shards while sparse ones vanish. Too large and everything merges into a single blob spanning the dataset. A useful heuristic is the k-distance plot: sort the points by the distance to their $k$-th nearest neighbour, with $k$ on the order of $\mathrm{minPts}$, and look for a clear elbow; the distance at the elbow separates dense interior points from sparse outliers and is a reasonable choice of $\varepsilon$. A common rule of thumb takes $\mathrm{minPts} \ge d + 1$ where $d$ is the data's dimensionality, with $\mathrm{minPts} = 4$ adequate for two-dimensional data.
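The plot itself is a few lines with scikit-learn and matplotlib; in the sketch below the data array is a synthetic stand-in and $k = 4$ is chosen purely for illustration.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(1_000, 2)    # stand-in for real data
k = 4                           # on the order of minPts

# k + 1 neighbours because each point is returned as its own nearest neighbour
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])  # distance to the k-th other neighbour, sorted

plt.plot(k_dist)
plt.xlabel("points, sorted by k-distance")
plt.ylabel(f"distance to {k}th nearest neighbour")
plt.show()                      # the height of the elbow is a candidate epsilon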
The deeper limitation is that a single global $\varepsilon$ cannot adapt to clusters of differing densities. If one cluster is tightly packed and another is loose, no single radius captures both. Set $\varepsilon$ small to resolve the tight cluster and the loose one shatters into noise. Set $\varepsilon$ large to capture the loose cluster and the tight one merges with its neighbours. Density-varying landscapes, common in geospatial, biological and astronomical data, defeat plain DBSCAN.
High-dimensional data introduces another problem. As dimensionality grows, distances between points concentrate around a common value: nearest and farthest neighbours become indistinguishable. The $\varepsilon$-neighbourhood loses its discriminating power. DBSCAN, like every distance-based method, suffers from this curse of dimensionality and is generally unhelpful beyond about ten or fifteen features without dimensionality reduction (such as PCA or UMAP) first.
Finally, the assignment of border points is not deterministic. A border point in the neighbourhood of two competing core points can be assigned to either cluster depending on processing order. This is a small concern in practice but breaks strict reproducibility.
8.6.4 HDBSCAN
HDBSCAN, Hierarchical DBSCAN, was introduced by Ricardo Campello, Davoud Moulavi and Jörg Sander in 2013 (Campello et al., 2013). It removes the global $\varepsilon$ choice, builds a hierarchy of clusters across all density levels instead, and then selects the most stable clusters automatically.
The algorithm proceeds in three stages. First, it computes the mutual reachability distance between points,
$$d_{\text{mreach}}(\mathbf{x},\mathbf{y}) = \max\{d_{\text{core}}(\mathbf{x}),\, d_{\text{core}}(\mathbf{y}),\, d(\mathbf{x},\mathbf{y})\},$$
where $d_{\text{core}}(\mathbf{x})$ is the distance from $\mathbf{x}$ to its $\mathrm{minPts}$-th nearest neighbour. This metric pushes sparse points further from one another while leaving dense regions intact. Second, it builds a minimum spanning tree on the mutual reachability distances and converts the tree into a hierarchical cluster tree by removing edges in decreasing order of weight. Third, it scores each candidate cluster by its excess of mass, a measure of how long the cluster persists across density thresholds, and picks the partition that maximises total stability.
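The mutual reachability distance is easy to compute directly for small datasets; the brute-force sketch below (our own helper, not part of the hdbscan package) follows the definition above.

import numpy as np

def mutual_reachability(X, min_pts):
    """Pairwise mutual reachability distances between the rows of X."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Core distance: distance to the min_pts-th nearest neighbour (each sorted row starts
    # with the zero distance to the point itself, so index min_pts skips it).
    core = np.sort(d, axis=1)[:, min_pts]
    # d_mreach(x, y) = max(core(x), core(y), d(x, y)), computed by broadcasting.
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))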
The main user parameter is $\mathrm{min\_cluster\_size}$, the smallest grouping the user is willing to call a cluster. There is no $\varepsilon$. Clusters of differing densities coexist naturally because the algorithm considers every density level in turn rather than fixing one. The output, in addition to cluster labels, includes a soft membership score for each point and a robust noise label for points that never join a stable cluster.
HDBSCAN is computationally heavier than DBSCAN (typically $O(n \log n)$ with careful indexing, though with a larger constant), but its results are more reliable and require less hand-tuning. It is now the default density-based method in many production pipelines, especially those handling embeddings produced by neural networks or dimensionality-reduction methods like UMAP (see §8.11).
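Running both on the same feature matrix takes only a few lines. In the snippet below the data matrix is left as a placeholder; DBSCAN comes from scikit-learn and HDBSCAN from the separate hdbscan package.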
import numpy as np
from sklearn.cluster import DBSCAN
import hdbscan  # separate package; recent scikit-learn versions also ship an HDBSCAN class

X = ...  # n x d feature matrix, scaled so that Euclidean distances are meaningful

# scikit-learn's min_samples counts the point itself when testing the core condition
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# HDBSCAN needs only the smallest grouping we are willing to call a cluster;
# hdb.probabilities_ holds each point's soft membership score
hdb = hdbscan.HDBSCAN(min_cluster_size=20).fit(X)

print("DBSCAN labels:", np.unique(db.labels_), "(-1 = noise)")
print("HDBSCAN labels:", np.unique(hdb.labels_), "(-1 = noise)")
8.6.5 Where DBSCAN is used
- Geospatial analytics. Clustering GPS coordinates of taxi pickups, mobile-phone pings, ship AIS tracks, or wildlife sightings. Dense regions correspond to hotspots; the noise label catches solitary or anomalous points. Satellite-imagery pipelines use DBSCAN to group hot pixels into fires, ships, or buildings.
- Anomaly detection. Because every point not in a cluster is noise, DBSCAN doubles as an outlier detector. Network intrusion logs, credit card transactions and sensor streams are routinely clustered with DBSCAN to surface unusual events automatically.
- Bioinformatics. Clustering single-cell RNA-seq embeddings, protein–protein interaction networks, and phylogenetic profiles, where the natural "cell types" or "functional groups" rarely look spherical and outliers are biologically meaningful.
- Image segmentation. Pixels in a colour-spatial feature space form natural density clusters along edges and within regions; DBSCAN groups them without needing to commit to a number of segments.
- Astronomy. Identifying star clusters, galaxy filaments, and gravitational-lensing arcs, none of which fit the spherical assumption baked into k-means.
- Customer analytics. Density-based clusters of purchase histories or session-behaviour embeddings reveal natural cohorts; the noise label flags one-off or fraudulent users.
In modern pipelines, DBSCAN and HDBSCAN are often the second step after a non-linear embedding such as UMAP. The embedding contracts the data into a low-dimensional space where density becomes meaningful again, and the density-based clusterer carves it cleanly.
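A sketch of that two-step pipeline with the umap-learn and hdbscan packages follows; the array shapes and every parameter value here are illustrative rather than recommendations.

import numpy as np
import umap
import hdbscan

X = np.random.rand(5_000, 512)   # stand-in for high-dimensional embeddings

# Contract to a low-dimensional space where density is meaningful again.
X_low = umap.UMAP(n_components=5, n_neighbors=30, min_dist=0.0).fit_transform(X)

# Density-based clustering on the embedded points; -1 marks noise.
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(X_low)
print("clusters:", labels.max() + 1, "noise points:", int((labels == -1).sum()))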
8.6.6 What you should take away
- Density-based clustering needs no $K$. DBSCAN discovers the number of clusters from the data, recovers arbitrary shapes, and labels outliers explicitly as noise.
- The vocabulary is small but precise. Every point is core, border, or noise. Clusters are maximal density-connected components, chains of overlapping core neighbourhoods, trailing their borders.
- Two parameters, one fragility. $\varepsilon$ and $\mathrm{minPts}$ both matter, but $\varepsilon$ is by far the more sensitive. The k-distance plot helps, yet a single global radius cannot serve clusters of differing densities.
- HDBSCAN inherits the strengths and removes the flaw. By examining all density levels at once and selecting clusters by stability, HDBSCAN eliminates the $\varepsilon$ choice and handles density-varying landscapes that defeat plain DBSCAN.
- Pair density-based clustering with sensible preprocessing. Distance-based density loses meaning in high dimensions; reduce dimensionality first with PCA or UMAP, scale features so the metric is meaningful, and treat the noise label not as failure but as one of the algorithm's most useful outputs.