8.11 UMAP

Uniform Manifold Approximation and Projection (McInnes et al., 2018), known universally by its acronym UMAP, is the modern workhorse for visualising high-dimensional data. Leland McInnes and his collaborators released it in 2018, and within two years it had displaced t-SNE as the default choice in single-cell genomics, in natural language embedding visualisation, and in any setting where an analyst wants a quick two-dimensional sketch of how a few hundred thousand vectors are arranged. UMAP is faster than t-SNE, scales to millions of points without exotic hardware, preserves more of the global geometry, and produces embeddings that, in practice, are easier to read at a glance.

The mathematics underpinning UMAP is dressed in the language of fuzzy simplicial sets, Riemannian geometry and category theory, which can make the original paper forbidding. Strip the topology away, however, and the algorithm is straightforward: build a weighted neighbour graph in the high-dimensional space, build a similar graph in two or three dimensions, and shuffle the low-dimensional coordinates until the two graphs agree. We will keep the framing concrete throughout this section. The first author has a habit of writing software that is both faster and more user-friendly than its theoretical pedigree suggests, and UMAP is the canonical example.

UMAP solves the same problem as t-SNE (§8.10): produce a faithful 2D or 3D picture of high-dimensional data. But it optimises a different objective on a different graph, with different consequences for what the resulting picture means. Both algorithms produce embeddings; neither preserves absolute distances; both should be used as exploratory tools, not as evidence.

Symbols Used Here
$\mathbf{x}_i$: high-dimensional data point
$\mathbf{y}_i$: low-dimensional embedding of point $i$
$n$: number of points
$d$: dimensionality of the high-dimensional space
$k$: number of nearest neighbours

What UMAP does

UMAP unfolds in three stages. First, for each point $\mathbf{x}_i$ in the input, locate its $k$ nearest neighbours under whatever metric is appropriate (Euclidean for raw features, cosine for unit-normalised embeddings, Hamming for binary vectors). The local distance scale around point $i$ is set by $\rho_i$, the distance to the nearest neighbour, and a bandwidth $\sigma_i$ chosen so that the membership strengths over the $k$ neighbours sum to $\log_2 k$, which fixes the effective size of each neighbourhood. The membership strength of edge $(i,j)$ in the high-dimensional fuzzy graph is

$$ \mu_{j\mid i} = \exp\!\Bigl(-\max\bigl(0,\,d(\mathbf{x}_i,\mathbf{x}_j)-\rho_i\bigr)/\sigma_i\Bigr), $$

which is one when $j$ is the closest neighbour of $i$ and decays exponentially for more distant points. The graph is then symmetrised by fuzzy union: $\mu_{ij} = \mu_{j\mid i} + \mu_{i\mid j} - \mu_{j\mid i}\mu_{i\mid j}$. This bakes in the assumption that the data lie on a manifold whose local geometry is approximately Riemannian: neighbourhoods are well defined, but global distances need not be.
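
Stage one is small enough to sketch directly. A minimal sketch, assuming a dataset small enough that exact nearest-neighbour search and a dense $n \times n$ matrix are affordable; the real implementation uses approximate NN-descent and sparse storage, but computes the same quantities. The function name `fuzzy_graph` is ours, not the library's.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fuzzy_graph(X, k=15):
    """Stage one: k-NN search, local scaling, fuzzy union."""
    # Exact k-NN here; the real implementation uses approximate NN-descent.
    dists, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dists, idx = dists[:, 1:], idx[:, 1:]    # drop each point's self-match
    rho = dists[:, 0]                        # distance to nearest neighbour

    n = len(X)
    mu = np.zeros((n, n))
    for i in range(n):
        # Bisect sigma_i until the memberships sum to log2(k).
        lo, hi = 1e-8, 1e4
        for _ in range(64):
            sigma = 0.5 * (lo + hi)
            total = np.exp(-np.maximum(0, dists[i] - rho[i]) / sigma).sum()
            lo, hi = (sigma, hi) if total < np.log2(k) else (lo, sigma)
        mu[i, idx[i]] = np.exp(-np.maximum(0, dists[i] - rho[i]) / sigma)

    return mu + mu.T - mu * mu.T             # fuzzy union: symmetric graph

graph = fuzzy_graph(np.random.default_rng(0).normal(size=(200, 10)))
```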

Second, initialise the low-dimensional coordinates $\mathbf{y}_i$. UMAP uses the spectral embedding of the symmetrised graph (a Laplacian eigenmap) by default, which gives a sensible global layout before optimisation begins. This initialisation is one reason UMAP preserves global structure better than t-SNE, whose default random initialisation discards any global information from the start.
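
The initialisation can be sketched with a dense Laplacian eigenmap, reusing the `graph` array from the sketch above; umap-learn does the same thing sparsely and also handles disconnected graph components.

```python
import numpy as np

def spectral_init(graph, dim=2):
    """Laplacian-eigenmap initialisation (dense sketch)."""
    deg = graph.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    # Symmetric normalised Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    L = np.eye(len(graph)) - d_inv_sqrt[:, None] * graph * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    return vecs[:, 1:dim + 1]                # drop the trivial zero mode

init = spectral_init(graph)                  # 'graph' from the sketch above
```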

Third, optimise the low-dimensional coordinates by stochastic gradient descent. The low-dimensional similarity is parameterised by

$$ \nu_{ij} = \bigl(1 + a\,\lVert \mathbf{y}_i - \mathbf{y}_j \rVert^{2b}\bigr)^{-1}, $$

where $a$ and $b$ are fitted from the user-set min_dist. The loss for each edge is the binary cross-entropy

$$ -\mu_{ij}\log\nu_{ij} - (1-\mu_{ij})\log(1-\nu_{ij}), $$

so neighbours in the high-dim graph attract in the embedding while non-neighbours repel. UMAP draws negative samples (random non-edges) at each step rather than computing the full repulsive sum, which is what gives it its $O(n)$-per-epoch character and makes it scale to millions of points.
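
The loop below is a deliberately naive sketch of this optimisation, reusing `graph` and `init` from the sketches above. The constants $a \approx 1.58$ and $b \approx 0.90$ are roughly what the library fits for the default min_dist of 0.1, and the gradient clipping to $\pm 4$ mirrors the real implementation; the Bernoulli edge sampling is a crude stand-in for UMAP's per-edge sampling schedule.

```python
import numpy as np

def sgd_epoch(y, graph, a=1.577, b=0.895, lr=1.0, n_neg=5, rng=None):
    """One epoch of attract/repel updates (simplified edge sampling)."""
    rng = np.random.default_rng(0) if rng is None else rng
    for i, j in zip(*np.nonzero(graph)):
        if rng.random() > graph[i, j]:       # sample edges by membership
            continue
        # Attraction: gradient of -log(nu_ij) pulls i and j together.
        diff = y[i] - y[j]
        d2 = diff @ diff + 1e-12
        coeff = (-2.0 * a * b * d2 ** (b - 1)) / (1.0 + a * d2 ** b)
        y[i] += lr * np.clip(coeff, -4, 4) * diff
        y[j] -= lr * np.clip(coeff, -4, 4) * diff
        # Repulsion: gradient of -log(1 - nu_ik) for random negatives k.
        for _ in range(n_neg):
            k = rng.integers(len(y))
            diff = y[i] - y[k]
            d2 = diff @ diff + 1e-3
            coeff = (2.0 * b) / (d2 * (1.0 + a * d2 ** b))
            y[i] += lr * np.clip(coeff, -4, 4) * diff
    return y

rng = np.random.default_rng(42)
y = 10.0 * init                              # scaled spectral initialisation
for epoch in range(200):
    y = sgd_epoch(y, graph, lr=1.0 - epoch / 200, rng=rng)
```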

The contrast with t-SNE's KL-divergence-on-joint-probabilities objective is worth pausing on. t-SNE asks: does the joint probability $p_{ij}$ over all pairs match $q_{ij}$? UMAP asks: for each edge independently, is the membership preserved? The shift from a global probability distribution to per-edge cross-entropy is what frees UMAP from t-SNE's normalisation constant, which in turn is what enables negative sampling and the speed advantage. The geometric intuition is also slightly different: t-SNE places points so that the overall similarity distribution is preserved; UMAP places points so that the connectivity of the neighbour graph is preserved. Both goals are reasonable; they produce different pictures.

Hyperparameters

UMAP exposes four hyperparameters that an analyst will actually touch.

n_neighbors controls the size of the local neighbourhood used to construct the high-dim graph; the default of 15 is a good starting point. Smaller values (2–10) emphasise micro-structure: the embedding will fragment into many small clusters and reflect the fine texture of the data. Larger values (30–200) emphasise the global picture: the embedding will look smoother, larger structures will be more faithfully placed, but small clusters will dissolve. As a heuristic, set n_neighbors to the rough size of the smallest meaningful subpopulation you expect to see: if you suspect rare cell types of around 30 cells in a single-cell dataset, use 30; if you only care about gross continental structure in a population genetics dataset, use 200.

min_dist controls the minimum separation between points in the embedding. The default 0.1 produces tight clusters with clear gaps between them. Values near 0 (e.g. 0.01) pack points together so that the cluster shapes are visible but the inter-cluster gaps shrink, which can be useful when you care about local topology more than separation. Values near 0.5 spread the embedding out, which is helpful when you want to see continuous trajectories rather than discrete clumps. min_dist does not change the topology of the embedding, only its visual presentation.
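
Both parameters are cheap to sweep, and sweeping them is the best defence against over-reading a single plot (see also the stability advice under common pitfalls below). A sketch, assuming umap-learn and matplotlib are installed, with scikit-learn's digits dataset standing in for your own data:

```python
import matplotlib.pyplot as plt
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)          # stand-in dataset (1,797 points)

# Sweep the two parameters that most change the story the plot tells.
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for row, min_dist in enumerate([0.01, 0.5]):
    for col, n_neighbors in enumerate([5, 15, 50]):
        emb = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                        random_state=42).fit_transform(X)
        axes[row, col].scatter(emb[:, 0], emb[:, 1], c=y, s=2, cmap="tab10")
        axes[row, col].set_title(f"n_neighbors={n_neighbors}, min_dist={min_dist}")
plt.tight_layout()
plt.show()
```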

n_components is the dimensionality of the embedding: 2 for figures, 3 for interactive plots, but UMAP will happily produce 10- or 50-dimensional embeddings that can be fed into downstream classifiers. UMAP as a general-purpose dimensionality reducer (not just a visualiser) is one of its quieter advantages over t-SNE, which is in practice limited to 2D and 3D.
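
A sketch of that use, with the digits dataset again standing in for real features. Note that the reducer is fitted on the training split only and then applied to the test split, for the reasons discussed under common pitfalls below:

```python
import umap
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fit the 10-dimensional reducer on training data only, then project.
reducer = umap.UMAP(n_components=10, random_state=42).fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(reducer.transform(X_tr), y_tr)
print(clf.score(reducer.transform(X_te), y_te))
```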

metric chooses the distance function in the input space. Euclidean is the default. For text or image embeddings produced by a contrastive model, use cosine: Euclidean distance on unnormalised embeddings will be dominated by vector norms, which usually carry no semantic content. For binary feature vectors, use hamming or jaccard. UMAP supports around twenty metrics out of the box and will accept a custom callable.

A fifth parameter, random_state, is technically a hyperparameter too. UMAP is stochastic; setting random_state=42 (or any fixed seed) makes the embedding reproducible for a given input.

There are also two parameters that the user rarely tunes but should be aware of. n_epochs defaults to 500 for small datasets and 200 for larger ones; increasing it slightly can help when the optimisation has not visibly settled, but the returns diminish quickly. learning_rate defaults to 1.0 and almost never needs adjustment; if you find yourself reaching for it, the problem is more likely with the input scaling or the metric.

UMAP versus t-SNE

Both algorithms preserve neighbourhoods, both fail to preserve absolute distances, and both should be treated as exploratory pictures rather than evidence. The differences live at the margin and matter for everyday work.

UMAP is faster, typically 5–20× faster than t-SNE on the same data, and its negative-sampling SGD scales to a million points in a few minutes on a laptop. t-SNE, even with the Barnes-Hut approximation, becomes painful above a few hundred thousand points. UMAP also preserves more global structure, mainly because its spectral initialisation seeds the optimisation with a sensible global layout. In a t-SNE plot of the MNIST digits, the ten clusters are well separated but their relative positions are essentially random; in a UMAP plot, digits that look alike (4 and 9, 3 and 5 and 8) tend to sit near each other.

t-SNE has a longer pedigree and, in some specific cases, separates fine-grained local clusters more cleanly. The crowding-effect correction from the Student-t kernel pushes well-separated clusters far apart, which produces visually striking plots that some practitioners prefer. UMAP's clusters can look smaller and tighter, which can mask sub-structure unless n_neighbors and min_dist are tuned.

Determinism is another distinction. UMAP is not deterministic (the SGD and the negative-sample draws depend on the random seed), but it is what one might call seed-stable: the same data with the same seed produces the same embedding to within floating-point error, and different seeds produce visually similar layouts up to rotation and reflection. t-SNE is more chaotic in this respect; different seeds can give qualitatively different cluster arrangements.

A practical comparison helps. On the MNIST 70,000-image dataset, t-SNE with Barnes-Hut takes around 5 minutes on a modern laptop and produces ten clean digit clusters with random global arrangement. UMAP on the same data takes around 30 seconds and produces ten digit clusters whose positions reflect visual similarity (curvy digits near curvy digits, straight near straight). On the 1.3 million cells of the Tabula Sapiens human cell atlas, t-SNE is impractical without exotic distributed implementations; UMAP completes in under ten minutes on a single workstation and is what every paper from that consortium uses.

UMAP should be your default in 2026, t-SNE should be your sanity check, and any conclusion that depends on which algorithm you used is a conclusion you should not be drawing.

Common pitfalls

UMAP plots are easy to over-interpret. The most common mistake is to read meaningful structure into clusters that the algorithm has produced from noise. Run UMAP on Gaussian noise with n_neighbors=15 and you will get a plot with apparent clusters; the algorithm's job is to place points, and place them it will. Always verify any claim about clusters using an independent method: for instance, label the points by an external variable and check whether the colouring respects the cluster boundaries, or run a clustering algorithm directly on the high-dim data and confirm the cluster assignments match.
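
The noise experiment takes a few lines to run yourself:

```python
import matplotlib.pyplot as plt
import numpy as np
import umap

# 2,000 points of pure Gaussian noise: there is no real structure to find.
noise = np.random.default_rng(0).normal(size=(2000, 50))
emb = umap.UMAP(n_neighbors=15, random_state=42).fit_transform(noise)

plt.scatter(emb[:, 0], emb[:, 1], s=2)
plt.title("UMAP on Gaussian noise: apparent structure, none of it real")
plt.show()
```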

Distances are not preserved. The Euclidean distance between two points in a UMAP plot says almost nothing about their distance in the original space. Two points that are far apart in the embedding may be near neighbours in high-dim if the algorithm placed them on opposite sides of a topological loop, and two points that are close in the embedding may be moderately distant in the data. Densities are also not preserved: clusters that look tight in UMAP can correspond to either tight or loose populations.

Hyperparameter sensitivity is real. The same dataset rendered with n_neighbors=5 and n_neighbors=50 can produce visually different plots that tell different stories. Run UMAP at several settings and look for features that are stable across all of them; treat any feature visible at only one setting as suspect.

Stochasticity matters. Always set random_state to a fixed value when reporting an embedding. A figure in a paper that someone else cannot reproduce because the seed was unset is a small but persistent embarrassment.

Finally, beware the trap of chaining UMAP into a downstream pipeline as if it were a clean preprocessing step. UMAP coordinates are not features in any statistically meaningful sense; they are an optimisation artefact tied to the specific points used during fitting. New points cannot be projected reliably without the transform method, and even then the distortion is unpredictable. If you need stable low-dim features for a classifier, use PCA or an autoencoder; if you need a picture, use UMAP and stop there.
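
For completeness, the transform pattern looks like this (digits again as a stand-in); treat the projected coordinates of new points as indicative only:

```python
import umap
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, _ = load_digits(return_X_y=True)
X_fit, X_new = train_test_split(X, test_size=0.2, random_state=0)

reducer = umap.UMAP(random_state=42).fit(X_fit)
emb_fit = reducer.embedding_         # coordinates of the points used in fitting
emb_new = reducer.transform(X_new)   # projection of unseen points: use with care
```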

Where UMAP is used

In single-cell genomics, UMAP is now the default visualisation for scRNA-seq, ATAC-seq and CITE-seq data. The Human Cell Atlas and almost every Cell paper involving single-cell sequencing in the last five years uses a UMAP plot to show cell types. The trade-off, that the embedding is exploratory and the cluster boundaries are fuzzy, is well understood in the field.

In natural language processing, UMAP is the standard tool for visualising the embedding spaces of BERT, GPT, sentence-transformer and CLIP models. Word2vec analogies, BERT layer-wise probes, and clustering of LLM activations are all routinely visualised through UMAP. The cosine metric is essential here.

In computer vision, UMAP is used to visualise the penultimate-layer features of trained CNNs and Vision Transformers, both for sanity-checking that the representation has captured class structure and for finding mislabelled training examples that sit far from their class cluster.

In anomaly detection and exploratory data analysis more broadly, UMAP is a quick way to spot outliers: true outliers usually appear as isolated points or tiny clusters at the periphery of the embedding. This is not a replacement for proper anomaly-detection algorithms (§8.15), but it is a useful first pass.

In drug discovery, UMAP is used to visualise chemical fingerprint spaces, protein embedding spaces (ESM-2, AlphaFold latents) and patient phenotype spaces. The clinical genomics community has adopted it for visualising cohort heterogeneity in rare-disease studies.

In clinical informatics more broadly, UMAP is now a standard part of the exploratory toolkit for electronic health record embeddings, medical imaging feature spaces (e.g. radiology CNN penultimate layers) and patient-trajectory analyses where each patient is represented by a vector of timed events. The clinical reader should treat any UMAP plot in a paper with the same scepticism they would apply to a histogram with a hand-chosen bin width: useful for orientation, dangerous as evidence.

A complete invocation, with the hyperparameters discussed above:

```python
import numpy as np
import umap

# X is any (n_samples, n_features) array; random data stands in here.
X = np.random.default_rng(0).normal(size=(1000, 50))

emb = umap.UMAP(
    n_neighbors=15,     # locality of the neighbour graph
    min_dist=0.1,       # visual tightness of the embedding
    metric="cosine",    # match the geometry of the input space
    random_state=42,    # fixed seed for reproducibility
).fit_transform(X)      # emb has shape (1000, 2)
```

What you should take away

  1. UMAP is a fuzzy-graph embedding algorithm. Build a weighted $k$-NN graph in high-dim, build a parametric graph in low-dim, minimise binary cross-entropy between them with negative-sample SGD.
  2. Four hyperparameters matter. n_neighbors (locality), min_dist (visual tightness), n_components (output dim), metric (distance function). Always set random_state.
  3. UMAP beats t-SNE on speed, scale and global structure, but neither preserves absolute distances. Use UMAP as the default; check with t-SNE if a finding is important.
  4. Do not over-interpret UMAP plots. Apparent clusters need verification with an independent method; inter-cluster distances and densities are not faithful.
  5. UMAP is the default visualisation in single-cell genomics, NLP embedding analysis and computer-vision feature inspection. It scales to millions of points on a laptop and produces embeddings that downstream classifiers can also consume.
