8.10 t-SNE
t-distributed stochastic neighbour embedding (t-SNE) (van der Maaten and Hinton, 2008) is a non-linear dimensionality reduction technique designed for visualisation. It preserves local neighbour relationships at the cost of distorting global structure.
8.10.1 The objective
In the input space, define the conditional probability that point $j$ is a neighbour of point $i$:
$$ p_{j\mid i} = \frac{\exp(-\lVert\mathbf{x}^{(i)}-\mathbf{x}^{(j)}\rVert^2/(2\sigma_i^2))}{\sum_{k\neq i}\exp(-\lVert\mathbf{x}^{(i)}-\mathbf{x}^{(k)}\rVert^2/(2\sigma_i^2))}. $$
The bandwidth $\sigma_i$ is chosen per point so that the Shannon entropy (in bits) of $p_{\cdot\mid i}$ equals $\log_2(\text{Perp})$ for a user-specified perplexity Perp (typical values 5--50). Symmetrise: $p_{ij} = (p_{j\mid i}+p_{i\mid j})/(2n)$, where $n$ is the number of points.
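A minimal NumPy sketch of this construction (function and variable names are illustrative, not taken from any library): for each point, bisect on the precision $\beta_i = 1/(2\sigma_i^2)$ until the row entropy matches $\log_2(\text{Perp})$, then symmetrise.

import numpy as np

def joint_p(X, perplexity=30.0, tol=1e-5, max_iter=50):
    """Symmetrised t-SNE input affinities p_ij (didactic O(n^2) sketch)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T      # squared Euclidean distances
    target = np.log2(perplexity)                        # desired row entropy in bits
    P = np.zeros((n, n))
    for i in range(n):
        d = np.delete(D[i], i)
        beta, lo, hi = 1.0, 0.0, np.inf                 # beta = 1 / (2 sigma_i^2)
        for _ in range(max_iter):
            p = np.exp(-beta * d)
            p /= p.sum()
            H = -np.sum(p * np.log2(p + 1e-12))         # entropy of p_{.|i}
            if abs(H - target) < tol:
                break
            if H > target:                              # too flat: narrow the Gaussian
                lo = beta
                beta = 2.0 * beta if np.isinf(hi) else 0.5 * (lo + hi)
            else:                                       # too peaked: widen it
                hi = beta
                beta = 0.5 * (lo + hi)
        P[i, np.arange(n) != i] = p
    return (P + P.T) / (2.0 * n)                        # p_ij = (p_{j|i} + p_{i|j}) / (2n)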
In the embedding space $\mathbb{R}^2$ or $\mathbb{R}^3$, define neighbour probabilities using a heavy-tailed Student-t with one degree of freedom (Cauchy):
$$ q_{ij} = \frac{(1+\lVert\mathbf{y}^{(i)}-\mathbf{y}^{(j)}\rVert^2)^{-1}}{\sum_{k\neq\ell}(1+\lVert\mathbf{y}^{(k)}-\mathbf{y}^{(\ell)}\rVert^2)^{-1}}. $$
Minimise the Kullback-Leibler divergence
$$ \mathrm{KL}(P\,\Vert\,Q) = \sum_{i\neq j}p_{ij}\log\frac{p_{ij}}{q_{ij}} $$
by gradient descent with respect to the map points $\mathbf{y}^{(i)}$.
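The embedding-side quantities are short to write down. A hedged sketch of the objective and its gradient (the gradient form $4\sum_j (p_{ij}-q_{ij})(1+\lVert\mathbf{y}^{(i)}-\mathbf{y}^{(j)}\rVert^2)^{-1}(\mathbf{y}^{(i)}-\mathbf{y}^{(j)})$ is from the original paper; the code is an illustrative $O(n^2)$ version, not a production routine):

import numpy as np

def kl_and_grad(P, Y):
    """KL(P || Q) and its gradient for map points Y (n x 2); O(n^2) sketch."""
    sq = np.sum(Y ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T    # squared map distances
    W = 1.0 / (1.0 + D)                              # Student-t (Cauchy) similarities
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                  # q_ij, normalised over all pairs
    mask = P > 0
    kl = np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], 1e-12)))
    M = (P - Q) * W                                  # (p_ij - q_ij) (1 + d_ij^2)^{-1}
    grad = 4.0 * (np.diag(M.sum(axis=1)) - M) @ Y    # rows are dKL / dy^(i)
    return kl, grad

A plain descent loop of the form Y -= eta * grad from a small random initial Y already reproduces the basic behaviour; real implementations add momentum, early exaggeration of $P$, and the Barnes-Hut approximation mentioned below.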
8.10.2 Why heavy tails
The Student-t in the low-dimensional space assigns pairs that are far apart in the map a much higher $q_{ij}$ than a Gaussian would, allowing dissimilar clusters to lie far apart without the optimisation pulling them back together. This correction of the crowding problem is t-SNE's central innovation over the earlier SNE.
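For instance, at a map distance of 5 a unit-variance Gaussian kernel gives an unnormalised similarity of $\exp(-25/2)\approx 3.7\times 10^{-6}$, whereas the Student-t kernel gives $(1+25)^{-1}\approx 3.8\times 10^{-2}$, roughly four orders of magnitude larger, so a mildly similar pair placed that far apart is not dragged back together by the $p_{ij}\log(p_{ij}/q_{ij})$ term.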
8.10.3 Pitfalls and hyperparameters
t-SNE plots are notoriously easy to over-interpret.
- Cluster sizes are not faithful. Tight clusters in the embedding can correspond to either tight or loose clusters in the data.
- Inter-cluster distances are not faithful. Two well-separated clusters in the embedding can be either far apart or moderately close in the data.
- Perplexity matters. Low perplexity emphasises micro-structure; high perplexity smooths it. Always run several perplexities (e.g. 5, 30, 100) and look for stable features.
- Random seed matters. Different initialisations give different layouts.
- Outliers can produce ghost clusters. A few isolated points may be artificially clustered by the algorithm's repulsive force balance.
The Distill article "How to use t-SNE effectively" (Wattenberg, Viégas and Johnson, 2016) shows several pathological examples.
In scikit-learn (with X an $n \times d$ data matrix):
from sklearn.manifold import TSNE
# PCA initialisation and a fixed seed make repeated runs easier to compare
emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
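Following the perplexity advice above, one might sweep the parameter and compare the resulting layouts; a small illustrative loop (the values and any plotting are left to the reader):
# Re-embed at several perplexities; trust only features that persist across them.
for perp in (5, 30, 100):
    emb_p = TSNE(n_components=2, perplexity=perp, init="pca",
                 random_state=0).fit_transform(X)
    # e.g. scatter-plot emb_p for each perplexity and compare cluster structure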
The Barnes-Hut approximation reduces the per-iteration cost of the gradient computation from $O(n^2)$ to $O(n\log n)$, at the price of an approximation parameter $\theta$ (default 0.5) that trades accuracy for speed.
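In scikit-learn the Barnes-Hut method is the default for 2-D and 3-D embeddings, with $\theta$ exposed as the angle parameter; a minimal illustration (same X as above, parameters shown are the defaults made explicit):
# Smaller angle (theta) is more exact but slower; larger is faster but coarser.
emb_bh = TSNE(n_components=2, perplexity=30, method="barnes_hut", angle=0.5,
              init="pca", random_state=0).fit_transform(X)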