Unsupervised Learning confronts one of the deepest challenges in machine learning: discovering meaningful structure in data without the guidance of labels. Whereas supervised algorithms are told what to predict, unsupervised methods must infer the hidden organisation of the feature space from the data alone—identifying clusters of similar points, finding low-dimensional manifolds that capture the essential variation, or flagging observations that deviate from the expected pattern.
The principal families of unsupervised methods are clustering (k-means, hierarchical clustering, DBSCAN), which partitions data into groups of similar items; dimensionality reduction (PCA, t-SNE, UMAP, autoencoders), which projects high-dimensional data onto lower-dimensional representations; and density estimation and anomaly detection, which model the distribution of normal data in order to flag outliers. Generative models such as variational autoencoders and GANs are also unsupervised in the sense that they learn from unlabelled examples.
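To make the clustering family concrete, here is a minimal sketch of k-means (Lloyd's algorithm) in plain Python — an illustrative toy, not a production implementation; real work would use a library such as scikit-learn. The function name and the example points are assumptions for illustration.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate an assignment step and an update step."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise centroids at k random points
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated blobs of three points each (hypothetical toy data).
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.1)]
centroids, clusters = kmeans(pts, k=2)
```

Note that no labels are supplied anywhere: the algorithm discovers the two groups purely from the geometry of the points, which is the defining trait of the clustering family described above.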
Unsupervised learning is particularly valuable because unlabelled data is abundant and cheap—every web page, every image on a server, every sensor reading—while labelled data is scarce and expensive. The rise of self-supervised learning in large language models and vision transformers has blurred the line between the supervised and unsupervised regimes: these systems generate their own supervisory signals from the structure of the data itself, unlocking the ability to learn rich representations from enormous unlabelled corpora.
Related terms: Supervised Learning, Principal Component Analysis, Autoencoder, K-Means Clustering
Discussed in:
- Chapter 8: Unsupervised Learning — 8.1 K-Means Clustering
Also defined in: Textbook of AI, Textbook of Medical AI