The SVD writes any matrix as a sum of rank-one terms; keep only the top singular values and you get the best low-rank approximation.
From the chapter: Chapter 2: Linear Algebra
Glossary: singular value decomposition, low rank approximation
Transcript
The singular value decomposition writes any matrix M as U times Sigma times V transposed.
U and V are orthogonal. Sigma is diagonal, with non-negative singular values, ordered largest first.
Each singular value scales a rank-one piece: a column of U times a row of V transposed. The matrix is a sum of these pieces, weighted by the singular values.
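The rank-one decomposition above can be checked numerically. A minimal sketch with NumPy, using a small random matrix as a stand-in example: each piece is a singular value times the outer product of the matching column of U and row of V transposed, and their sum recovers M exactly.

```python
import numpy as np

# A small example matrix (random, for illustration only).
rng = np.random.default_rng(0)
M = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Rebuild M as a sum of rank-one pieces: sigma_i * u_i * v_i^T.
pieces = [s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s))]
reconstruction = sum(pieces)

assert np.allclose(reconstruction, M)  # the sum of rank-one pieces recovers M
```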
Big singular values capture most of the matrix's structure. Small ones capture noise or fine detail.
Truncate: keep only the top k singular values and set the rest to zero. By the Eckart–Young theorem, the result is the best rank-k approximation of M in the least-squares (Frobenius norm) sense.
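Truncation is a few lines of NumPy. This sketch (again on a random stand-in matrix) keeps the top k singular values and checks a known identity: the squared Frobenius error of the rank-k approximation equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((8, 6))  # stand-in matrix for illustration
k = 2

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Zero out all but the top-k singular values and recompose.
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

assert np.linalg.matrix_rank(M_k) == k

# Squared Frobenius error equals the sum of squared discarded singular values.
err = np.linalg.norm(M - M_k, "fro")
assert np.isclose(err**2, np.sum(s[k:] ** 2))
```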
For a typical image, keeping around fifty singular values captures most of what your eye sees. The same holds for a word embedding matrix, a recommendation matrix, or the activations inside a transformer.
Low-rank approximation is everywhere. PCA is SVD on a centred data matrix. LoRA fine-tunes large language models with rank-eight or rank-sixteen updates. Compression, denoising, and feature extraction all rely on the same principle: most of the signal lives in the top few singular vectors.
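The PCA connection is easy to verify. A sketch, assuming rows are samples and columns are features: center the data, take the SVD, and the squared singular values (scaled by n minus one) match the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))  # rows are samples, columns are features

# PCA via SVD: center the data, then decompose.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; singular values give the variances.
explained_variance = s**2 / (len(X) - 1)

# Cross-check against the eigenvalues of the sample covariance matrix.
cov_eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
assert np.allclose(explained_variance, cov_eigvals)
```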