2.7 The singular value decomposition

If you were allowed only one tool from linear algebra to take into the world of artificial intelligence, the singular value decomposition would be the right choice. Almost every other technique we shall meet in this book either uses the SVD directly, leans on a result that the SVD makes obvious, or is a special case of something the SVD does in full generality. It compresses images, finds the dominant patterns in a dataset, powers recommendation engines, underpins principal component analysis, makes least-squares regression numerically safe, and lies at the heart of the modern parameter-efficient fine-tuning method LoRA that is used to adapt enormous language models on consumer hardware. The SVD is not a clever trick reserved for special matrices: it works for absolutely any matrix, square or rectangular, full-rank or rank-deficient, well-conditioned or pathological. That universality is what makes it so useful.

A matrix is a recipe for turning one list of numbers into another by mixing the inputs with weights. The SVD reads that recipe and tells you the matrix's intrinsic geometry: which directions in the input matter most, which directions in the output they get pushed into, and how much stretching happens along the way. Once you know the geometry, you can throw away the directions that contribute almost nothing and keep a far smaller approximation that captures the essential behaviour. "Keep the big stretches, discard the small ones" turns out to be one of the most fruitful ideas in computational science.

A second motivation: many of the matrices that turn up in modern machine learning are gigantic but secretly simple. A user-by-film ratings table from a streaming service may have a hundred million rows and ten thousand columns, yet most users can be described by a handful of taste vectors and most films by a handful of genre vectors. A weight matrix in a fine-tuned language model nominally has hundreds of millions of parameters, yet the change induced by fine-tuning on a specific task is concentrated in a handful of directions. The SVD detects and exploits this hidden simplicity, exposing the small number of directions that do all the work.

The SVD generalises the eigendecomposition of §2.6 in two ways. First, it works for rectangular matrices. Second, it works even when the eigendecomposition would fail because the matrix lacks enough eigenvectors. §2.8 builds principal component analysis on top of it.

Symbols Used Here
  • $\mathbf{A}$: matrix in $\mathbb{R}^{m \times n}$
  • $\mathbf{U}$: left singular vectors, an orthogonal matrix in $\mathbb{R}^{m \times m}$
  • $\boldsymbol{\Sigma}$: diagonal matrix in $\mathbb{R}^{m \times n}$ with non-negative singular values
  • $\mathbf{V}$: right singular vectors, an orthogonal matrix in $\mathbb{R}^{n \times n}$
  • $\sigma_i$: singular values, $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$
  • $r$: rank of $\mathbf{A}$
  • $\mathbf{u}_i$, $\mathbf{v}_i$: columns of $\mathbf{U}$ and $\mathbf{V}$

The decomposition

Every real matrix $\mathbf{A}$ of size $m \times n$ can be written as a product of three pieces:

$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top.$$

Read this aloud as "A equals U sigma V transpose". The matrix $\mathbf{U}$ is square of size $m \times m$ and orthogonal, meaning its columns are unit-length and mutually perpendicular, so $\mathbf{U}^\top\mathbf{U} = \mathbf{I}$. The matrix $\mathbf{V}$ is square of size $n \times n$ and likewise orthogonal. The matrix $\boldsymbol{\Sigma}$ in the middle is rectangular, the same shape as $\mathbf{A}$, and is zero everywhere except on its main diagonal, where it carries non-negative numbers $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ called the singular values. By convention they are sorted in descending order, so $\sigma_1$ is the largest. The number of strictly positive singular values equals the rank $r$ of $\mathbf{A}$, that is, the number of linearly independent columns it contains.

It is helpful to write the same fact a second way, as a sum of simple ingredients:

$$\mathbf{A} = \sum_{i=1}^r \sigma_i \mathbf{u}_i \mathbf{v}_i^\top.$$

Each term $\mathbf{u}_i \mathbf{v}_i^\top$ is an outer product, a matrix of rank one whose columns are all multiples of $\mathbf{u}_i$ and whose rows are all multiples of $\mathbf{v}_i^\top$. The singular value $\sigma_i$ is the weight that says how much of this rank-one ingredient is present in $\mathbf{A}$. So an arbitrary matrix is a weighted sum of $r$ rank-one matrices, and the weights are the singular values, listed in order of importance. This rank-one expansion is the key to low-rank approximation, which we shall meet shortly.
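
If you want to see these two identities numerically, a minimal sketch with NumPy follows; the matrix, shapes, and tolerances are arbitrary choices for illustration.

```python
import numpy as np

# An arbitrary rectangular matrix, chosen only for illustration.
A = np.array([[3.0, 0.0, 2.0],
              [4.0, 5.0, 1.0]])
m, n = A.shape

# full_matrices=True returns U (m x m), the singular values s, and V^T (n x n).
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the rectangular Sigma and check A = U Sigma V^T.
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))        # True

# Check the rank-one expansion: A = sum_i sigma_i * u_i v_i^T.
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(len(s)))
print(np.allclose(A, A_sum))                 # True
```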

The geometric reading is the cleanest picture in linear algebra and worth committing to memory. Any linear map $\mathbf{x} \mapsto \mathbf{A}\mathbf{x}$ can be carried out in three stages. First, $\mathbf{V}^\top$ rotates (or reflects) the input in $\mathbb{R}^n$ without stretching it. Second, $\boldsymbol{\Sigma}$ scales each axis by a non-negative factor, possibly squashing some axes flat to zero, and possibly producing an output of a different dimension. Third, $\mathbf{U}$ rotates the result inside $\mathbb{R}^m$ to its final position. If you draw the unit ball in the input space and watch what $\mathbf{A}$ does to it, the ball comes out as an ellipsoid in the output space whose semi-axis lengths are the singular values $\sigma_i$ and whose semi-axis directions are the left singular vectors $\mathbf{u}_i$. There is no cleverer description: every linear map is rotation, then per-axis stretch, then rotation, full stop.

Worked example: a $2 \times 2$ matrix

Concrete numbers help. Take

$$\mathbf{A} = \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix}.$$

Our plan is to find the SVD by hand. The recipe is to compute $\mathbf{A}^\top\mathbf{A}$, find its eigenvalues and eigenvectors, take square roots of the eigenvalues to get the singular values, use the eigenvectors as the columns of $\mathbf{V}$, and finally recover $\mathbf{U}$ from the formula $\mathbf{u}_i = \mathbf{A}\mathbf{v}_i / \sigma_i$.

Step one is the matrix product

$$\mathbf{A}^\top\mathbf{A} = \begin{pmatrix} 3 & 4 \\ 0 & 5 \end{pmatrix}\begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix} = \begin{pmatrix} 25 & 20 \\ 20 & 25 \end{pmatrix}.$$

Notice that this matrix is symmetric: it equals its own transpose. That is no accident: $\mathbf{A}^\top\mathbf{A}$ is symmetric for any rectangular $\mathbf{A}$, because $(\mathbf{A}^\top\mathbf{A})^\top = \mathbf{A}^\top(\mathbf{A}^\top)^\top = \mathbf{A}^\top\mathbf{A}$. Symmetric matrices are exactly the ones with real eigenvalues and a full orthogonal set of eigenvectors, which is why this construction works.

Step two is to find the eigenvalues. The characteristic equation is

$$\det\!\begin{pmatrix} 25 - \lambda & 20 \\ 20 & 25 - \lambda \end{pmatrix} = (25 - \lambda)^2 - 400 = 0,$$

so $25 - \lambda = \pm 20$, giving $\lambda_1 = 45$ and $\lambda_2 = 5$. Taking square roots gives the singular values:

$$\sigma_1 = \sqrt{45} = 3\sqrt{5} \approx 6.708, \qquad \sigma_2 = \sqrt{5} \approx 2.236.$$

These two numbers tell you, in advance of doing anything else, how much $\mathbf{A}$ stretches along its two principal directions. The largest possible stretch any unit vector can suffer when multiplied by $\mathbf{A}$ is $\sigma_1 \approx 6.708$, and the smallest is $\sigma_2 \approx 2.236$.

Step three is to find the eigenvectors of $\mathbf{A}^\top\mathbf{A}$, which become the columns of $\mathbf{V}$. For $\lambda_1 = 45$ we solve $(25 - 45)v_1 + 20 v_2 = 0$, that is $-20 v_1 + 20 v_2 = 0$, so $v_1 = v_2$. Normalising to unit length gives

$$\mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$

For $\lambda_2 = 5$ we solve $20 v_1 + 20 v_2 = 0$, so $v_1 = -v_2$, giving

$$\mathbf{v}_2 = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}.$$

Step four is to recover the left singular vectors using $\mathbf{u}_i = \mathbf{A}\mathbf{v}_i / \sigma_i$. The first calculation is

$$\mathbf{A}\mathbf{v}_1 = \frac{1}{\sqrt{2}}\begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{1}{\sqrt{2}}\begin{pmatrix} 3 \\ 9 \end{pmatrix},$$

and dividing by $\sigma_1 = 3\sqrt{5}$ gives

$$\mathbf{u}_1 = \frac{1}{3\sqrt{5}\cdot\sqrt{2}}\begin{pmatrix} 3 \\ 9 \end{pmatrix} = \frac{1}{\sqrt{10}}\begin{pmatrix} 1 \\ 3 \end{pmatrix}.$$

Similarly $\mathbf{A}\mathbf{v}_2 = \tfrac{1}{\sqrt{2}}(3, -1)^\top$, and dividing by $\sigma_2 = \sqrt{5}$ yields

$$\mathbf{u}_2 = \frac{1}{\sqrt{10}}\begin{pmatrix} 3 \\ -1 \end{pmatrix}.$$

You can check that $\mathbf{u}_1$ and $\mathbf{u}_2$ are unit length and perpendicular, that $\mathbf{v}_1$ and $\mathbf{v}_2$ are also unit length and perpendicular, and that

$$\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \frac{1}{\sqrt{10}}\begin{pmatrix} 1 & 3 \\ 3 & -1 \end{pmatrix}\begin{pmatrix} 3\sqrt{5} & 0 \\ 0 & \sqrt{5} \end{pmatrix}\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$

multiplies out to recover the original $\mathbf{A}$. Doing this verification once with pencil and paper is a rite of passage.
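
If pencil and paper are not to hand, the same verification takes a few lines of NumPy. A library is free to flip the sign of any singular-vector pair, so compare with the hand calculation up to sign; everything here is just a numerical check of the example above.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

U, s, Vt = np.linalg.svd(A)

print(s)                                   # approx [6.7082, 2.2361]
print(3 * np.sqrt(5), np.sqrt(5))          # the hand-computed values 3*sqrt(5), sqrt(5)

# The largest stretch of any unit vector is sigma_1, i.e. the spectral norm of A.
print(np.linalg.norm(A, 2))                # approx 6.7082

# Reconstruction: U diag(s) V^T recovers A.
print(np.allclose(A, U @ np.diag(s) @ Vt))     # True

# Compare with the hand-computed u_1 up to an overall sign.
u1_hand = np.array([1.0, 3.0]) / np.sqrt(10)
print(np.allclose(np.abs(U[:, 0]), np.abs(u1_hand)))   # True
```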

Connection to eigendecomposition

The recipe we just used is not a coincidence. It follows from a short calculation. Substitute the SVD into $\mathbf{A}^\top\mathbf{A}$:

$$\mathbf{A}^\top\mathbf{A} = (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)^\top(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top) = \mathbf{V}\boldsymbol{\Sigma}^\top\mathbf{U}^\top\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{V}(\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma})\mathbf{V}^\top.$$

The middle factor $\boldsymbol{\Sigma}^\top\boldsymbol{\Sigma}$ is a diagonal matrix whose entries are $\sigma_i^2$. So this last expression is exactly an eigendecomposition of $\mathbf{A}^\top\mathbf{A}$: the columns of $\mathbf{V}$ are its eigenvectors, and the eigenvalues are the squared singular values.

By the same manoeuvre applied to $\mathbf{A}\mathbf{A}^\top$, we get $\mathbf{A}\mathbf{A}^\top = \mathbf{U}(\boldsymbol{\Sigma}\boldsymbol{\Sigma}^\top)\mathbf{U}^\top$, so the columns of $\mathbf{U}$ are eigenvectors of $\mathbf{A}\mathbf{A}^\top$ with the same squared singular values as eigenvalues. Three statements are therefore equivalent, and each illuminates the others:

  • The singular values of $\mathbf{A}$ are the (non-negative) square roots of the eigenvalues of $\mathbf{A}^\top\mathbf{A}$, equivalently of $\mathbf{A}\mathbf{A}^\top$.
  • The right singular vectors of $\mathbf{A}$ are the eigenvectors of $\mathbf{A}^\top\mathbf{A}$.
  • The left singular vectors of $\mathbf{A}$ are the eigenvectors of $\mathbf{A}\mathbf{A}^\top$.

In short, computing the SVD of any matrix reduces to performing the eigendecomposition of one of two related symmetric, positive semi-definite matrices. In practice you would not compute the SVD this way on a computer, because forming $\mathbf{A}^\top\mathbf{A}$ squares the condition number and loses precision, but as a way to understand what the SVD is, the link is illuminating: it says that the SVD is an eigendecomposition done carefully enough to work for rectangular matrices.
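
As a sanity check of the link, here is a small sketch that follows the hand recipe numerically: eigendecompose $\mathbf{A}^\top\mathbf{A}$, take square roots, and recover $\mathbf{U}$ from $\mathbf{u}_i = \mathbf{A}\mathbf{v}_i/\sigma_i$. It assumes all singular values are non-zero and is meant for illustration only, not as a substitute for a library SVD routine.

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Eigendecomposition of the symmetric matrix A^T A (eigh returns ascending eigenvalues).
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]              # re-sort in descending order
eigvals, V = eigvals[order], V[:, order]

s = np.sqrt(eigvals)                           # singular values are the square roots
U = (A @ V) / s                                # column i is A v_i / sigma_i

print(s)                                       # approx [6.7082, 2.2361]
print(np.allclose(A, U @ np.diag(s) @ V.T))    # True
print(np.allclose(U.T @ U, np.eye(2)))         # columns of U are orthonormal
```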

Low-rank approximation: the Eckart–Young theorem

Now comes the result that turns the SVD from a piece of mathematical elegance into one of the most useful ideas in applied science. Suppose you have a large matrix $\mathbf{A}$ and you want a smaller, simpler matrix $\mathbf{B}$ of rank at most $k$ that approximates $\mathbf{A}$ as closely as possible. There are infinitely many candidate $\mathbf{B}$s. Which one is best?

The Eckart–Young theorem gives a clean answer. Take the SVD of $\mathbf{A}$ and keep only the largest $k$ singular values, throwing the rest away. That is, build

$$\mathbf{A}_k = \sum_{i=1}^k \sigma_i \mathbf{u}_i \mathbf{v}_i^\top.$$

Then $\mathbf{A}_k$ is the best rank-$k$ approximation to $\mathbf{A}$ in both the Frobenius norm (sum of squared entries) and the spectral norm (largest singular value). Concretely:

$$\|\mathbf{A} - \mathbf{A}_k\|_F^2 = \sum_{i=k+1}^r \sigma_i^2, \qquad \|\mathbf{A} - \mathbf{A}_k\|_2 = \sigma_{k+1},$$

and no other rank-$k$ matrix can do better.

The practical consequence is enormous. Whenever the singular values of $\mathbf{A}$ decay quickly, that is, $\sigma_1$ is much larger than $\sigma_{10}$, which is much larger than $\sigma_{100}$, most of the matrix's "energy" lives in the first handful of components, and you can throw the rest away with negligible loss. Imagine a $100 \times 100$ matrix whose first ten singular values are 100, 90, 80, …, 10 and whose remaining ninety singular values are all about 0.01. Storing $\mathbf{A}$ raw costs ten thousand numbers. The rank-10 approximation costs $10 \times (100 + 100 + 1) = 2{,}010$ numbers, a fivefold saving, and the squared Frobenius error is $90 \times 0.01^2 = 0.009$, while $\|\mathbf{A}\|_F^2$ is roughly $38{,}500$. The relative error is about one part in four million.
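
The arithmetic in that example is easy to reproduce. Below is a minimal sketch that manufactures a $100 \times 100$ matrix with exactly that singular-value profile (the random orthogonal factors are arbitrary) and checks the Eckart–Young error formulas against a rank-10 truncation.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 100

# Prescribed singular values: 100, 90, ..., 10, then ninety values of 0.01.
sigma = np.concatenate([np.arange(100.0, 0.0, -10.0), np.full(90, 0.01)])

# Random orthogonal factors (QR of a Gaussian matrix gives an orthogonal Q).
Q1, _ = np.linalg.qr(rng.standard_normal((m, m)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q1 @ np.diag(sigma) @ Q2.T

# Rank-10 truncation: keep only the ten largest singular values.
k = 10
U, s, Vt = np.linalg.svd(A)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: squared Frobenius error equals the sum of discarded sigma_i^2.
print(np.linalg.norm(A - A_k, 'fro')**2)   # approx 0.009
print(np.sum(s[k:]**2))                    # approx 0.009
print(np.linalg.norm(A, 'fro')**2)         # approx 38500
print(np.linalg.norm(A - A_k, 2))          # approx 0.01, the spectral-norm error sigma_{k+1}
```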

Because most matrices that arise from real-world data (photographs, term–document counts, user–item rating tables, gene-expression profiles, audio spectrograms) have this rapid decay, truncated SVD is a one-line recipe for compression, denoising, and feature extraction. In a photograph, the first few singular values capture overall light and shadow, the next few capture broad shapes, and only the long tail encodes fine texture and noise. Drop the tail and you keep what matters.

Pseudo-inverse

For a square invertible matrix, $\mathbf{A}^{-1}$ undoes the action of $\mathbf{A}$. For rectangular or singular matrices, no exact inverse exists, but the SVD gives us the next best thing, the Moore–Penrose pseudo-inverse

$$\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^\top,$$

where $\boldsymbol{\Sigma}^+$ is built from $\boldsymbol{\Sigma}$ by transposing it and replacing each non-zero $\sigma_i$ on the diagonal with $1/\sigma_i$ while leaving the zeros alone. The pseudo-inverse coincides with the ordinary inverse when $\mathbf{A}$ is square and invertible, and reduces to the familiar formulas $\mathbf{A}^+ = (\mathbf{A}^\top\mathbf{A})^{-1}\mathbf{A}^\top$ for tall full-rank matrices and $\mathbf{A}^+ = \mathbf{A}^\top(\mathbf{A}\mathbf{A}^\top)^{-1}$ for wide full-rank matrices.

The pseudo-inverse solves the least-squares problem: among all $\mathbf{x}$ that minimise $\|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2$, the unique one of smallest norm is

$$\mathbf{x} = \mathbf{A}^+\mathbf{b}.$$

This formulation is far more numerically stable than forming the normal equations $\mathbf{A}^\top\mathbf{A}\mathbf{x} = \mathbf{A}^\top\mathbf{b}$ and solving directly, because the condition number of $\mathbf{A}^\top\mathbf{A}$ is the square of the condition number of $\mathbf{A}$, so squaring is a recipe for losing precision when $\mathbf{A}$ is poorly conditioned. The SVD-based route keeps everything at the precision of $\mathbf{A}$ itself. Whenever a numerical library asks you to solve an over-determined or rank-deficient system and offers you a "least-squares" routine, what it usually does behind the scenes is exactly an SVD followed by application of $\boldsymbol{\Sigma}^+$.
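
A minimal sketch of the SVD route to least squares, with random data used purely for illustration. NumPy's `pinv` and `lstsq` are themselves SVD-based, so the three answers below agree; the normal-equations route only starts to lose accuracy when $\mathbf{A}$ is badly conditioned.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))        # tall system: 50 equations, 3 unknowns
b = rng.standard_normal(50)

# Build the pseudo-inverse directly from the (reduced) SVD.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T

x_svd    = A_pinv @ b                               # minimum-norm least-squares solution
x_lstsq  = np.linalg.lstsq(A, b, rcond=None)[0]     # library least-squares routine
x_normal = np.linalg.solve(A.T @ A, A.T @ b)        # normal equations

print(np.allclose(x_svd, x_lstsq))    # True
print(np.allclose(x_svd, x_normal))   # True here; degrades for ill-conditioned A
```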

Where SVD appears in AI

The SVD turns up in so many corners of artificial intelligence that one almost suspects the field of having been arranged around it. Some prominent appearances:

  • Principal component analysis (the subject of §2.8). You centre your data matrix by subtracting the mean of each feature, take its SVD, and the right singular vectors are the principal components, the directions along which the data varies most. PCA is the simplest and most widely used dimensionality reduction technique, and it is one SVD away from a raw spreadsheet.
  • Latent semantic indexing. Build a matrix whose rows are documents, whose columns are vocabulary terms, and whose entries count how often each term appears in each document. A truncated SVD reveals a small number of "semantic dimensions" along which documents and terms align, so that documents with no words in common but similar meanings end up close together. This was a precursor to modern dense embeddings, and the same arithmetic still underpins many information-retrieval systems.
  • Recommendation systems. Build a user-by-item matrix whose entries are ratings, mostly missing. A low-rank SVD-style factorisation imputes the missing entries by assuming that user preferences and item attributes both live on a low-dimensional manifold. The Netflix Prize was won using exactly this idea.
  • LoRA (Low-Rank Adaptation). When fine-tuning a language model with billions of parameters, updating every weight matrix is costly. LoRA fixes the original weights $\mathbf{W}$ in place and adds a low-rank correction $\mathbf{A}\mathbf{B}$, where $\mathbf{A}$ is tall and thin and $\mathbf{B}$ is short and fat (these are new matrices, not the $\mathbf{A}$ of this section), so the product has rank at most a small number such as 8 or 16. This reduces the number of trainable parameters by factors of hundreds or thousands while preserving most of the adaptation quality; a minimal sketch follows this list. The motivation comes straight from the Eckart–Young theorem: the change in weights during fine-tuning is empirically low-rank, so a low-rank parameterisation suffices.
  • Spectral analysis of weight matrices. Researchers diagnose neural-network pathologies such as rank collapse, dead neurons, and exploding sensitivity by inspecting the singular value spectra of the weight matrices. Techniques such as spectral normalisation constrain the largest singular value of a layer to control its Lipschitz constant, which stabilises training of generative adversarial networks.
  • Image and audio compression. Truncated SVD on an image matrix gives drastic storage savings for very small visual loss: keeping rank 50 on a $1024 \times 1024$ photograph typically reproduces the perceptually salient structure at roughly a tenth of the raw pixel storage. The same idea applied to weight pruning explains why over-parameterised neural networks can be shrunk dramatically with little accuracy loss.
  • Word embeddings and language models. The classical word-embedding methods of the 2010s, such as GloVe and word2vec, can be viewed as low-rank factorisations of transformed word co-occurrence matrices. Even modern transformer attention can be analysed through the singular value spectra of its query, key and value projections, a perspective that has guided architectural improvements such as low-rank attention approximations.
  • Solving inverse problems and signal recovery. Whenever a measurement model takes the form $\mathbf{y} = \mathbf{A}\mathbf{x} + \boldsymbol{\varepsilon}$ with noise $\boldsymbol{\varepsilon}$ and a poorly conditioned $\mathbf{A}$, naive inversion amplifies noise along directions where the singular values are small. Truncated SVD or Tikhonov regularisation, both built on the singular value spectrum, give stable estimates by suppressing the unreliable directions. This shows up in medical imaging reconstruction, astronomy, and seismology.
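
Here is the LoRA sketch promised above: a hypothetical linear layer keeps its frozen weight matrix and learns only a rank-$r$ correction. The layer size, rank, and initialisation are illustrative choices, not the configuration of any particular model, and no training loop is shown.

```python
import numpy as np

rng = np.random.default_rng(2)

d_out, d_in, r = 1024, 1024, 8                # illustrative layer size and LoRA rank

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight, never updated
A = np.zeros((d_out, r))                      # tall-and-thin factor, starts at zero
B = rng.standard_normal((r, d_in)) * 0.01     # short-and-fat factor

def forward(x):
    # The adapted layer computes (W + A B) x; only A and B would be trained.
    return W @ x + A @ (B @ x)

x = rng.standard_normal(d_in)
print(forward(x).shape)                       # (1024,)

full_params = d_out * d_in                    # parameters in W
lora_params = d_out * r + r * d_in            # parameters in A and B
print(full_params // lora_params)             # 64: the reduction factor at this rank
```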

What you should take away

  1. Every real matrix has an SVD $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, with $\mathbf{U}$ and $\mathbf{V}$ orthogonal and $\boldsymbol{\Sigma}$ diagonal with non-negative entries, no exceptions.
  2. Geometrically, every linear map is rotation, then per-axis stretch by the singular values, then rotation: that is the whole story.
  3. The SVD reduces to the eigendecomposition of $\mathbf{A}^\top\mathbf{A}$ (eigenvalues are $\sigma_i^2$, eigenvectors are columns of $\mathbf{V}$), which is how we can compute it by hand on small matrices.
  4. The Eckart–Young theorem says that truncating to the top $k$ singular values gives the best rank-$k$ approximation (unique whenever $\sigma_k > \sigma_{k+1}$); this single fact powers PCA, image compression, latent semantic indexing, recommender systems, and LoRA.
  5. The pseudo-inverse $\mathbf{A}^+ = \mathbf{V}\boldsymbol{\Sigma}^+\mathbf{U}^\top$ generalises matrix inversion to any matrix and gives the numerically stable minimum-norm least-squares solution to $\mathbf{A}\mathbf{x} = \mathbf{b}$.
