2.6 Eigenvalues and eigenvectors
Most of what a matrix does to a vector is a tangled mixture of stretching, rotating and shearing. Run a matrix on a thousand random vectors and you will see a thousand different outcomes: the inputs go in pointing in every direction, and the outputs come back twisted into something new. It is hard, looking at this fog of arrows, to say what the matrix really is. Yet inside the fog, every square matrix hides a small set of privileged directions on which it acts in the simplest way imaginable: it stretches them in place. It does not rotate them. It does not shear them. It just multiplies them by a number. These privileged directions are called eigenvectors, and the numbers by which they are stretched are called eigenvalues. Together they are the deep grammar of the matrix.
Imagine the matrix as a machine that takes an arrow as input and gives an arrow as output. For most arrows, the output points somewhere unrelated to the input. But there is a handful of arrows, usually no more than the dimension of the space, for which the output points along exactly the same line as the input. The machine has merely lengthened the arrow, or shortened it, or possibly flipped it back the other way. Those arrows are eigenvectors; the number that records "by how much" is the eigenvalue. An eigenvalue of two means "stretch to double length"; one half means "shrink by half"; minus one means "flip"; zero means "collapse to the origin".
Once you find a matrix's eigenvectors, the matrix becomes transparent. In ordinary coordinates a matrix mixes everything with everything else. In eigen-coordinates, coordinates aligned with the eigenvectors, the matrix becomes a list of independent multiplications, one per direction. Complicated linear algebra reduces to scalar arithmetic. This is why eigendecomposition appears in nearly every part of modern AI: in the curvature analysis of loss landscapes, in principal component analysis, in spectral clustering, in stability analysis of recurrent networks, in PageRank, and in the singular value decomposition of §2.7.
Eigenpairs are a richer invariant than the determinant and trace of §2.5. Instead of compressing the matrix to one number, they expose the full set of scaling factors and the directions on which they act. §2.7 generalises the idea to non-square matrices via the SVD.
Definition
The formal definition is compact. An eigenvector of a square matrix $\mathbf{A}$ is a non-zero vector $\mathbf{v}$ such that $$\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$$ for some scalar $\lambda$, called the eigenvalue belonging to $\mathbf{v}$. The whole subject grows from this one equation. On the left is matrix-vector multiplication, the most general linear operation we have. On the right is mere scalar multiplication. For this particular pair $(\mathbf{v}, \lambda)$, the general operation collapses into the simple one: the matrix's effect on $\mathbf{v}$ is to multiply it by a number.
Three small but important points. First, the requirement that $\mathbf{v}$ be non-zero is essential: the zero vector trivially satisfies $\mathbf{A}\mathbf{0} = \lambda\mathbf{0}$ for every $\lambda$, so admitting it would make every scalar an eigenvalue of every matrix. Second, an eigenvector is really a direction, not a single arrow: if $\mathbf{v}$ is an eigenvector with eigenvalue $\lambda$, then so is any non-zero scalar multiple $c\mathbf{v}$, because $\mathbf{A}(c\mathbf{v}) = c\mathbf{A}\mathbf{v} = c\lambda\mathbf{v} = \lambda(c\mathbf{v})$. We usually pick a representative of unit length. Third, $\lambda$ itself can be any real or complex number, including zero.
A diagonal matrix shows the picture in its purest form. Take $$\mathbf{A} = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}.$$ Apply it to the vector $(1, 0)^\top$, the unit vector along the horizontal axis. The result is $\mathbf{A}(1, 0)^\top = (2, 0)^\top$, which is the same direction stretched to double length, i.e. $2 \cdot (1, 0)^\top$. So $(1, 0)^\top$ is an eigenvector with eigenvalue $\lambda = 2$. Similarly $(0, 1)^\top$ goes to $(0, 3)^\top$, telling us that the vertical axis is an eigenvector with eigenvalue $3$. The whole matrix is laid bare: it stretches by two horizontally, by three vertically, and does nothing else.
Geometrically, you can think of every matrix as a deformation of the plane (or of higher-dimensional space). It picks the unit circle up, distorts it, and puts it down as some ellipse. The eigenvectors are the directions along which the circle is not turned, the directions whose images still lie along the same line through the origin. The eigenvalues are the factors by which those directions are scaled. For our diagonal matrix, the unit circle becomes an axis-aligned ellipse with semi-axes of length two and three; the axes themselves are the eigenvectors.
Computing eigenpairs
For a non-diagonal matrix the eigenvectors are not handed to us on a plate; we must work them out. The trick is to rewrite the defining equation in a form that algebra can solve. Start from $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ and move everything to one side, remembering that to subtract $\lambda\mathbf{v}$ we need to insert the identity matrix: $$\mathbf{A}\mathbf{v} - \lambda\mathbf{v} = (\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}.$$ We are now looking for a non-zero vector that the matrix $(\mathbf{A} - \lambda\mathbf{I})$ sends to the origin. From §2.5 we know exactly when such a vector exists: when the matrix is singular, that is, when its determinant is zero. So the eigenvalues are the values of $\lambda$ that satisfy $$\det(\mathbf{A} - \lambda\mathbf{I}) = 0.$$ This expression, viewed as a function of $\lambda$, is the characteristic polynomial of $\mathbf{A}$. It has degree $n$ (the size of the matrix) and therefore has exactly $n$ roots over the complex numbers, counted with multiplicity. Those roots are the eigenvalues.
Once we have an eigenvalue $\lambda$, the corresponding eigenvectors are the non-zero solutions of $(\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}$, a system of homogeneous linear equations. The set of all solutions of this system, the zero vector included, is called the eigenspace of $\lambda$; the eigenvectors are its non-zero members.
Let us walk through a worked example carefully. Take the matrix $$\mathbf{A} = \begin{pmatrix} 4 & 1 \\ 2 & 3 \end{pmatrix}.$$ Subtracting $\lambda\mathbf{I}$ replaces the diagonal entries with $4 - \lambda$ and $3 - \lambda$ while leaving the off-diagonal entries untouched: $$\mathbf{A} - \lambda\mathbf{I} = \begin{pmatrix} 4 - \lambda & 1 \\ 2 & 3 - \lambda \end{pmatrix}.$$ For a $2 \times 2$ matrix the determinant is the top-left times bottom-right minus top-right times bottom-left: $$\det(\mathbf{A} - \lambda\mathbf{I}) = (4 - \lambda)(3 - \lambda) - (1)(2) = 12 - 4\lambda - 3\lambda + \lambda^2 - 2 = \lambda^2 - 7\lambda + 10.$$ This is the characteristic polynomial. It factorises neatly: $$\lambda^2 - 7\lambda + 10 = (\lambda - 5)(\lambda - 2).$$ The eigenvalues are $\lambda_1 = 5$ and $\lambda_2 = 2$.
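As a quick numerical cross-check (a minimal NumPy sketch, not how eigenvalues are computed at scale): `np.poly` returns the coefficients of the monic characteristic polynomial of a square matrix, in the $\det(\lambda\mathbf{I} - \mathbf{A})$ convention, and `np.roots` finds its roots.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Coefficients of the monic characteristic polynomial det(lambda*I - A):
# [1, -7, 10] stands for lambda^2 - 7*lambda + 10.
coeffs = np.poly(A)
print(coeffs)            # [ 1. -7. 10.]

# Its roots are the eigenvalues.
print(np.roots(coeffs))  # [5. 2.] (up to ordering)
```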
To find the eigenvector associated with $\lambda_1 = 5$, substitute back and solve the homogeneous system: $$(\mathbf{A} - 5\mathbf{I})\mathbf{v} = \begin{pmatrix} -1 & 1 \\ 2 & -2 \end{pmatrix}\mathbf{v} = \mathbf{0}.$$ The first row says $-v_1 + v_2 = 0$, i.e. $v_2 = v_1$. The second row says $2v_1 - 2v_2 = 0$, the same equation in disguise, exactly as we should expect, because the matrix is singular by construction. Any vector with equal coordinates works; the simplest representative is $\mathbf{v}_1 = (1, 1)^\top$.
For $\lambda_2 = 2$: $$(\mathbf{A} - 2\mathbf{I})\mathbf{v} = \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix}\mathbf{v} = \mathbf{0}.$$ Both rows give $2v_1 + v_2 = 0$, so $v_2 = -2v_1$. A convenient representative is $\mathbf{v}_2 = (1, -2)^\top$.
It is always worth verifying. Compute $\mathbf{A}\mathbf{v}_1$ directly: $$\begin{pmatrix} 4 & 1 \\ 2 & 3 \end{pmatrix}\begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 4 + 1 \\ 2 + 3 \end{pmatrix} = \begin{pmatrix} 5 \\ 5 \end{pmatrix} = 5 \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \checkmark$$ And likewise $\mathbf{A}\mathbf{v}_2 = (4 - 2, 2 - 6)^\top = (2, -4)^\top = 2 \cdot (1, -2)^\top$. The eigenpairs check out.
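The same verification in NumPy, as a minimal sketch: `np.linalg.eig` returns the eigenvalues together with a matrix whose columns are unit-length eigenvectors, so our hand-picked representatives appear rescaled (and possibly sign-flipped).

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # [5. 2.] (the order is not guaranteed)

# Each column of `eigenvectors` is a unit-length eigenvector:
# (1, 1) appears as roughly (0.707, 0.707) and (1, -2) as roughly
# (0.447, -0.894), up to sign.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
```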
The recipe of writing down the characteristic polynomial, factorising it, and then solving a homogeneous system for each root works perfectly for small matrices, where the polynomial is tractable. For a $3 \times 3$ matrix you face a cubic, which can still be solved by hand if you spot a rational root and then divide it out. For anything larger, hand calculation becomes unreasonable, and the polynomial roots must be computed numerically. We will return to one such method, power iteration, later in this chapter.
Properties
A handful of identities link the eigenvalues of a matrix to the simpler invariants we have already met. They are useful both as sanity checks during hand computation and as conceptual bridges.
The first identity is that the sum of the eigenvalues equals the trace: $$\sum_i \lambda_i = \text{tr}(\mathbf{A}).$$ For our worked example, $5 + 2 = 7 = 4 + 3 = \text{tr}(\mathbf{A})$. The second identity is that the product of the eigenvalues equals the determinant: $$\prod_i \lambda_i = \det(\mathbf{A}).$$ For us, $5 \times 2 = 10 = 12 - 2 = \det(\mathbf{A})$. Both identities follow from comparing coefficients of the characteristic polynomial: written in monic form, $\det(\lambda\mathbf{I} - \mathbf{A}) = \lambda^n - \text{tr}(\mathbf{A})\lambda^{n-1} + \dots + (-1)^n\det(\mathbf{A})$, so the trace is the negative of the coefficient on $\lambda^{n-1}$ and the determinant is the constant term up to sign.
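A one-line numerical check of both identities, sketched on the worked example:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
eigenvalues = np.linalg.eigvals(A)

# Sum of eigenvalues = trace; product of eigenvalues = determinant.
assert np.isclose(eigenvalues.sum(), np.trace(A))        # 5 + 2 = 7
assert np.isclose(eigenvalues.prod(), np.linalg.det(A))  # 5 * 2 = 10
```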
Several more facts repay memorisation. Eigenvalues of $\mathbf{A}^k$ are $\lambda_i^k$, with the same eigenvectors: applying $\mathbf{A}$ twice multiplies an eigenvector by $\lambda$ twice. By extension, eigenvalues of the inverse $\mathbf{A}^{-1}$ (when it exists) are the reciprocals $1/\lambda_i$. A matrix is singular precisely when zero is an eigenvalue, since then $\det(\mathbf{A}) = \prod_i \lambda_i = 0$. A real matrix may have complex eigenvalues, but they always come in conjugate pairs $\lambda$ and $\bar{\lambda}$, because the characteristic polynomial has real coefficients. The classical example is a $2 \times 2$ rotation matrix, whose eigenvalues are $\cos\theta \pm i\sin\theta = e^{\pm i\theta}$, complex numbers on the unit circle, and which has no real eigenvectors at all when $\theta$ is not a multiple of $\pi$, reflecting the fact that such a rotation preserves no real direction.
For a real symmetric matrix something stronger is true, and we will devote a separate subsection to it: the eigenvalues are guaranteed real, and the eigenvectors can be chosen orthogonal to one another. This rescues the geometry from the indignity of complex numbers and underwrites a great deal of practical machine learning.
Eigendecomposition
If an $n \times n$ matrix $\mathbf{A}$ has $n$ linearly independent eigenvectors $\mathbf{v}_1, \dots, \mathbf{v}_n$ with corresponding eigenvalues $\lambda_1, \dots, \lambda_n$, we can package the lot into matrix form. Stack the eigenvectors as the columns of a matrix $\mathbf{Q}$, and place the eigenvalues along the diagonal of a diagonal matrix $\boldsymbol{\Lambda}$. Then $$\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1}.$$ This is the eigendecomposition, also called diagonalisation. To see why it holds, look at what each side does to a column of $\mathbf{Q}$. The right-hand side first applies $\mathbf{Q}^{-1}$, which turns $\mathbf{v}_i$ into the basis vector $\mathbf{e}_i$; then $\boldsymbol{\Lambda}$ scales $\mathbf{e}_i$ by $\lambda_i$; then $\mathbf{Q}$ turns it back into $\lambda_i\mathbf{v}_i$. The left-hand side, applying $\mathbf{A}$ directly to $\mathbf{v}_i$, also produces $\lambda_i\mathbf{v}_i$, by the definition of an eigenpair. The two sides agree on a basis, so they agree everywhere.
A more conceptual way to read the formula: $\mathbf{Q}^{-1}$ is the change of coordinates from the standard basis to the eigenbasis; $\boldsymbol{\Lambda}$ is the matrix in eigen-coordinates, where it is purely diagonal; and $\mathbf{Q}$ changes back. Eigendecomposition rewrites a complicated matrix as "go into the right coordinate frame, scale each axis independently, come back out".
For our worked example, the eigenvectors $\mathbf{v}_1 = (1, 1)^\top$ and $\mathbf{v}_2 = (1, -2)^\top$ assemble into $$\mathbf{Q} = \begin{pmatrix} 1 & 1 \\ 1 & -2 \end{pmatrix}, \qquad \boldsymbol{\Lambda} = \begin{pmatrix} 5 & 0 \\ 0 & 2 \end{pmatrix}.$$ The inverse $\mathbf{Q}^{-1}$ can be computed directly (its determinant is $-3$); multiplying out $\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1}$ recovers the original $\mathbf{A}$ exactly.
The decomposition pays off most spectacularly when you need to compute a function of $\mathbf{A}$. Take powers, for instance: $$\mathbf{A}^k = (\mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1})^k = \mathbf{Q}\boldsymbol{\Lambda}^k\mathbf{Q}^{-1},$$ because the inner copies of $\mathbf{Q}^{-1}\mathbf{Q}$ collapse into the identity. Since $\boldsymbol{\Lambda}$ is diagonal, $\boldsymbol{\Lambda}^k$ is just the diagonal matrix with $\lambda_i^k$ on the diagonal, trivial to compute. Computing $\mathbf{A}^{10}$ by repeated multiplication needs nine matrix multiplications; via eigendecomposition it needs one diagonal exponentiation and two cheap matrix products. The same trick generalises to any analytic function of a matrix: $f(\mathbf{A}) = \mathbf{Q} f(\boldsymbol{\Lambda}) \mathbf{Q}^{-1}$, where $f$ is applied entry-wise to the diagonal. The matrix exponential $e^\mathbf{A}$, used in the analytical solution of linear differential equations and in continuous-time recurrent networks, is the most famous example.
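A short sketch of the power trick, again on the worked example; `np.linalg.matrix_power` is used only as an independent check.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Eigendecomposition A = Q Lambda Q^{-1}.
lam, Q = np.linalg.eig(A)

# A^10 via the decomposition: only the diagonal gets raised to the power.
A_pow10 = Q @ np.diag(lam ** 10) @ np.linalg.inv(Q)

# Agrees with straightforward repeated multiplication.
assert np.allclose(A_pow10, np.linalg.matrix_power(A, 10))
```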
Not every matrix is diagonalisable. A matrix with fewer than $n$ linearly independent eigenvectors is called defective. The canonical example is the shear $$\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix},$$ which has $\lambda = 1$ as a double root of the characteristic polynomial but only the one-dimensional eigenspace spanned by $(1, 0)^\top$. The Jordan canonical form generalises eigendecomposition to defective matrices, but it is numerically fragile and rarely used in computational practice. In machine learning we almost always either work with diagonalisable matrices or fall back on the singular value decomposition, which exists for any matrix.
Spectral theorem (symmetric matrices)
The most useful version of eigendecomposition is reserved for symmetric matrices, those with $\mathbf{A} = \mathbf{A}^\top$. The spectral theorem states three closely linked facts. First, every real symmetric matrix has $n$ real eigenvalues, counted with multiplicity. Second, eigenvectors corresponding to distinct eigenvalues are automatically orthogonal, and within any eigenspace one can choose an orthonormal basis, so altogether the matrix admits an orthonormal basis of eigenvectors. Third, the resulting decomposition $$\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$$ uses an orthogonal matrix $\mathbf{Q}$, satisfying $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$. Notice the elegance: the inverse has been replaced by the transpose. Computing the decomposition no longer requires inverting a matrix, only transposing it.
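In code, this is what `np.linalg.eigh`, the routine specialised to symmetric (Hermitian) matrices, delivers. A minimal sketch with a randomly generated symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
S = B + B.T                        # a matrix plus its transpose is symmetric

# eigh returns real eigenvalues (in ascending order) and orthonormal eigenvectors.
lam, Q = np.linalg.eigh(S)

assert np.allclose(Q.T @ Q, np.eye(4))         # Q is orthogonal
assert np.allclose(Q @ np.diag(lam) @ Q.T, S)  # S = Q Lambda Q^T
```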
Why does this matter so much in practice? Because the matrices that arise most often in machine learning are symmetric. Covariance matrices are symmetric by their construction as $\mathbb{E}[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^\top]$. Hessians of twice-differentiable scalar loss functions are symmetric by the equality of mixed partial derivatives. Kernel matrices of valid kernels are symmetric and positive semi-definite. Graph Laplacians of undirected graphs are symmetric. In every one of these settings the spectral theorem hands us, free of charge, a real eigenvalue spectrum and an orthogonal eigenbasis, the cleanest possible structure.
A symmetric matrix is positive definite if all its eigenvalues are strictly positive, and positive semi-definite if they are all non-negative. Positive definiteness corresponds to the quadratic form $\mathbf{x}^\top \mathbf{A} \mathbf{x}$ being strictly positive for non-zero $\mathbf{x}$, which makes it a generalised notion of "positivity" for matrices. Hessians at strict local minima are positive definite; covariance matrices are positive semi-definite. For a positive definite matrix, the ratio $\lambda_{\max} / \lambda_{\min}$ of the largest eigenvalue to the smallest is the condition number, and it controls how sensitive numerical solutions are to small perturbations of the data.
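A sketch of these definitions on a covariance matrix, which is symmetric and positive semi-definite by construction; the data and dimensions here are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))   # 200 samples, 3 features
C = np.cov(X, rowvar=False)         # 3 x 3 covariance matrix

lam = np.linalg.eigvalsh(C)         # real eigenvalues, ascending order
assert np.all(lam >= -1e-12)        # non-negative up to round-off: C is PSD

cond = lam[-1] / lam[0]             # condition number lambda_max / lambda_min
print(cond)
```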
Where eigenvalues appear in AI
Eigenvalues are not a niche topic; they thread through machine learning at every layer. A non-exhaustive survey:
- Hessians of loss landscapes. The Hessian at a point summarises local curvature. If all eigenvalues are positive the surface is locally a bowl; gradient descent will home in on the minimum at a rate set by the eigenvalue spectrum, with the smallest eigenvalue setting the slowest direction. If some eigenvalues are negative the point is a saddle. The eigenvalue spectrum of the Hessian therefore diagnoses the geometry of the loss and explains why optimisers like Adam, which precondition the gradient, outperform plain gradient descent on poorly conditioned problems.
- Covariance matrices and PCA. Principal component analysis (which we shall meet in §2.8) finds the eigenvectors of the data's covariance matrix. The top eigenvectors point in the directions of maximum variance; their eigenvalues record how much variance lies along each. Projecting onto the top few eigenvectors yields a low-dimensional summary that preserves most of the data's spread.
- Graph Laplacians and spectral clustering. The Laplacian matrix of an undirected graph has eigenvalues whose smallest values, near zero, encode the graph's cluster structure. Spectral clustering uses the eigenvectors corresponding to small Laplacian eigenvalues to embed nodes in a low-dimensional space where standard clustering becomes easy. Modern graph neural networks also lean on the Laplacian spectrum to define convolution-like operations on graphs.
- Power iteration. Computing eigenvalues by solving polynomials does not scale. The simplest large-matrix algorithm is power iteration: pick a random vector and repeatedly multiply by $\mathbf{A}$, normalising at each step. The result converges to the eigenvector of largest absolute eigenvalue, because that direction is amplified most aggressively by repeated multiplication (a short sketch follows this list). Google's original PageRank used power iteration on the web's link matrix to surface "important" pages. Variants extend the method to interior eigenvalues.
- Stability of dynamical systems. A linear recurrence $\mathbf{h}_t = \mathbf{W}\mathbf{h}_{t-1}$, the skeleton of a recurrent neural network, multiplies the hidden state by $\mathbf{W}$ at every step. After $t$ steps the state has been multiplied by $\mathbf{W}^t$, which by eigendecomposition behaves like $\boldsymbol{\Lambda}^t$ in eigen-coordinates. If the largest eigenvalue of $\mathbf{W}$ exceeds one in absolute value, the state explodes; if it is below one, the state vanishes. This is the algebraic root of the vanishing- and exploding-gradient problems that haunt RNN training and motivated the gated architectures of LSTMs and GRUs.
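Below is a minimal sketch of power iteration, run on the worked example from earlier in this section; the function name `power_iteration` and its parameters are illustrative, not part of any library.

```python
import numpy as np

def power_iteration(A, num_steps=1000, seed=0):
    """Estimate the eigenpair of A with the largest |eigenvalue|."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_steps):
        v = A @ v                    # the dominant direction is amplified most
        v /= np.linalg.norm(v)       # renormalise to avoid overflow/underflow
    # The Rayleigh quotient gives the matching eigenvalue estimate.
    return v @ A @ v, v

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
lam, v = power_iteration(A)
print(lam)   # ~5.0, the dominant eigenvalue
print(v)     # ~(0.707, 0.707), the direction of (1, 1), up to sign
```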
What you should take away
- An eigenpair $(\lambda, \mathbf{v})$ of a matrix $\mathbf{A}$ satisfies $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$ with $\mathbf{v}$ non-zero, a special direction that the matrix merely scales by a factor of $\lambda$.
- Eigenvalues are roots of the characteristic polynomial $\det(\mathbf{A} - \lambda\mathbf{I}) = 0$; eigenvectors come from solving the homogeneous system $(\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0}$ for each eigenvalue.
- The trace equals the sum of eigenvalues, the determinant equals their product, and zero is an eigenvalue precisely when the matrix is singular, three handy bridges between scalar invariants and the spectrum.
- When a matrix has $n$ independent eigenvectors, $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{-1}$ rewrites it as a change of basis, a diagonal scaling, and a change back, making powers, exponentials and other matrix functions cheap to compute.
- For symmetric matrices the spectral theorem gives real eigenvalues and an orthonormal eigenbasis, $\mathbf{A} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^\top$. This is the version that powers PCA, Hessian analysis, kernel methods and graph spectra throughout machine learning.