2.4a Matrix factorisations: LU, Cholesky, QR

Almost every quantitative problem in artificial intelligence eventually reduces to the same humble question: given a matrix $\mathbf{A}$ and a vector $\mathbf{b}$, find an $\mathbf{x}$ that satisfies $\mathbf{A}\mathbf{x} = \mathbf{b}$. Training a linear regression, fitting the final layer of a neural network with a closed-form solver, computing posterior means in a Bayesian model, propagating uncertainty through a Gaussian process, taking a Newton step in optimisation, projecting a vector onto a subspace: all of these end the same way, with someone, somewhere, asking a computer to solve a linear system. The naive answer (compute $\mathbf{A}^{-1}$ and multiply) is almost always the wrong one. It is more expensive than necessary, more numerically fragile than necessary, and it throws away the structure that makes the problem tractable in the first place.

The right answer is to factorise the matrix. A factorisation rewrites $\mathbf{A}$ as a product of simpler matrices (triangular, orthogonal, diagonal, or some combination), and once that decomposition is in hand, solves become almost free. The intuition is the same one that runs through prime factorisation in arithmetic: a number broken into its prime factors reveals divisibility, common factors, and shortcuts to multiplication. A matrix broken into structured factors reveals invertibility, conditioning, geometry, and shortcuts to solves.

Three factorisations dominate practical numerical linear algebra: LU for general invertible matrices, Cholesky for the symmetric positive-definite ones that arise from covariances and Hessians, and QR for orthogonalisation problems and least squares. Section 2.4 introduced the structural ideas of rank and the four fundamental subspaces; this section turns those structural facts into algorithms. The same machinery powers least squares (§2.7), Gaussian process inference, second-order optimisation, the closed-form solvers used inside scikit-learn's LinearRegression and Ridge, and the LAPACK routines wrapped by NumPy, SciPy, PyTorch, and JAX. Three small ideas, three short algorithms, and much of modern computational practice falls out of them.

Symbols Used Here

  $\mathbf{A}, \mathbf{B}$ : square matrices
  $\mathbf{L}$ : lower-triangular matrix
  $\mathbf{U}$ : upper-triangular matrix (in LU)
  $\mathbf{P}$ : permutation matrix (in $\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}$)
  $\mathbf{Q}$ : orthogonal matrix
  $\mathbf{R}$ : upper-triangular matrix (in QR)
  $\mathbf{D}$ : diagonal matrix

LU decomposition

The LU decomposition is Gaussian elimination with bookkeeping. For any invertible matrix $\mathbf{A}$ we can write $$\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U},$$ where $\mathbf{L}$ is unit lower-triangular (ones on the diagonal, zeros above), $\mathbf{U}$ is upper-triangular (zeros below the diagonal), and $\mathbf{P}$ is a permutation matrix that records any row swaps used to keep the arithmetic stable. The elimination steps you carried out by hand at school (subtract this multiple of row 1 from row 2, then this multiple of row 1 from row 3, and so on) are precisely what populate $\mathbf{L}$ and $\mathbf{U}$. The row-echelon form you arrive at is $\mathbf{U}$. The multipliers you used along the way, slotted into the lower triangle, give $\mathbf{L}$.

Why bother packaging elimination in this way? Because once you possess the factorisation, the solve $\mathbf{A}\mathbf{x} = \mathbf{b}$ becomes two cheap triangular solves:

  1. Apply the permutation: form $\mathbf{P}\mathbf{b}$.
  2. Forward-substitute on $\mathbf{L}\mathbf{y} = \mathbf{P}\mathbf{b}$. Because $\mathbf{L}$ is lower-triangular, the first equation determines $y_1$ immediately, the second determines $y_2$ given $y_1$, and so on. The cost is $O(n^2)$.
  3. Back-substitute on $\mathbf{U}\mathbf{x} = \mathbf{y}$. Symmetrically, the last equation determines $x_n$, the second-to-last determines $x_{n-1}$, and so on, again at $O(n^2)$.

Computing the factorisation itself is $O(n^3)$, the same asymptotic cost as inverting $\mathbf{A}$, but with a smaller constant and far better numerical behaviour. The payoff is that additional right-hand sides come essentially free: factorise once, then handle each new $\mathbf{b}$ in $O(n^2)$. In machine learning this matters whenever the same operator acts on many vectors, for example when computing predictions for a batch of new test points using a fixed regression model, or when solving the same Hessian system across multiple Newton steps.
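
A minimal sketch of the factorise-once, solve-many pattern, using SciPy's LU routines (the matrix and right-hand sides below are random stand-ins for a real problem):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))   # a generic (almost surely invertible) matrix
B = rng.standard_normal((500, 20))    # twenty right-hand sides

# Factorise once: O(n^3). lu_factor returns the packed LU factors plus pivot indices.
lu_piv = lu_factor(A)

# Every subsequent solve is just two triangular substitutions: O(n^2) per column.
X = lu_solve(lu_piv, B)

print(np.allclose(A @ X, B))          # True, up to round-off
```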

The permutation matrix $\mathbf{P}$ is not cosmetic. Without it, encountering a small or zero pivot during elimination would either crash the algorithm or amplify rounding error catastrophically. Partial pivoting swaps rows at each step so that the largest available entry sits on the diagonal, keeping the multipliers in $\mathbf{L}$ bounded by one in magnitude. The result is the workhorse algorithm, Gaussian elimination with partial pivoting, that has powered scientific computing since the 1950s.

A small worked example makes the procedure concrete. Take $$\mathbf{A} = \begin{pmatrix} 2 & 1 & 1 \\ 4 & 3 & 3 \\ 8 & 7 & 9 \end{pmatrix}.$$ Eliminate below the first pivot: row 2 minus 2·row 1 gives $(0, 1, 1)$; row 3 minus 4·row 1 gives $(0, 3, 5)$. Now eliminate below the second pivot: row 3 minus 3·row 2 gives $(0, 0, 2)$. The upper factor is $$\mathbf{U} = \begin{pmatrix} 2 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 2 \end{pmatrix},$$ and the multipliers we used (2, 4, and 3) populate the strict lower triangle of $$\mathbf{L} = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 4 & 3 & 1 \end{pmatrix}.$$ No row swaps were required, so $\mathbf{P} = \mathbf{I}$. As a side benefit, the determinant of $\mathbf{A}$ falls out for free: $\det \mathbf{A} = (\det \mathbf{P})(\det \mathbf{L})(\det \mathbf{U}) = 1 \cdot 1 \cdot (2 \cdot 1 \cdot 2) = 4$. This is exactly how np.linalg.det evaluates determinants behind the scenes.
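
The hand computation is easy to check numerically; note that library routines such as scipy.linalg.lu apply partial pivoting, so they would return row-permuted factors satisfying $\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}$ rather than exactly the $\mathbf{L}$ and $\mathbf{U}$ above.

```python
import numpy as np

A = np.array([[2., 1., 1.],
              [4., 3., 3.],
              [8., 7., 9.]])
L = np.array([[1., 0., 0.],
              [2., 1., 0.],
              [4., 3., 1.]])
U = np.array([[2., 1., 1.],
              [0., 1., 1.],
              [0., 0., 2.]])

print(np.allclose(L @ U, A))   # True: the hand-worked factors reproduce A
print(np.linalg.det(A))        # ~4.0, the product of U's diagonal entries
```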

Cholesky decomposition

When the matrix you are working with is symmetric positive-definite, meaning $\mathbf{A} = \mathbf{A}^\top$ and $\mathbf{x}^\top \mathbf{A} \mathbf{x} > 0$ for every non-zero $\mathbf{x}$, the LU decomposition collapses into something stronger and cheaper. The Cholesky factorisation is $$\mathbf{A} = \mathbf{L}\mathbf{L}^\top,$$ where $\mathbf{L}$ is lower-triangular with strictly positive diagonal entries. Symmetry means we no longer need a separate $\mathbf{U}$: it is just $\mathbf{L}^\top$, encoding the same numbers transposed. Positive-definiteness guarantees that every diagonal entry of $\mathbf{L}$ is the square root of a positive quantity, so no pivoting strategy is needed to keep the algorithm stable. The result is a factorisation that costs roughly half that of LU (one triangular factor instead of two), is uniquely determined, and is a numerical workhorse.

Symmetric positive-definite matrices are not exotic. They are the matrices of inner products, of covariances, and of the second derivatives of convex functions. So they appear constantly in AI, every time you fit a Gaussian, take a Newton step on a convex loss, or compute a kernel matrix.

Take a concrete $2 \times 2$ example: $$\mathbf{A} = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix}.$$ The recipe reads off entry-by-entry. The top-left entry of $\mathbf{L}$ is $L_{11} = \sqrt{A_{11}} = \sqrt{4} = 2$. The off-diagonal is $L_{21} = A_{21}/L_{11} = 2/2 = 1$. The bottom-right is $L_{22} = \sqrt{A_{22} - L_{21}^2} = \sqrt{5 - 1} = 2$. So $$\mathbf{L} = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}.$$ Verify by multiplying out: $$\mathbf{L}\mathbf{L}^\top = \begin{pmatrix} 2 & 0 \\ 1 & 2 \end{pmatrix}\begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix} = \begin{pmatrix} 4 & 2 \\ 2 & 5 \end{pmatrix} = \mathbf{A}. \checkmark$$
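
The same numbers fall out of NumPy in one call:

```python
import numpy as np

A = np.array([[4., 2.],
              [2., 5.]])

L = np.linalg.cholesky(A)          # returns the lower-triangular factor
print(L)                           # [[2. 0.]
                                   #  [1. 2.]]
print(np.allclose(L @ L.T, A))     # True
```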

Three uses dominate. Gaussian process inference spends most of its time factorising kernel matrices, because the predictive mean and variance both reduce to triangular solves once a Cholesky is in hand; on a dataset of $n$ points the dominant cost of GP regression is the single $O(n^3)$ Cholesky of the kernel matrix. Bayesian linear regression and other conjugate-Gaussian models rely on the same trick to compute posterior means and credible intervals. Multivariate Gaussian sampling uses the elegant identity that if $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ then $\boldsymbol{\mu} + \mathbf{L}\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ whenever $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^\top$. Drawing correlated Gaussian samples therefore costs one Cholesky plus a matrix-vector product. This identity is also why diffusion models, normalising flows, and many variational-inference routines treat the Cholesky factor of a covariance as a first-class object: it is the bridge between independent noise and structured noise.
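
A short sketch of the sampling identity, with a made-up mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])       # symmetric positive-definite

L = np.linalg.cholesky(Sigma)             # Sigma = L @ L.T

# Independent standard-normal noise in, correlated Gaussian samples out.
z = rng.standard_normal((100_000, 3))
samples = mu + z @ L.T

print(samples.mean(axis=0))               # close to mu
print(np.cov(samples, rowvar=False))      # close to Sigma
```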

A practical note worth flagging early: the Cholesky algorithm doubles as a positive-definiteness test. If at any step the algorithm tries to take the square root of a non-positive number, the matrix is not positive-definite; scipy.linalg.cholesky and np.linalg.cholesky report exactly this failure by raising a LinAlgError, and it is often the first signal that a covariance estimate has gone wrong, perhaps because of accumulated round-off or because the matrix really is rank-deficient. The standard remedy is to add a small jitter $\varepsilon \mathbf{I}$ to the diagonal, a trick used routinely inside Gaussian-process libraries.
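
In code the defensive pattern looks something like the sketch below (the function name and jitter schedule are illustrative, not a standard API):

```python
import numpy as np

def safe_cholesky(K, jitter=1e-8, max_tries=5):
    """Cholesky with escalating diagonal jitter, as GP libraries do internally."""
    eye = np.eye(K.shape[0])
    eps = 0.0
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(K + eps * eye)
        except np.linalg.LinAlgError:      # raised when K + eps*I is not positive-definite
            eps = jitter if eps == 0.0 else eps * 10
    raise np.linalg.LinAlgError("matrix is not positive-definite even after adding jitter")
```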

QR decomposition

The third workhorse handles a different problem: orthogonalisation. For any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m \ge n$, the QR factorisation writes $$\mathbf{A} = \mathbf{Q}\mathbf{R},$$ where $\mathbf{Q}$ has orthonormal columns (so $\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}$) and $\mathbf{R}$ is upper-triangular. Geometrically, the columns of $\mathbf{Q}$ are an orthonormal basis for the column space of $\mathbf{A}$, while $\mathbf{R}$ records the coefficients that re-express the original columns in that basis. Three algorithms produce QR factorisations: classical Gram–Schmidt orthogonalisation (intuitive but numerically delicate), Householder reflections (the standard, used inside LAPACK), and Givens rotations (preferred for sparse problems and incremental updates).

A small example shows the Gram–Schmidt mechanics. Take $$\mathbf{A} = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}.$$ The first column is already a unit vector, so $\mathbf{q}_1 = (1, 0)^\top$ and $R_{11} = 1$. Project the second column onto $\mathbf{q}_1$: $R_{12} = \mathbf{q}_1^\top (1, 1)^\top = 1$. Subtract that projection: $(1, 1)^\top - 1 \cdot (1, 0)^\top = (0, 1)^\top$, which already has unit length, so $\mathbf{q}_2 = (0, 1)^\top$ and $R_{22} = 1$. The factorisation is $\mathbf{Q} = \mathbf{I}$ and $\mathbf{R} = \mathbf{A}$, unsurprising, given that $\mathbf{A}$ was already upper-triangular with a positive diagonal, so it serves as its own $\mathbf{R}$.
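
Classical Gram–Schmidt fits in a few lines; the sketch below is purely pedagogical (production code uses Householder-based routines such as np.linalg.qr, whose sign conventions for $\mathbf{Q}$ and $\mathbf{R}$ may differ):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR: numerically delicate, but shows the mechanics."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j].copy()
        for i in range(j):
            R[i, j] = Q[:, i] @ A[:, j]    # coefficient of column j along q_i
            v -= R[i, j] * Q[:, i]         # strip out the component along q_i
        R[j, j] = np.linalg.norm(v)        # length of what is left
        Q[:, j] = v / R[j, j]              # normalise to get the next basis vector
    return Q, R

A = np.array([[1., 1.],
              [0., 1.]])
Q, R = gram_schmidt_qr(A)
print(Q)                        # the identity, as in the worked example
print(R)                        # equals A
print(np.allclose(Q @ R, A))    # True
```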

The reason QR matters in machine learning is least squares. Given a tall design matrix $\mathbf{A}$ and an observation vector $\mathbf{b}$, the regression problem $\min \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2$ has the textbook closed form $\mathbf{x} = (\mathbf{A}^\top \mathbf{A})^{-1} \mathbf{A}^\top \mathbf{b}$ via the normal equations. Lovely on paper, dangerous on a computer: forming $\mathbf{A}^\top \mathbf{A}$ squares the condition number, so any noise or near-collinearity in the columns of $\mathbf{A}$ blows up. The QR route is far better behaved. Substitute $\mathbf{A} = \mathbf{Q}\mathbf{R}$, simplify using $\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}$, and the normal equations collapse to the single triangular system $\mathbf{R}\mathbf{x} = \mathbf{Q}^\top \mathbf{b}$. One back-substitution, no condition-number squaring. This QR route is what LAPACK's dgels least-squares driver implements; NumPy's np.linalg.lstsq defaults to an SVD-based routine instead, which costs more but also handles rank-deficient problems.
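
A sketch of the QR route on synthetic data (the design matrix and targets below are fabricated for illustration):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))                      # tall design matrix
x_true = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
b = A @ x_true + 0.01 * rng.standard_normal(200)       # noisy observations

# Reduced QR: Q is 200x5 with orthonormal columns, R is 5x5 upper-triangular.
Q, R = np.linalg.qr(A)

# One back-substitution solves the least-squares problem: R x = Q^T b.
x = solve_triangular(R, Q.T @ b)

print(x)                                               # close to x_true
print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))   # matches lstsq
```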

QR has a second life as the engine of the QR algorithm for eigenvalues. Iterating $\mathbf{A}_k = \mathbf{Q}_k \mathbf{R}_k$, $\mathbf{A}_{k+1} = \mathbf{R}_k \mathbf{Q}_k$ produces a sequence that, under mild conditions, converges to upper-triangular form, with the eigenvalues of the original matrix sitting on its diagonal. With Hessenberg reduction and shifts to accelerate convergence, this is how LAPACK's dgeev computes spectra. The QR algorithm was named one of the top ten algorithms of the twentieth century; its discovery by Francis and Kublanovskaya in 1961 turned eigenvalue computation from a delicate art into a reliable subroutine, and almost every spectral method in modern AI ultimately rides on it.
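
A bare-bones, unshifted QR iteration makes the convergence visible (real implementations add Hessenberg reduction and shifts, so treat this purely as a demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((4, 4))
A = S + S.T                         # symmetric, so the eigenvalues are real

Ak = A.copy()
for _ in range(500):                # unshifted iteration: simple but slow to converge
    Q, R = np.linalg.qr(Ak)
    Ak = R @ Q                      # a similarity transform: same eigenvalues as A

print(np.sort(np.diag(Ak)))             # approximate eigenvalues on the diagonal
print(np.sort(np.linalg.eigvalsh(A)))   # reference values for comparison
```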

There is one more place QR earns its keep that is worth knowing about. Orthogonal weight initialisation, popular in deep recurrent networks, draws a random Gaussian matrix and uses its $\mathbf{Q}$ factor as the initial weights. Orthogonal matrices preserve norms exactly, so signals propagating through such a layer neither vanish nor explode. The whole idea rests on a single QR call.
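
A minimal version of that initialisation (the helper name is illustrative; deep-learning frameworks provide equivalents, e.g. PyTorch's nn.init.orthogonal_):

```python
import numpy as np

def orthogonal_init(shape, rng):
    """Orthogonal initialisation for a square or tall weight matrix via QR."""
    G = rng.standard_normal(shape)
    Q, R = np.linalg.qr(G)
    Q = Q * np.sign(np.diag(R))     # sign fix so the orthogonal factor is drawn uniformly
    return Q

rng = np.random.default_rng(0)
W = orthogonal_init((256, 256), rng)

v = rng.standard_normal(256)
print(np.allclose(W.T @ W, np.eye(256)))                       # orthonormal columns
print(np.isclose(np.linalg.norm(W @ v), np.linalg.norm(v)))    # norms preserved exactly
```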

Why factorisations matter

Numerical linear algebra is dominated by these three factorisations because each does several jobs at once. They convert solving systems from the cost of one full $O(n^3)$ inversion per right-hand side into a single $O(n^3)$ factorisation followed by $O(n^2)$ work per solve. They expose the structure of $\mathbf{A}$ (positive-definiteness, rank, condition number, orientation) much more cleanly than the raw entries do. They come with stable, mature algorithms whose numerical behaviour has been studied for half a century. And their hardware-tuned implementations live inside BLAS and LAPACK, accessible from NumPy, SciPy, PyTorch, and JAX through one-line interfaces. Whenever you write np.linalg.solve, scipy.linalg.cho_factor, or np.linalg.qr, you are calling a Fortran routine that has been refined for decades and benchmarked across every major CPU and GPU architecture.

A useful mental model is to treat every factorisation as a change of basis that makes a hard problem easy. LU changes coordinates so that the system becomes triangular and substitution is trivial. Cholesky does the same while exploiting symmetry to halve the work. QR rotates into an orthonormal basis in which projection and least squares become one back-substitution. The art of numerical linear algebra is choosing the right basis for the problem at hand.

In modern deep learning, explicit factorisations of full weight matrices are uncommon. Nobody computes a Cholesky of a 10-billion-parameter weight matrix; the storage alone would be prohibitive. But factorisations show up everywhere in disguise: in optimiser preconditioning (K-FAC approximates the Fisher information by a Kronecker product of small factorisable blocks; Shampoo factorises gradients along tensor axes), in Gaussian process regression (every prediction is a Cholesky solve), in classical baselines (logistic regression with Newton's method, linear discriminant analysis), in orthogonal weight initialisation (the $\mathbf{Q}$ factor of a random Gaussian matrix), in whitening of input data (a Cholesky or eigendecomposition of the empirical covariance), and in computer-graphics pipelines that transform geometry through chains of structured matrices. Even a normalisation layer such as RMSNorm can be read as a diagonal Cholesky-style rescaling. Recognising the factorisations hiding inside these techniques is one of the practical dividends of a solid linear-algebra grounding.

What you should take away

  1. Solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ is the central problem of computational linear algebra, and explicit inversion is almost never the right way to do it; factorise instead.
  2. LU factorisation packages Gaussian elimination as $\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}$, costs $O(n^3)$ once, and turns every subsequent solve into two $O(n^2)$ triangular substitutions.
  3. Cholesky factorisation $\mathbf{A} = \mathbf{L}\mathbf{L}^\top$ specialises LU to symmetric positive-definite matrices, halves the work, needs no pivoting, and is the engine behind Gaussian-process inference and multivariate Gaussian sampling.
  4. QR factorisation $\mathbf{A} = \mathbf{Q}\mathbf{R}$ provides orthonormal bases on demand and gives a numerically stable least-squares solver via $\mathbf{R}\mathbf{x} = \mathbf{Q}^\top \mathbf{b}$, sidestepping the ill-conditioned normal equations; iterating QR also computes eigenvalues.
  5. Factorisations expose structure, rank, condition, definiteness, and underpin the BLAS and LAPACK routines that NumPy, SciPy, PyTorch, and JAX call for you, so the abstractions you reach for daily in AI all rest on these three classical decompositions.
