2.4 Linear maps, rank, and the four subspaces

Every matrix you have met so far has had a quiet second life as a function. The grid of numbers is one face of it; the other face is a rule that takes a vector in and pushes a vector out. The rule is not arbitrary. It is linear, which is a precise way of saying that it is the most well-behaved kind of function we know how to write down: scaling the input scales the output by the same amount, and adding two inputs gives the sum of their two outputs. That is all "linear" means, and almost everything in classical machine learning, and a great deal in modern deep learning, is built on top of that single property.

Gilbert Strang, who taught a generation of engineers their linear algebra at MIT, liked to say that any matrix is fully described by four subspaces, two associated with its inputs, and two with its outputs. Together these "four fundamental subspaces" capture the entire shape of what a matrix does: which inputs it sees, which inputs it ignores, which outputs it can reach, and which outputs lie forever beyond it. Once you can sketch those four subspaces for a matrix, you understand the linear map at a level that lets you reason about least-squares regression, principal components, and rank collapse in transformers without ever opening a numerical library.

This section promotes matrices from arrays you can multiply (§2.3) to functions on vector spaces. §2.6 refines the picture with eigenvectors, and §2.7 gives the universal factorisation (the SVD) that lets us read all four subspaces straight off the matrix.

Symbols used here

  • $\mathbf{A}$: matrix in $\mathbb{R}^{m \times n}$
  • $T_{\mathbf{A}}: \mathbb{R}^n \to \mathbb{R}^m$: linear map, $T_{\mathbf{A}}(\mathbf{x}) = \mathbf{A}\mathbf{x}$
  • $\text{rank}(\mathbf{A})$: rank, the dimension of the column space
  • $N(\mathbf{A})$: null space, $\{\mathbf{x}: \mathbf{A}\mathbf{x} = \mathbf{0}\}$
  • $C(\mathbf{A})$: column space, the span of the columns
  • $R(\mathbf{A})$: row space, the span of the rows
  • $N(\mathbf{A}^\top)$: left null space
  • $\dim$: dimension

Linear maps

A function $T: \mathbb{R}^n \to \mathbb{R}^m$ is called linear when two simple rules hold for every choice of vectors $\mathbf{x}, \mathbf{y}$ and every scalar $\alpha$:

$$T(\mathbf{x} + \mathbf{y}) = T(\mathbf{x}) + T(\mathbf{y}), \qquad T(\alpha \mathbf{x}) = \alpha\, T(\mathbf{x}).$$

The first rule says the function respects addition; the second says it respects scaling. Together they rule out almost everything you can think of (squaring, taking sines, adding a constant) and leave behind a small but useful family of functions. A function as simple as $f(x) = x + 1$ is not linear in this sense, because $f(0) = 1$ rather than zero, and a linear map is forced to send the zero vector to the zero vector.

The fundamental theorem behind this section is that every linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$ can be written as multiplication by some $m \times n$ matrix, and every $m \times n$ matrix defines a linear map. The connection is mechanical: the $j$-th column of the matrix is whatever the map does to the $j$-th standard basis vector $\mathbf{e}_j = (0, \ldots, 1, \ldots, 0)$. Once you know how the map treats those special inputs, linearity tells you how it treats every other input.

A worked example fixes the picture. Take a 90-degree anti-clockwise rotation in the plane. To find its matrix, ask what the rotation does to the two standard basis vectors. The vector $(1, 0)$ pointing east rotates to $(0, 1)$ pointing north. The vector $(0, 1)$ pointing north rotates to $(-1, 0)$ pointing west. Stack these results as columns:

$$\mathbf{R} = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}.$$

Try it on $(1, 0)$: the matrix-vector product gives $(0, 1)$. Try it on $(0, 1)$: the product gives $(-1, 0)$. Try it on a less special vector such as $(2, 3)$: the product is $(-3, 2)$, which is indeed the original vector turned a quarter-turn anticlockwise. The whole rotation has been packed into four numbers.
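
If you want to check this numerically, a minimal NumPy sketch (assuming NumPy is available; R below is just the matrix above) reproduces all three products:

```python
import numpy as np

# Columns of R are the images of the standard basis vectors
R = np.array([[0, -1],
              [1,  0]])

print(R @ np.array([1, 0]))  # [0 1]: east rotates to north
print(R @ np.array([0, 1]))  # [-1 0]: north rotates to west
print(R @ np.array([2, 3]))  # [-3 2]: a quarter-turn of (2, 3)
```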

Rank

The rank of a matrix is one of the most useful single numbers in linear algebra. Informally, it measures how much "real" information the matrix carries. Formally, it is the number of linearly independent columns, and, by a non-obvious but reassuring theorem, it is also the number of linearly independent rows. The two counts always agree.

Consider

$$\mathbf{A} = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \end{pmatrix}.$$

There are three columns, so the rank is at most three. But there are only two rows, so the rank is at most two. Look more closely: the second row is exactly twice the first, so the two rows together carry only one row's worth of information. The rank is therefore one. Equivalently, every column is a multiple of $(1, 2)^\top$: the second column is two times the first, the third is three times the first. From the column point of view, only one independent direction is on offer.

Compare with the $2 \times 2$ identity matrix

$$\mathbf{I}_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},$$

whose two columns are the standard basis vectors. Neither is a multiple of the other, so both are independent and the rank is two. A matrix whose rank equals the smaller of its two dimensions is called full rank; otherwise it is rank deficient. Rank deficiency is what makes a square system $\mathbf{A}\mathbf{x} = \mathbf{b}$ either insoluble or ambiguous, and it is what makes data matrices in ML compressible: a $1000 \times 1000$ matrix of rank 10 can be rebuilt from twenty length-1000 vectors (ten column factors and ten row factors), twenty thousand numbers in place of a million. The smaller the rank, the flatter the structure inside the matrix, and the more aggressively it can be summarised.

Rank can be read off from row reduction (Gaussian elimination): reduce the matrix to row echelon form and count the non-zero rows. In practice, with floating-point matrices, one computes the singular value decomposition and counts the singular values that exceed a small tolerance (see §2.7). Either way, rank is a property of the linear map, not of the particular numbers in the matrix: matrices related by invertible changes of basis on the input and output sides share the same rank.
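
The floating-point route is easy to sketch in NumPy; the tolerance below mirrors the default used by np.linalg.matrix_rank, and the matrix is the rank-1 example above:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6]])

# Count singular values above a small tolerance
# (this mirrors what np.linalg.matrix_rank does by default).
s = np.linalg.svd(A, compute_uv=False)
tol = s.max() * max(A.shape) * np.finfo(float).eps
print(s)                         # one clearly non-zero value, one ~0
print(int(np.sum(s > tol)))      # 1
print(np.linalg.matrix_rank(A))  # 1, the same computation packaged up
```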

A useful mental rule: rank is bounded above by the smaller of $m$ and $n$, so a tall thin matrix can never have rank greater than its number of columns, and a short wide matrix can never have rank greater than its number of rows. Most randomly generated matrices are full rank; rank deficiency is the special case, and it is precisely the special case that we exploit when we compress data, regularise models, or factor a weight update into two low-rank pieces.

Column space

The column space of a matrix $\mathbf{A}$, written $C(\mathbf{A})$, is the set of all vectors of the form $\mathbf{A}\mathbf{x}$ as $\mathbf{x}$ ranges over $\mathbb{R}^n$. It is exactly the image of the linear map: the collection of every output the map can possibly produce. Multiplying $\mathbf{A}$ by a vector $\mathbf{x}$ takes a weighted combination of the columns of $\mathbf{A}$, with the entries of $\mathbf{x}$ as weights, so the column space is also the span of the columns. These two definitions describe the same set from opposite angles, and you should keep both pictures in mind.

The column space is a subspace of $\mathbb{R}^m$ (the output space). Its dimension is the rank: that is the working definition of rank we use most often. A full-rank matrix with $m \le n$ has a column space that fills all of $\mathbb{R}^m$; a rank-deficient matrix has a column space that is a strict subspace, usually a flat plane or line through the origin sitting inside the larger space.

Returning to the rank-1 matrix from the previous section, $\mathbf{A} = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 4 & 6 \end{pmatrix}$, the columns $(1,2)^\top$, $(2,4)^\top$ and $(3,6)^\top$ all lie on the line through the origin in the direction of $(1,2)^\top$. The column space is exactly that one-dimensional line. No matter what input $\mathbf{x}$ you feed in, no matter how clever or unusual, the output is forced to land on that line. The map can stretch along the line, but it cannot escape it.
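
A short illustrative check (NumPy, with random inputs of my own choosing) shows the outputs pinned to that line: every output's second entry is twice its first:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [2, 4, 6]])

rng = np.random.default_rng(0)
for _ in range(3):
    x = rng.standard_normal(3)   # an arbitrary input in R^3
    y = A @ x
    # every output lies on the line through (1, 2): second entry = 2 * first
    print(y, np.isclose(y[1], 2 * y[0]))   # always True
```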

Null space

The null space $N(\mathbf{A})$ collects the inputs that the matrix sends to zero: every $\mathbf{x}$ with $\mathbf{A}\mathbf{x} = \mathbf{0}$. It is a subspace of $\mathbb{R}^n$ (the input space) and its dimension is $n - \text{rank}(\mathbf{A})$, a relation known as the rank–nullity theorem. The intuition is bookkeeping: of the $n$ input directions, some number equal to the rank survive into the column space, and the rest collapse to the origin.

A non-trivial null space is a sign that the map throws information away. If two different inputs $\mathbf{x}_1$ and $\mathbf{x}_2$ produce the same output, then their difference lies in the null space. So the null space measures the failure of the map to be injective, the directions along which the map cannot tell inputs apart.

Let us work it out for $\mathbf{B} = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$. The matrix has rank 1 (the second row is twice the first, just as before), so the null space should have dimension $2 - 1 = 1$. To find a basis vector, solve $\mathbf{B}\mathbf{x} = \mathbf{0}$, which is the single equation $x_1 + 2x_2 = 0$ (the second equation is a copy). Setting $x_2 = 1$ gives $x_1 = -2$, so $(-2, 1)^\top$ spans the null space. We verify by direct calculation:

$$\mathbf{B}\begin{pmatrix} -2 \\ 1 \end{pmatrix} = \begin{pmatrix} 1\cdot(-2) + 2\cdot 1 \\ 2\cdot(-2) + 4\cdot 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.\ \checkmark$$

Geometrically, the null space is the line through the origin in the direction $(-2, 1)$. Every point on that line is squashed to the origin. The map collapses a two-dimensional plane onto a one-dimensional line, and the null space records the direction of collapse.
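
Numerically, a basis for the null space can be read off the singular value decomposition, previewing §2.7: right-singular vectors whose singular values are (near) zero span $N(\mathbf{B})$. A minimal sketch, correct up to sign and scaling:

```python
import numpy as np

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])

# Right-singular vectors with (near-)zero singular values span N(B)
U, s, Vt = np.linalg.svd(B)
null_basis = Vt[s < 1e-10].T             # one column, proportional to (-2, 1)
print(null_basis * np.sqrt(5))           # +/- [[-2], [1]], up to sign
print(np.allclose(B @ null_basis, 0))    # True: it really is sent to zero
```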

Row space and left null space

The remaining two subspaces come from applying the same constructions to the transpose $\mathbf{A}^\top$. The row space $R(\mathbf{A})$ is the span of the rows of $\mathbf{A}$, equivalently $C(\mathbf{A}^\top)$. It lives in $\mathbb{R}^n$ (because each row has $n$ entries) and has dimension equal to the rank; it is the same rank on both sides because, by the theorem mentioned earlier, row rank equals column rank. The left null space $N(\mathbf{A}^\top)$ is the set of vectors $\mathbf{y}$ in $\mathbb{R}^m$ with $\mathbf{A}^\top \mathbf{y} = \mathbf{0}$, equivalently $\mathbf{y}^\top \mathbf{A} = \mathbf{0}^\top$. Its dimension is $m - \text{rank}(\mathbf{A})$.

Here is the picture Strang draws on the board. The input space $\mathbb{R}^n$ splits cleanly into two pieces:

$$\mathbb{R}^n = R(\mathbf{A}) \oplus N(\mathbf{A}),$$

where the symbol $\oplus$ means "direct sum of orthogonal subspaces". Every input vector decomposes uniquely as a part in the row space and a part in the null space, and the two parts are at right angles. The same structure holds on the output side:

$$\mathbb{R}^m = C(\mathbf{A}) \oplus N(\mathbf{A}^\top),$$

so every output decomposes into a column-space part plus a left-null-space part, again at right angles. The matrix sends the row-space half of the input bijectively onto the column space, and annihilates the null-space half. The left null space contains the directions in the output that the matrix can never reach.

This orthogonal decomposition is the foundation of two of the most useful procedures in applied linear algebra. Least-squares regression projects a target vector $\mathbf{b}$ onto the column space of the design matrix, removing the part that lies in the left null space (which is by definition the residual). The pseudo-inverse formalises this projection and produces the minimum-norm solution when the system is under-determined. Principal component analysis chooses a basis for the row space that captures variance most efficiently; the SVD makes the choice canonical. All of this rests on the four subspaces lining up as orthogonal complements.
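
Here is a minimal least-squares sketch; the design matrix and target are invented for illustration, and the final check confirms that the residual is orthogonal to every column, i.e. that it lies in the left null space:

```python
import numpy as np

# A small overdetermined system: 3 equations, 2 unknowns
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])      # b does not lie in C(A)

# Least squares: project b onto the column space of A
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
b_hat = A @ x_hat                  # the projection of b onto C(A)
residual = b - b_hat               # the part of b in N(A^T)

print(A.T @ residual)              # ~[0, 0]: orthogonal to every column
```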

It also explains a curious feature of any linear map: the actual "work" happens between two equally-sized spaces, the row space (in the input) and the column space (in the output), both of dimension equal to the rank. Surrounding them are two reservoirs, the null space and the left null space, that contain exactly the directions the map cannot see and cannot produce.

Worked $3 \times 2$ example

Let us pin all four subspaces down on a small example. Take

$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 6 \\ 5 & 10 \end{pmatrix}, \qquad m = 3, \quad n = 2.$$

Look at the columns: $(1, 3, 5)^\top$ and $(2, 6, 10)^\top$. The second is exactly twice the first, so the columns are dependent and the rank is at most one. Both columns are non-zero, so the rank is at least one. Therefore $\text{rank}(\mathbf{A}) = 1$.

Column space. All columns are multiples of $(1, 3, 5)^\top$, so $C(\mathbf{A}) = \text{span}\{(1, 3, 5)^\top\}$, a one-dimensional line through the origin in $\mathbb{R}^3$. Whatever vector you feed in, the output lands somewhere on that line.

Row space. The rows are $(1, 2)$, $(3, 6)$, and $(5, 10)$; every row is a multiple of $(1, 2)$. So $R(\mathbf{A}) = \text{span}\{(1, 2)\}$, again a one-dimensional line, but this time in the input space $\mathbb{R}^2$. As required, $\dim R = \dim C = \text{rank} = 1$.

Null space. We solve $\mathbf{A}\mathbf{x} = \mathbf{0}$, which in fully written-out form is the three equations $x_1 + 2x_2 = 0$, $3x_1 + 6x_2 = 0$, and $5x_1 + 10x_2 = 0$. They are all the same equation. Setting $x_2 = 1$ gives $x_1 = -2$, so $N(\mathbf{A}) = \text{span}\{(-2, 1)^\top\}$, dimension 1.

Left null space. We need $\mathbf{y} = (y_1, y_2, y_3)^\top$ with $\mathbf{A}^\top \mathbf{y} = \mathbf{0}$, which gives the two equations $y_1 + 3y_2 + 5y_3 = 0$ and $2y_1 + 6y_2 + 10y_3 = 0$. The second is twice the first, so there is really one constraint on three unknowns. Two free parameters remain, so $\dim N(\mathbf{A}^\top) = 2$. A basis: take $y_2 = 1, y_3 = 0$ to get $(-3, 1, 0)$; take $y_2 = 0, y_3 = 1$ to get $(-5, 0, 1)$. These two vectors span the plane in $\mathbb{R}^3$ that the matrix can never reach.

Dimensional bookkeeping. Let us check that everything balances:

$$\dim R(\mathbf{A}) + \dim N(\mathbf{A}) = 1 + 1 = 2 = n,\ \checkmark$$ $$\dim C(\mathbf{A}) + \dim N(\mathbf{A}^\top) = 1 + 2 = 3 = m.\ \checkmark$$

The input space $\mathbb{R}^2$ splits into a one-dimensional row space and a one-dimensional null space; the output space $\mathbb{R}^3$ splits into a one-dimensional column space and a two-dimensional left null space. The map carries the row-space line in the input bijectively onto the column-space line in the output, and squashes everything else.
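
As a preview of §2.7, the SVD hands over orthonormal bases for all four subspaces at once. A minimal NumPy sketch for this matrix; the computed basis vectors agree with the ones found above up to sign and scaling:

```python
import numpy as np

A = np.array([[1.0,  2.0],
              [3.0,  6.0],
              [5.0, 10.0]])

U, s, Vt = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))   # numerical rank, here 1

col_space  = U[:, :r]        # basis for C(A), parallel to (1, 3, 5)
left_null  = U[:, r:]        # basis for N(A^T), a plane in R^3
row_space  = Vt[:r].T        # basis for R(A), parallel to (1, 2)
null_space = Vt[r:].T        # basis for N(A), parallel to (-2, 1)

print(np.allclose(A @ null_space, 0))            # True
print(np.allclose(A.T @ left_null, 0))           # True
print(np.allclose(row_space.T @ null_space, 0))  # True: orthogonal in the input
print(np.allclose(col_space.T @ left_null, 0))   # True: orthogonal in the output
```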

Why this matters in AI

The four-subspace picture is not a museum piece; it shows up directly in modern machine learning.

  • Underdetermined systems ($n > m$, more parameters than constraints) have a non-trivial null space, so $\mathbf{A}\mathbf{x} = \mathbf{b}$ has either no solution or infinitely many. When it has many, the standard remedy is the minimum-norm solution: among all $\mathbf{x}$ that solve the system, pick the one closest to the origin. The pseudo-inverse delivers it. The linearised training equations of an overparameterised neural network near a perfect fit are exactly this kind of system.

  • Overdetermined systems ($m > n$, more equations than unknowns) generically have $\mathbf{b} \notin C(\mathbf{A})$, the target vector lies partly outside the column space. The least-squares solution is the projection of $\mathbf{b}$ onto $C(\mathbf{A})$, removing the unattainable left-null-space component as residual. Linear regression is exactly this projection.

  • Low-rank matrices compress dramatically. A rank-$r$ matrix in $\mathbb{R}^{m \times n}$ can be stored as two thinner matrices of total size $r(m + n)$, often vastly smaller than $mn$. This is the backbone of principal component analysis (§2.8), of recommendation systems based on matrix factorisation, and of LoRA fine-tuning, where a frozen pretrained weight matrix is updated by a small low-rank correction with a tiny memory footprint; see the sketch after this list.

  • Hidden representations in neural networks are governed by the rank of intermediate weight matrices. Each layer's matrix has a rank that bounds the richness of features it can carry; a layer that drifts toward low effective rank has lost expressive capacity. Rank collapse in deep transformers, where token representations across positions become near-multiples of a single direction, is a well-known failure mode that designers actively guard against with skip connections, layer normalisation, and careful initialisation.

  • Diagnosing models. When a trained network behaves strangely, looking at the singular value spectra of its weight matrices often reveals the trouble. A matrix whose singular values trail off rapidly is effectively low-rank and may have collapsed during training; a matrix whose singular values are uniformly large may indicate failure to specialise. The four-subspace vocabulary gives you a precise language for these diagnoses, far more useful than vague talk about "model capacity".
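
To make the compression and LoRA bullets concrete, here is an illustrative sketch; the names W, B, A follow the usual low-rank-update convention, and the initialisation scale is an assumption for demonstration, not a training recipe:

```python
import numpy as np

# Storage arithmetic for a rank-r factorisation
m, n, r = 1000, 1000, 10
print(m * n)        # 1_000_000 entries in the full matrix
print(r * (m + n))  # 20_000 entries in the two factors: a 50x saving

# Hypothetical LoRA-style update: frozen weight plus a rank-r correction
rng = np.random.default_rng(0)
W = rng.standard_normal((m, n))         # pretrained weight, frozen
B = rng.standard_normal((m, r)) * 0.01  # trainable factor (illustrative init)
A = rng.standard_normal((r, n)) * 0.01  # trainable factor (illustrative init)

def forward(x):
    # W @ x plus the rank-r correction, without ever forming W + B @ A
    return W @ x + B @ (A @ x)

x = rng.standard_normal(n)
print(np.allclose(forward(x), (W + B @ A) @ x))   # True
```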

What you should take away

  1. A matrix is a linear map: $\mathbf{A}\mathbf{x}$ takes inputs in $\mathbb{R}^n$ and produces outputs in $\mathbb{R}^m$, respecting addition and scaling.
  2. Rank is the number of independent rows, equivalently the number of independent columns, equivalently the dimension of either the row space or the column space.
  3. Every matrix carries four subspaces: row space and null space inside the input, column space and left null space inside the output. On each side they are orthogonal complements.
  4. Rank–nullity ($\dim R + \dim N = n$ and $\dim C + \dim N(\mathbf{A}^\top) = m$) is the bookkeeping that makes the four-subspace picture consistent.
  5. The same picture reappears throughout AI: under- and over-determined systems, low-rank compression in PCA and LoRA, and rank collapse in transformers all live or die by the dimensions of these four subspaces.
