2.3 Matrices and matrix multiplication
In artificial intelligence, matrices are everywhere. They appear as collections of data examples, where each row is a single sample (a patient, a sentence, an image) and each column is a feature of that sample (a measurement, a token frequency, a pixel intensity). They appear as transformations, where a single layer of a neural network is, at its core, a matrix that takes the input vector and produces the output vector for the next layer. They appear as descriptions of relationships, where the entry in row $i$ and column $j$ records the similarity between item $i$ and item $j$, or whether node $i$ is connected to node $j$ in a graph. The training of a modern large language model is, almost in its entirety, a long sequence of matrix multiplications: the operation is so central to deep learning that GPUs are essentially highly parallel matrix-multiplication engines.
This section stacks vectors into rectangles and develops their algebra; §2.4 revisits the same operation as a linear map, a function that sends vectors to vectors while preserving addition and scalar multiplication.
What a matrix is
Concretely, a matrix is a rectangular array of real numbers organised into rows and columns. Consider
$$\mathbf{A} = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix}.$$
This is a $2 \times 3$ matrix: two rows, three columns. The convention "rows by columns" is universal and worth memorising at once, because confusing it with "columns by rows" leads to dimension-mismatch errors that cost hours to debug. The set of all real matrices of this shape is written $\mathbb{R}^{2 \times 3}$, and more generally $\mathbb{R}^{m \times n}$ denotes matrices with $m$ rows and $n$ columns.
To refer to a single entry, we use two subscripts. The first subscript names the row, the second names the column. So $A_{12}$ is the entry in row 1, column 2, which in the example above is the number $2$. Likewise $A_{23}$ is in row 2, column 3, which is $6$. Some authors and many programming languages use zero-based indexing, in which the first row is row 0; this textbook follows the mathematical convention of starting at 1, but the underlying object is the same and you should be able to translate between conventions on sight.
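As a concrete check on the indexing convention, here is a minimal NumPy sketch; NumPy itself is zero-based, so the mathematical entry $A_{12}$ lives at index `[0, 1]`:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(A.shape)   # (2, 3): two rows, three columns
print(A[0, 1])   # 2 -> the mathematical entry A_12 (row 1, column 2)
print(A[1, 2])   # 6 -> the mathematical entry A_23 (row 2, column 3)
```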
Three different ways of looking at a matrix recur throughout AI, and you should learn to switch between them fluently:
The row-of-features view. Each row is one data point and each column is one feature. A matrix of size $m \times n$ then represents $m$ examples, each described by $n$ measurements. A clinical dataset might have one row per patient and columns for age, weight, blood pressure, and so on. This is the dominant view in classical statistics and tabular machine learning, and it is how a CSV file becomes a matrix in NumPy or pandas.
The transformation view. A matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ is a function that consumes a vector of length $n$ and produces a vector of length $m$. The fully-connected layers of a neural network are exactly this: each layer is a matrix that maps the incoming activation vector to the outgoing activation vector, with a non-linear function applied afterwards. Section 2.4 develops this view in depth.
The relationship view. The entry $A_{ij}$ records something about the pair $(i, j)$: how similar two documents are, whether two friends are connected on a social network, how many times user $i$ has clicked on item $j$. Adjacency matrices, similarity matrices, and confusion matrices all fall under this heading.
The same array can play any of these roles. What changes is the meaning we attach to its rows and columns. A skilled practitioner moves between views without conscious effort.
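As an illustration of the row-of-features view, the sketch below builds a small patient table by hand (the column names and numbers are made up for the example) and extracts the underlying matrix; reading a real file with `pandas.read_csv` would yield the same kind of object.

```python
import pandas as pd

# A tiny, made-up clinical table: one row per patient, one column per feature.
df = pd.DataFrame({
    "age":         [34, 61, 47],
    "weight_kg":   [70.0, 82.5, 64.2],
    "systolic_bp": [118, 141, 126],
})

X = df.to_numpy()   # the same data as a 3 x 3 matrix
print(X.shape)      # (3, 3): three examples, three features
```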
Transpose
The transpose flips a matrix across its main diagonal: row $i$ becomes column $i$. Formally, if $\mathbf{A}$ is $m \times n$, then $\mathbf{A}^\top$ is the $n \times m$ matrix with $(A^\top)_{ij} = A_{ji}$. The subscripts are swapped: that is the whole definition.
For the matrix above,
$$\mathbf{A}^\top = \begin{pmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{pmatrix}.$$
The original $\mathbf{A}$ was $2 \times 3$; the transpose is $3 \times 2$. Notice that the first row of $\mathbf{A}$, namely $(1, 2, 3)$, has become the first column of $\mathbf{A}^\top$.
Two algebraic facts about the transpose are worth committing to memory.
First, transposing twice returns the original matrix: $(\mathbf{A}^\top)^\top = \mathbf{A}$. The transpose is its own inverse. This is obvious from the definition (swapping subscripts twice leaves them where they started), but it is used so often that it deserves its own line.
Second, the transpose of a product reverses the order of the factors: $(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top$. The reversal is necessary because the inner dimensions have to keep matching. This identity will appear in nearly every gradient derivation in the rest of the book. When backpropagation pushes an error signal from the output of a layer back to its input, the matrix that propagates the signal is the transpose of the matrix that produced the output. Gradients flow through transposes, and the transpose of a product reverses order: these two facts are the algebraic skeleton of backprop.
The transpose also unifies dot products with matrix products: the dot product of two column vectors can be written $\mathbf{u}^\top \mathbf{v}$, the row vector $\mathbf{u}^\top$ multiplying the column vector $\mathbf{v}$ to yield a $1 \times 1$ matrix, a scalar. This is what lets almost all of linear algebra be expressed in a single notation.
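A quick numerical sanity check of both transpose identities, using small random matrices with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

# Transposing twice returns the original matrix.
print(np.allclose(A.T.T, A))               # True

# The transpose of a product reverses the order of the factors.
print(np.allclose((A @ B).T, B.T @ A.T))   # True

# The dot product as a matrix product: u^T v is a 1 x 1 result.
u = rng.standard_normal((3, 1))            # a column vector
v = rng.standard_normal((3, 1))
print((u.T @ v).shape)                     # (1, 1)
print((u.T @ v).item(), np.dot(u.ravel(), v.ravel()))  # the same number twice
```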
Matrix-vector multiplication
The simplest non-trivial operation involving a matrix is to multiply it by a vector. If $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x} \in \mathbb{R}^n$, the product $\mathbf{y} = \mathbf{A}\mathbf{x}$ is a vector in $\mathbb{R}^m$. The number of columns of $\mathbf{A}$ must equal the length of $\mathbf{x}$, and the resulting vector has the same length as the number of rows of $\mathbf{A}$. The defining formula is
$$y_i = \sum_{j=1}^n A_{ij}\, x_j \qquad \text{for } i = 1, 2, \ldots, m.$$
In words, the $i$-th entry of the output is computed by walking along the $i$-th row of the matrix, multiplying each entry by the corresponding entry of $\mathbf{x}$, and summing.
A worked example. Take
$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix}, \qquad \mathbf{x} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}.$$
The matrix is $3 \times 2$ and the vector has length $2$, so the inner dimensions match and the result will be a vector of length $3$. Computing entry by entry:
$$\mathbf{y} = \mathbf{A}\mathbf{x} = \begin{pmatrix} 1\cdot 1 + 2\cdot 1 \\ 3\cdot 1 + 4\cdot 1 \\ 5\cdot 1 + 6\cdot 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 7 \\ 11 \end{pmatrix}.$$
There are two equivalent ways of reading this calculation, and both deserve to be in your head simultaneously.
View 1, entries as dot products of rows. Each entry of $\mathbf{y}$ is the dot product of the corresponding row of $\mathbf{A}$ with $\mathbf{x}$. The first row $(1, 2)$ dotted with $(1, 1)$ gives $3$; the second row $(3, 4)$ dotted with $(1, 1)$ gives $7$; the third row $(5, 6)$ dotted with $(1, 1)$ gives $11$. This is the view that matches the formula and is most useful when you want to compute by hand.
View 2, output as a linear combination of columns. The output $\mathbf{y}$ can also be read as a recipe for combining the columns of $\mathbf{A}$. The first column is $(1, 3, 5)^\top$, the second column is $(2, 4, 6)^\top$, and the entries of $\mathbf{x}$, namely $1$ and $1$, tell us how much of each column to take. So
$$\mathbf{A}\mathbf{x} = 1 \cdot \begin{pmatrix} 1 \\ 3 \\ 5 \end{pmatrix} + 1 \cdot \begin{pmatrix} 2 \\ 4 \\ 6 \end{pmatrix} = \begin{pmatrix} 3 \\ 7 \\ 11 \end{pmatrix},$$
which agrees with the previous calculation as it must.
The second view is more than a curiosity. It tells us something deep about what matrix-vector multiplication can produce: the output $\mathbf{A}\mathbf{x}$ is always some linear combination of the columns of $\mathbf{A}$, no matter what $\mathbf{x}$ we choose. The set of all such linear combinations, every output that $\mathbf{A}$ can possibly produce as $\mathbf{x}$ ranges over $\mathbb{R}^n$, is called the column space of $\mathbf{A}$. If a target vector $\mathbf{b}$ does not lie in the column space, then no $\mathbf{x}$ exists with $\mathbf{A}\mathbf{x} = \mathbf{b}$; the equation has no solution. This observation is the key that unlocks much of section 2.4 and the geometry of least-squares regression.
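The same worked example in NumPy, computed once with the built-in product and once as an explicit combination of the columns, as a minimal sketch:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])
x = np.array([1, 1])

# View 1: the library routine (each output entry is a row of A dotted with x).
print(A @ x)                             # [ 3  7 11]

# View 2: the same output as a linear combination of the columns of A.
print(x[0] * A[:, 0] + x[1] * A[:, 1])   # [ 3  7 11]
```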
Matrix-matrix multiplication
Multiplying two matrices is the natural generalisation of multiplying a matrix by a vector. If $\mathbf{A} \in \mathbb{R}^{m \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times n}$, the product $\mathbf{C} = \mathbf{A}\mathbf{B}$ is an $m \times n$ matrix whose entries are given by
$$C_{ij} = \sum_{p=1}^k A_{ip}\, B_{pj}.$$
The shared dimension $k$ is called the inner dimension. It must be the same for both matrices: the number of columns of $\mathbf{A}$ must equal the number of rows of $\mathbf{B}$. The outer dimensions, $m$ and $n$, become the shape of the result. A useful mnemonic: writing the shapes side by side, $(m \times k)(k \times n) = (m \times n)$, the inner $k$s cancel and the outer dimensions survive.
A fully worked $2 \times 2$ example. Let
$$\mathbf{A} = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}, \qquad \mathbf{B} = \begin{pmatrix} 5 & 6 \\ 7 & 8 \end{pmatrix}.$$
To compute the entry in row $1$, column $1$ of $\mathbf{C} = \mathbf{A}\mathbf{B}$, take the dot product of the first row of $\mathbf{A}$ with the first column of $\mathbf{B}$: $1 \cdot 5 + 2 \cdot 7 = 5 + 14 = 19$. For row $1$, column $2$: $1 \cdot 6 + 2 \cdot 8 = 6 + 16 = 22$. For row $2$, column $1$: $3 \cdot 5 + 4 \cdot 7 = 15 + 28 = 43$. For row $2$, column $2$: $3 \cdot 6 + 4 \cdot 8 = 18 + 32 = 50$. Assembling the entries,
$$\mathbf{C} = \mathbf{A}\mathbf{B} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}.$$
The product satisfies several familiar-looking properties.
Associative. $(\mathbf{A}\mathbf{B})\mathbf{C} = \mathbf{A}(\mathbf{B}\mathbf{C})$. The order in which we group the multiplications does not change the answer, so we can freely write $\mathbf{A}\mathbf{B}\mathbf{C}$ without parentheses. Associativity is what allows a deep neural network to compute its forward pass layer by layer, and it lets compilers re-associate long chains of matrix multiplications to minimise the total number of arithmetic operations.
Distributive. $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{A}\mathbf{B} + \mathbf{A}\mathbf{C}$ and $(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{A}\mathbf{C} + \mathbf{B}\mathbf{C}$. Multiplication distributes over addition, just as it does for ordinary numbers.
Identity. There is a special matrix, the identity $\mathbf{I}$, such that $\mathbf{A}\mathbf{I} = \mathbf{A} = \mathbf{I}\mathbf{A}$ whenever the shapes are compatible. The identity is described in detail below.
But there is one property that very much does not hold: matrix multiplication is not commutative. In general, $\mathbf{A}\mathbf{B} \ne \mathbf{B}\mathbf{A}$, even when both products are defined. To make this concrete, compute $\mathbf{B}\mathbf{A}$ for the same $\mathbf{A}$ and $\mathbf{B}$ as above:
$$\mathbf{B}\mathbf{A} = \begin{pmatrix} 5\cdot 1 + 6\cdot 3 & 5\cdot 2 + 6\cdot 4 \\ 7\cdot 1 + 8\cdot 3 & 7\cdot 2 + 8\cdot 4 \end{pmatrix} = \begin{pmatrix} 23 & 34 \\ 31 & 46 \end{pmatrix}.$$
Comparing: $\mathbf{A}\mathbf{B} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$ but $\mathbf{B}\mathbf{A} = \begin{pmatrix} 23 & 34 \\ 31 & 46 \end{pmatrix}$. Different numbers throughout. The order in which you multiply matters, and confusing $\mathbf{A}\mathbf{B}$ with $\mathbf{B}\mathbf{A}$ is one of the most common sources of incorrect derivations in machine learning. When you reach the chain rule for matrix-valued functions in section 2.9 and chapter 9, the non-commutativity will dictate that gradients propagate through transposes of matrices in a particular order; get the order wrong and the dimensions will not even match, never mind the numbers.
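A short check of the worked example, and of the fact that the two orders give different matrices:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

print(A @ B)                         # [[19 22], [43 50]]
print(B @ A)                         # [[23 34], [31 46]]
print(np.array_equal(A @ B, B @ A))  # False: matrix multiplication is not commutative
```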
For now, it is enough to internalise the rule: never silently swap the order of matrix products.
Three views of matrix multiplication
Just as there were two equivalent views of matrix-vector multiplication, there are several equivalent views of matrix-matrix multiplication. All three of the following formulae compute the same matrix $\mathbf{C} = \mathbf{A}\mathbf{B}$, but each foregrounds a different aspect of the operation, and each is useful in different contexts. A serious linear-algebra practitioner can switch fluidly between them.
View 1: Row-by-column (the by-the-book formula). The entry $C_{ij}$ is the dot product of row $i$ of $\mathbf{A}$ with column $j$ of $\mathbf{B}$:
$$C_{ij} = \mathbf{a}_{i,:} \cdot \mathbf{b}_{:,j} = \sum_{p=1}^k A_{ip} B_{pj}.$$
This is the view encoded in the definition. It is the right way to think when you are computing the product by hand or asking what a particular entry equals. It is also the textbook starting point because it makes the dimensional bookkeeping transparent: the dot product of a $k$-vector with a $k$-vector is a scalar, and the result is a single entry of the $m \times n$ output.
View 2: Linear combination of columns. The $j$-th column of $\mathbf{C}$ equals $\mathbf{A}$ times the $j$-th column of $\mathbf{B}$:
$$\mathbf{c}_j = \mathbf{A}\mathbf{b}_j.$$
By the column view of matrix-vector multiplication, $\mathbf{A}\mathbf{b}_j$ is a linear combination of the columns of $\mathbf{A}$, with coefficients given by the entries of $\mathbf{b}_j$. Therefore every column of $\mathbf{C}$ is a linear combination of the columns of $\mathbf{A}$, full stop. This has a useful consequence: the column space of $\mathbf{C}$ is contained in the column space of $\mathbf{A}$. Whatever vectors $\mathbf{C}$ can produce as outputs, $\mathbf{A}$ alone could already produce. This is the right view to take when you are reasoning about what a matrix product can and cannot generate, and it is fundamental to discussions of rank and column space.
View 3: Sum of outer products. The product can also be written as a sum of $k$ rank-one matrices:
$$\mathbf{A}\mathbf{B} = \sum_{p=1}^k \mathbf{a}_p \mathbf{b}_p^\top,$$
where $\mathbf{a}_p$ is the $p$-th column of $\mathbf{A}$ (a column vector) and $\mathbf{b}_p^\top$ is the $p$-th row of $\mathbf{B}$ (a row vector). Each individual term $\mathbf{a}_p \mathbf{b}_p^\top$ is an outer product: a column times a row, producing an $m \times n$ matrix whose every entry is the product of one entry from $\mathbf{a}_p$ and one entry from $\mathbf{b}_p^\top$. An outer product has rank $1$, meaning all of its columns are scalar multiples of a single vector. The sum of $k$ rank-one matrices has rank at most $k$, so this view immediately bounds the rank of any product: $\operatorname{rank}(\mathbf{A}\mathbf{B}) \le \min(\operatorname{rank}(\mathbf{A}), \operatorname{rank}(\mathbf{B})) \le k$. View 3 is the seed of low-rank decompositions, of the singular value decomposition (section 2.7), and of the LoRA fine-tuning trick that lets us cheaply adapt a frozen large language model by adding a small sum of outer products to its weight matrices.
The three views are all correct, and a fluent reader keeps them all in mind: View 1 to compute, View 2 to reason about column spaces and what outputs are possible, View 3 to reason about rank, decomposition, and approximation. Many proofs in linear algebra and many algorithms in deep learning hinge on switching from one view to another at the right moment.
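The sketch below computes the same product three times, once per view, and checks that all three agree; the shapes are chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n = 4, 3, 5
A = rng.standard_normal((m, k))
B = rng.standard_normal((k, n))

# View 1: entry (i, j) is the dot product of row i of A with column j of B.
C1 = np.empty((m, n))
for i in range(m):
    for j in range(n):
        C1[i, j] = A[i, :] @ B[:, j]

# View 2: column j of C is A times column j of B.
C2 = np.column_stack([A @ B[:, j] for j in range(n)])

# View 3: the product is a sum of k outer products (column of A times row of B).
C3 = sum(np.outer(A[:, p], B[p, :]) for p in range(k))

print(np.allclose(C1, A @ B), np.allclose(C2, A @ B), np.allclose(C3, A @ B))  # True True True
```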
Special matrices
Certain matrices have so much structure that they are given names of their own and are worth knowing on sight.
The identity matrix $\mathbf{I}_n$ is the $n \times n$ matrix with $1$s on the main diagonal and $0$s everywhere else. For example,
$$\mathbf{I}_2 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \mathbf{I}_3 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$
The identity does to matrices what the number $1$ does to ordinary numbers: it leaves them unchanged. For any matrix $\mathbf{A}$ with compatible shape, $\mathbf{I}\mathbf{A} = \mathbf{A}$ and $\mathbf{A}\mathbf{I} = \mathbf{A}$. Geometrically, the identity is the linear map that does nothing: every vector is sent to itself. The identity will reappear when we define matrix inverses ($\mathbf{A}^{-1}$ is the matrix that, when multiplied with $\mathbf{A}$, gives $\mathbf{I}$) and when we describe orthogonal matrices.
Diagonal matrices have non-zero entries only on the main diagonal. The notation $\operatorname{diag}(d_1, d_2, \ldots, d_n)$ denotes the diagonal matrix with the listed entries:
$$\operatorname{diag}(2, 5, 7) = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 7 \end{pmatrix}.$$
Multiplying a vector by a diagonal matrix scales each component independently: $\operatorname{diag}(2, 5, 7) (x_1, x_2, x_3)^\top = (2x_1, 5x_2, 7x_3)^\top$. Multiplying a matrix on the left by a diagonal matrix scales its rows; multiplying on the right scales its columns. Diagonal matrices are cheap to multiply, cheap to invert (provided no diagonal entry is zero), and play the starring role in the eigendecomposition (section 2.6) and the singular value decomposition (section 2.7), in which any matrix is rewritten as a product of three matrices with a diagonal one in the middle.
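A brief numerical check of the identity and diagonal behaviour just described:

```python
import numpy as np

D = np.diag([2, 5, 7])
x = np.array([1.0, 1.0, 1.0])
print(D @ x)                           # [2. 5. 7.]: each component scaled independently

A = np.arange(9).reshape(3, 3)
print(np.allclose(np.eye(3) @ A, A))   # True: the identity leaves A unchanged
print(D @ A)                           # rows of A scaled by 2, 5, 7
print(A @ D)                           # columns of A scaled by 2, 5, 7
```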
Symmetric matrices satisfy $\mathbf{A} = \mathbf{A}^\top$: flipping across the diagonal leaves them unchanged. A worked example:
$$\mathbf{S} = \begin{pmatrix} 1 & 2 \\ 2 & 3 \end{pmatrix}.$$
The off-diagonal entries are mirrored across the main diagonal, so transposing does nothing. Symmetric matrices arise constantly in AI: every covariance matrix is symmetric (because $\operatorname{Cov}(X, Y) = \operatorname{Cov}(Y, X)$); every Gram or similarity matrix $\mathbf{X}^\top \mathbf{X}$ is symmetric; every Hessian of a twice-differentiable scalar function is symmetric (a consequence of the equality of mixed partial derivatives). The spectral theorem of section 2.6 says that real symmetric matrices have a complete set of real eigenvalues and orthogonal eigenvectors, a structure that underlies principal component analysis and most of unsupervised learning.
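A quick way to see a Gram matrix's symmetry numerically; the data matrix here is random and only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 4))   # 10 samples, 4 features (arbitrary sizes)
G = X.T @ X                        # the Gram matrix of the features

print(np.allclose(G, G.T))         # True: X^T X is always symmetric
```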
Orthogonal matrices are square matrices $\mathbf{Q}$ that satisfy $\mathbf{Q}^\top \mathbf{Q} = \mathbf{I}$. Equivalently, the columns of $\mathbf{Q}$ are mutually orthogonal unit vectors. For a $2 \times 2$ example,
$$\mathbf{R}_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$$
is orthogonal for every angle $\theta$: it represents rotation by $\theta$ in the plane. More generally, every orthogonal matrix represents either a rotation (when its determinant is $+1$) or a rotation composed with a reflection (when its determinant is $-1$). Geometrically, orthogonal matrices preserve lengths and angles: if $\mathbf{Q}$ is orthogonal, then $\|\mathbf{Q}\mathbf{x}\| = \|\mathbf{x}\|$ for every $\mathbf{x}$, and the angle between any two vectors is unchanged when both are transformed by $\mathbf{Q}$. This length-preserving property makes orthogonal matrices the cleanest possible linear transformations and explains their ubiquity: they are the orientation-preserving (or reversing) rigid motions of Euclidean space. In numerical practice they appear in the QR decomposition, in the singular value decomposition, in 3D graphics rotations, and as constraints on neural-network parametrisations (orthogonal weight initialisation, orthogonal regularisation) where preserving the norm of the activations layer-by-layer prevents vanishing or exploding gradients.
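And a numerical check that a rotation matrix is orthogonal and preserves lengths (the angle is arbitrary):

```python
import numpy as np

theta = 0.7                                      # an arbitrary angle in radians
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(Q.T @ Q, np.eye(2)))           # True: Q is orthogonal
x = np.array([3.0, 4.0])
print(np.linalg.norm(x), np.linalg.norm(Q @ x))  # both (approximately) 5.0: length preserved
```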
Triangular matrices come in two flavours. An upper-triangular matrix has zeros below the main diagonal; a lower-triangular matrix has zeros above. Triangular matrices are the workhorses of practical linear algebra because solving a system $\mathbf{A}\mathbf{x} = \mathbf{b}$ when $\mathbf{A}$ is triangular is dramatically cheaper than for a general $\mathbf{A}$: the technique is called back-substitution (for upper triangular) or forward-substitution (for lower triangular), and it costs $O(n^2)$ rather than $O(n^3)$. The LU decomposition (section 2.4a) factorises any square matrix as a product of a lower-triangular and an upper-triangular matrix, reducing the solution of a general system to two triangular solves.
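To make the cost argument concrete, here is a minimal back-substitution routine for an upper-triangular system, written directly from the definition; this is a sketch, not a production solver (a tuned library routine such as `scipy.linalg.solve_triangular` is what you would use in practice):

```python
import numpy as np

def back_substitution(U, b):
    """Solve U x = b for upper-triangular U with non-zero diagonal, in O(n^2) operations."""
    n = U.shape[0]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):                      # solve for the last unknown first
        x[i] = (b[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

U = np.array([[2.0, 1.0, 1.0],
              [0.0, 3.0, 2.0],
              [0.0, 0.0, 4.0]])
b = np.array([9.0, 13.0, 8.0])
x = back_substitution(U, b)
print(x, np.allclose(U @ x, b))                         # [2. 3. 2.] True
```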
There are several other named families (sparse, banded, Toeplitz, circulant, positive definite) that you will meet as needed. The five above are the absolute minimum you should be able to recognise on sight.
Computational cost
The naive algorithm for multiplying two $n \times n$ matrices follows directly from the definition: there are $n^2$ output entries, each a dot product of two $n$-vectors costing $n$ multiplications and $n$ additions. The total cost is $O(n^3)$. Multiplying two $1000 \times 1000$ matrices requires roughly $10^9$ multiply-add operations: on the order of tens of milliseconds on a CPU with tuned BLAS, and a small fraction of a millisecond on a GPU with tensor cores. The cubic scaling is what makes very large matrix multiplications expensive: doubling $n$ multiplies the work by eight.
In 1969 Volker Strassen showed that two $n \times n$ matrices can be multiplied in $O(n^{\log_2 7}) \approx O(n^{2.807})$ scalar operations by a recursive scheme that splits each matrix into four blocks and uses seven block multiplications instead of eight. Strassen is occasionally used for very large CPU multiplications, though the recursion is awkward on GPU tensor cores. Coppersmith-Winograd-style refinements have pushed the exponent below $2.373$, but the constants are too large for practical use.
In practice, almost all matrix multiplication in deep learning is done by tuned BLAS libraries (OpenBLAS, MKL, cuBLAS) that implement the cubic algorithm with extreme attention to cache hierarchy, SIMD or tensor-core vectorisation, and tiling into blocks small enough to fit in fast memory. The difference between a naive triple-loop in Python and a single call to numpy.matmul is several orders of magnitude, not because of asymptotics, but because of constant factors. The lesson generalises: never write the inner loop yourself when an optimised library exists for your hardware.
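One way to see the constant-factor gap for yourself is to time a naive triple loop against `numpy.matmul` on the same inputs; the exact ratio depends on your hardware and BLAS build.

```python
import time
import numpy as np

def naive_matmul(A, B):
    """Textbook O(n^3) triple loop, for comparison only."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for p in range(k):
                C[i, j] += A[i, p] * B[p, j]
    return C

rng = np.random.default_rng(0)
n = 200                                  # keep it small: the pure-Python loop is slow
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

t0 = time.perf_counter(); C_naive = naive_matmul(A, B); t1 = time.perf_counter()
t2 = time.perf_counter(); C_blas = A @ B;               t3 = time.perf_counter()

print(np.allclose(C_naive, C_blas))                       # same answer either way
print(f"naive: {t1 - t0:.3f}s, numpy: {t3 - t2:.6f}s")    # the ratio depends on your machine
```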
What you should take away
- A matrix is a rectangular array of numbers, and the same array can be read as a dataset of features, as a transformation between vector spaces, or as a record of relationships; switch between views as the problem demands.
- The transpose flips rows and columns, and the rule $(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top$ (products reverse order under transposition) is the algebraic backbone of backpropagation.
- Matrix-vector multiplication can be read as either dot products of rows with the input or as a linear combination of the columns of the matrix; the second view tells you that all outputs lie in the column space.
- Matrix multiplication is associative and distributive but not commutative; in general $\mathbf{A}\mathbf{B} \ne \mathbf{B}\mathbf{A}$, and the worked $2 \times 2$ example gives $\mathbf{A}\mathbf{B} = \begin{pmatrix} 19 & 22 \\ 43 & 50 \end{pmatrix}$ but $\mathbf{B}\mathbf{A} = \begin{pmatrix} 23 & 34 \\ 31 & 46 \end{pmatrix}$.
- The three views of matrix multiplication (row-by-column for computing entries, linear combination of columns for understanding what outputs are possible, and sum of outer products for understanding rank) are all correct and complementary, and recognising the named special matrices (identity, diagonal, symmetric, orthogonal, triangular) on sight pays for itself many times over in the rest of the book.