Matrix Multiplication, Glossary, Textbook of AI

Matrix multiplication of $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$ produces $C = AB \in \mathbb{R}^{m \times p}$ with

$$C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}.$$

Each entry of $C$ is a dot product between a row of $A$ and a column of $B$. The operation is associative ($A(BC) = (AB)C$), distributive over addition ($A(B + C) = AB + AC$), but not commutative in general ($AB \neq BA$).

A linear layer in a neural network computes $Y = X W$ where $X \in \mathbb{R}^{B \times d_{\mathrm{in}}}$ is a batch of $B$ inputs of dimension $d_{\mathrm{in}}$ and $W \in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}$ is the weight matrix. Attention computes $\mathrm{softmax}(Q K^\top / \sqrt{d_k}) V$, three matrix multiplications. Convolution can be implemented as matrix multiplication after the im2col rearrangement. Embedding lookup is matrix multiplication with a one-hot vector.

Matrix multiplication has time complexity $O(mnp)$ by the naive algorithm, the dominant cost of nearly every neural-network forward and backward pass. Strassen's algorithm (1969) reduces the exponent to $\log_2 7 \approx 2.807$; subsequent algorithms (Coppersmith–Winograd, Le Gall, AlphaTensor) push the exponent further down towards $\omega \approx 2.37$. In practice, none of these is faster than highly-optimised standard implementations on real hardware until $n \gg 1000$.

Modern GPUs and TPUs are architected around matrix-multiplication units (Nvidia Tensor Cores, Google TPU MXU) that perform $4 \times 4 \times 4$ or $128 \times 128 \times 128$ matrix multiplications as primitive instructions. The fused multiply-add (FMA) operation $a \cdot b + c$ is also hardware-primitive. Together these enable petaflop-per-second matrix multiplication.

Numerical considerations: in mixed-precision training (FP16/BF16/FP8), accumulation must be in higher precision (FP32) to avoid catastrophic cancellation. Gradient compression and quantised matrix multiplication (INT8, INT4) trade precision for speed in inference workloads.

Video

Related terms: Dot Product, Attention Mechanism, Tensor Cores

Discussed in:

Chapter 2: Linear Algebra, Linear Algebra

This site is currently in Beta. Please get in touch via chrispaton.org with any suggestions, questions or comments.