2.1 Why linear algebra is the language of AI

Linear algebra is the mathematical language in which every modern artificial-intelligence system is described and computed. When a researcher writes down a neural network on a whiteboard, the symbols they use are vectors and matrices; when an engineer trains that network on a graphics processor, the operations the chip executes are vector and matrix operations; when a deployed model classifies an email or generates a sentence, what is actually happening, electrically, is a long sequence of multiplications and additions of numbers arranged in lists and grids. Vectors represent data points: an email becomes a vector, an image becomes a vector, a customer's purchase history becomes a vector. Matrices represent transformations: a single layer of a neural network, a database query that selects and reweights features, a change of coordinates from pixels to edges. Every algorithm in this textbook, from the linear regression of Chapter 6 through the transformer of Chapter 13, can be expressed as a composition of vector and matrix operations. Once you understand the small handful of operations that linear algebra provides, you understand the substrate on which all of modern AI runs.

The formal machinery begins in §2.2. The notation introduced in this chapter reappears throughout Chapters 9 to 15, where neural networks are built on top of it; the apparent complexity of a 100-billion-parameter language model collapses, on inspection, into a small number of repeated patterns.

Symbols used here

  • $\mathbf{x}$ : a vector (a list of real numbers); usually represents one data example, written in bold lower-case
  • $\mathbf{X}$ : a matrix (a two-dimensional array of numbers); often a batch of examples or a transformation, written in bold upper-case
  • $\mathbb{R}^n$ : the set of $n$-dimensional vectors of real numbers (so $\mathbf{x} \in \mathbb{R}^{784}$ means $\mathbf{x}$ is a list of 784 real numbers)
  • $\mathbb{R}^{m \times n}$ : the set of matrices with $m$ rows and $n$ columns
  • $\mathbf{X} \mathbf{W}$ : matrix multiplication of $\mathbf{X}$ by $\mathbf{W}$ (the order matters)
  • $\mathbf{X}^\top$ : the transpose of $\mathbf{X}$ (rows and columns swapped)
  • $\|\cdot\|$ : the norm, or length, of a vector
  • $\mathbf{0}$ : the zero vector (every entry equal to zero)
  • $\mathbf{I}$ : the identity matrix (ones on the diagonal, zeros elsewhere)

Three concrete examples

The fastest way to see why linear algebra matters is to follow three real systems through their core computation. Each example below describes a deployed AI application, identifies the vector or matrix that represents the input, and shows that the bulk of the work is a matrix multiplication.

Example 1: A spam filter. Imagine an inbox protected by a classifier that decides, for each incoming email, whether the message is spam or legitimate. The first thing the filter must do is convert the email, which arrives as raw text (a string of characters), into a numerical object the model can compute on. A simple and historically successful method is the bag-of-words representation. The system maintains a fixed vocabulary of, say, 30,000 words. For each incoming email, it counts how many times each vocabulary word appears in the message. The result is a list of 30,000 integers: the first entry might be the count of aardvark, the second the count of abacus, and so on, up to the 30,000th entry, which might be the count of zucchini. We write this list as a vector $\mathbf{x} \in \mathbb{R}^{30000}$. Most entries are zero, because no single email uses more than a few hundred distinct words.

The classifier itself is described by two more objects: a weight vector $\mathbf{w} \in \mathbb{R}^{30000}$, learned from a training set of labelled emails, and a single number $b$ called the bias. The decision rule is

$$\hat y = \mathbf{w}^\top \mathbf{x} + b,$$

where $\hat y$ is the model's score (positive means spam, negative means legitimate) and $\mathbf{w}^\top \mathbf{x}$ denotes the dot product of $\mathbf{w}$ and $\mathbf{x}$: the sum, over all 30,000 vocabulary positions, of the weight times the count, $\mathbf{w}^\top \mathbf{x} = \sum_{i=1}^{30000} w_i x_i$. The dot product is the single most important operation in this entire chapter, and it is worth pausing on. It multiplies two vectors of equal length entry by entry and then adds all those products together, producing one number. The model's entire decision about whether to send an email to spam is one dot product. Words that the training procedure found to be predictive of spam (viagra, winner, click) end up with large positive weights; words predictive of legitimate mail get large negative weights; neutral words get weights near zero. The dot product mixes thousands of weak signals into a single, calibrated number.
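
To make the decision rule concrete, here is a minimal NumPy sketch of the same computation. The vocabulary, the weights, and the email are invented for illustration; a real filter would use a vocabulary of tens of thousands of words and would learn $\mathbf{w}$ and $b$ from labelled training data.

```python
import numpy as np

# Toy five-word vocabulary; a real filter would use ~30,000 words.
vocab = ["click", "meeting", "viagra", "winner", "zucchini"]

def to_vector(text, vocab):
    """Bag-of-words: count how often each vocabulary word appears in the text."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

# Hypothetical learned weights (positive = spam-ish) and bias.
w = np.array([1.5, -2.0, 3.0, 2.5, 0.0])
b = -1.0

email = "winner winner click here to claim your prize"
x = to_vector(email, vocab)        # x lives in R^5 instead of R^30000

score = w @ x + b                  # the dot product w^T x, plus the bias
print(score, "spam" if score > 0 else "legitimate")
```

Here `w @ x` is exactly the dot product in the decision rule above; the only difference from the deployed system is the size of the vectors.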

Example 2: Image classification. A modern convolutional neural network (CNN) takes a colour image of size 224 pixels high by 224 pixels wide, with three colour channels (red, green, blue). Stored naively, that image is a three-dimensional array with $224 \times 224 \times 3 = 150\,528$ numbers. If you flatten the array, treating it as one long list, you get a vector $\mathbf{x} \in \mathbb{R}^{150528}$. That is the input to the network.

The network itself is a stack of layers. Deep image networks commonly have dozens to a couple of hundred layers; ResNet-152 has, as the name suggests, 152. Each layer applies a matrix multiplication followed by a non-linear function called an activation. For a layer that takes $n$-dimensional input and produces $m$-dimensional output, the core computation is

$$\mathbf{h}_{\text{out}} = \sigma(\mathbf{W}\mathbf{h}_{\text{in}} + \mathbf{b}),$$

where $\mathbf{W} \in \mathbb{R}^{m \times n}$ is the layer's weight matrix, $\mathbf{b} \in \mathbb{R}^m$ is its bias vector, $\sigma$ is the activation, and $\mathbf{h}$ stands for the layer's hidden representation of the image. A network with 100 layers performs roughly 100 such matrix multiplications to classify a single image. Training, of course, processes many images at once. With a batch size of 32, meaning 32 images shuttled through the network in parallel, every layer does its matrix multiplication on a $32 \times n$ batch matrix rather than on a single $n$-vector. That is 100 batched matrix multiplications per training step, equivalent to $100 \times 32 = 3{,}200$ per-example linear transformations, and a typical training run executes hundreds of thousands of training steps. The bulk of training compute, by an enormous margin, is matrix multiplication. Everything else (loading data, computing losses, applying activations) is rounding error in comparison.
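
The NumPy sketch below shows one such layer applied to a whole batch at once. It uses a plain fully connected layer with random weights rather than a real convolution, and the output width of 64 is arbitrary; the point is only that the batched computation is a single matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n, m = 32, 224 * 224 * 3, 64   # 32 flattened images, 64 output features

# Each row of X is one image, flattened into a vector in R^150528.
X = rng.random((batch, n), dtype=np.float32)

# One layer's parameters (random here; training would learn them).
W = rng.standard_normal((m, n), dtype=np.float32) * 0.01
b = np.zeros(m, dtype=np.float32)

def relu(z):
    """A common choice for the activation sigma."""
    return np.maximum(z, 0.0)

# h_out = sigma(W h_in + b), applied to all 32 images in one matrix product.
H_out = relu(X @ W.T + b)
print(H_out.shape)   # (32, 64)
```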

Example 3: A language model. When you type a prompt into ChatGPT or any other large language model, the system first breaks your text into tokens: short character sequences drawn from a vocabulary that was itself learned from data. The phrase "Hello, world!" might become five tokens. Each token is then converted into a vector of, say, 4096 real numbers by looking it up in an embedding matrix, a giant $V \times 4096$ table where $V$ is the vocabulary size and each row is the embedding of one token.

Once the prompt is a sequence of 4096-dimensional vectors, it passes through the transformer's stack of layers. A single transformer layer applies several matrix multiplications. The attention mechanism, which is what allows the model to relate each token to every other token in the prompt, computes three different projections of every token vector, called the query, the key, and the value, by multiplying by three separate weight matrices $\mathbf{W}_Q$, $\mathbf{W}_K$, $\mathbf{W}_V$. Then it computes attention scores by another matrix multiplication. After attention, a feed-forward block applies two further matrix multiplications. The output projection adds one more. A 100-layer model with 100 billion parameters will, for every single token it generates, perform hundreds of matrix multiplications operating on intermediate arrays totalling several gigabytes. When ChatGPT pauses for a fraction of a second and then begins streaming text back to you, that pause is the time it takes to do one round of matrix multiplications; each subsequent token costs another full sweep through the network. If you have ever wondered why language models are expensive to run, the answer is in this paragraph.
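
A minimal NumPy sketch of the first few steps, the embedding lookup, the query/key/value projections, and the attention scores, is given below. The sizes are deliberately tiny (a real model has $V$ in the tens of thousands and $d$ around 4096), the weights are random, and real transformers add multiple heads, masking, and normalisation; the point is that each step is a matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1_000, 64                    # toy vocabulary size and embedding width

E = rng.standard_normal((V, d)).astype(np.float32) * 0.02   # embedding matrix

token_ids = np.array([17, 953, 400, 23, 8])   # five made-up token ids for the prompt
H = E[token_ids]                    # prompt as a 5 x 64 matrix of token vectors

# One layer's projection matrices for queries, keys, and values.
W_Q = rng.standard_normal((d, d)).astype(np.float32) * 0.02
W_K = rng.standard_normal((d, d)).astype(np.float32) * 0.02
W_V = rng.standard_normal((d, d)).astype(np.float32) * 0.02

Q, K, Val = H @ W_Q, H @ W_K, H @ W_V          # three matrix multiplications
scores = (Q @ K.T) / np.sqrt(d)                # attention scores: another matmul
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True) # softmax over each row
attended = weights @ Val                       # weighted mix of the value vectors
print(attended.shape)                          # (5, 64)
```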

Three different applications in three different domains (text, vision, language), and the same operation does the work in all of them. That operation is matrix multiplication, applied to vectors that represent the input. This is the practical sense in which linear algebra is the language of AI: not as a piece of mathematical scaffolding bolted onto a fundamentally different subject, but as the actual, executed, electrically realised computation.

Hardware is built for it

There is one more piece of the picture, and it is not mathematical but physical. Modern computer hardware is built specifically to make matrix multiplication fast. A single NVIDIA H100 GPU, the workhorse training chip of the early 2020s, performs roughly $10^{15}$ floating-point operations per second on dense matrix multiplications using its specialised tensor cores. The same card runs general-purpose code, the kind of branching, irregular code you would write for a sorting algorithm or a database query, at perhaps $10^{13}$ operations per second. The factor of 100 between those two numbers is the central economic and engineering fact of modern AI. There is a two-order-of-magnitude reward, in throughput, for expressing your computation as matrix multiplications.

This reward is so large that it shapes which algorithms become practical at scale. The deep-learning revolution that began around 2012, with the AlexNet image classifier, then word embeddings, then sequence-to-sequence models, then transformers, was, in large part, the cumulative realisation that if you can phrase your model as Y = activation(X @ W + b) repeated many times (where @ is the Python notation for matrix multiplication), you can train it on hardware that already exists. Algorithms that fit this shape inherit the full speed of the chip. Algorithms that do not fit must either be rewritten until they do or accept a 10-to-100-times performance penalty. The penalty is so steep that, in practice, the field has gravitated strongly to architectures that reduce to matrix products. Methods like belief propagation on irregular graphs, or symbolic theorem-proving with deep search trees, or particle filters with millions of particles, do exist and are useful in their niches; but the headline systems, the trillion-parameter language models, the photorealistic image generators, the code-writing assistants, are all matrix-multiplication machines.
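
Even without a GPU, you can feel this reward on an ordinary laptop by comparing the textbook triple-loop definition of matrix multiplication, written in plain Python, with NumPy's @ operator, which hands the same computation to an optimised library routine. The sketch below is illustrative only; the exact speed-up depends on your machine, but it is usually several orders of magnitude.

```python
import time
import numpy as np

n = 200
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

def matmul_loops(A, B):
    """The textbook definition: C[i, j] = sum over k of A[i, k] * B[k, j]."""
    rows, inner, cols = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            s = 0.0
            for k in range(inner):
                s += A[i, k] * B[k, j]
            C[i, j] = s
    return C

t0 = time.perf_counter(); C_slow = matmul_loops(A, B); t_loops = time.perf_counter() - t0
t0 = time.perf_counter(); C_fast = A @ B;              t_blas = time.perf_counter() - t0

print(f"triple loop: {t_loops:.3f} s   A @ B: {t_blas:.6f} s")
print("same answer:", np.allclose(C_slow, C_fast))
```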

Tensor processing units (TPUs), GPU tensor cores, and the neural processing units (NPUs) appearing in mobile phones and laptops are all, fundamentally, physical implementations of fast matrix multiplication. They contain large grids of multiply-accumulate units arranged in patterns called systolic arrays, which stream rows of one matrix and columns of another past each other so that every clock cycle produces one entry of the product. The chip designers have committed silicon area, power budget, and memory bandwidth to one operation; everything else is secondary. When you read in the news that a company has built a 100-megawatt data centre full of GPUs, you should hear: we have built a building whose purpose is to multiply matrices very, very quickly.

This is why the abstract definitions you will meet in the next few sections (vector, matrix, dot product, matrix product, transpose) are not arbitrary choices. They are the operations the hardware makes cheap. The mathematics and the electronics co-evolved, and the result is that learning linear algebra now is also learning how the silicon thinks.

What the rest of the chapter covers

The rest of this chapter builds up the machinery the three examples above used informally. Use the list below as a map.

  • §2.2 Vectors, vector spaces, and norms. The unit of representation. We define what a vector is, how to add and scale them, and how to measure their length. The norm is what tells you whether two vectors are close, which underlies every notion of similarity in machine learning.
  • §2.3 Matrices and matrix multiplication. The unit of transformation. We define matrices, the rules for multiplying them, and the standard operations (transpose, inverse, trace). Most of the rest of the chapter is the structural theory of this one operation.
  • §2.4 Linear maps, rank, and the four subspaces. The structural theory. Matrices encode linear maps between vector spaces; the rank measures how much they preserve, and the four fundamental subspaces (column space, null space, row space, left null space) describe their domain and range geometrically. This is what tells you when a system of equations has a unique solution.
  • §2.4a Matrix factorisations: LU, Cholesky, QR. Efficient algorithms for solving systems of equations. The factorisations are the backbone of every numerical linear-algebra library, including the one your deep-learning framework calls under the hood.
  • §2.5 Determinants and trace. Geometric and spectral invariants. The determinant measures how a matrix scales volumes; the trace is the sum of the diagonal and equals the sum of the eigenvalues.
  • §2.6 Eigenvalues and eigenvectors. Directions in which a transformation acts as a single scalar. Eigenvalues control the long-run behaviour of repeated multiplication, which is why they show up in the analysis of optimisation, stability, and Markov chains.
  • §2.7 The singular value decomposition (SVD). The most useful single decomposition in applied mathematics. Every matrix has one; it tells you simultaneously about the rank, the dominant directions, and the pseudo-inverse.
  • §2.8 Principal component analysis (PCA) via SVD. Applied dimensionality reduction. PCA is the first thing a working data scientist tries on a high-dimensional dataset, and it falls out of SVD in two lines.
  • §2.9 Matrix calculus and vector derivatives. Gradients of vector-valued functions. This is the bridge to backpropagation: training a neural network is gradient descent in a space of millions of parameters, and matrix calculus is what makes that tractable.
  • §2.10 Tensors, broadcasting, and einsum. Generalisation to higher dimensions. Real deep-learning code works with arrays of three, four, or even five dimensions, and the conventions for indexing and combining them deserve careful treatment.
  • §2.11 Numerical stability. Why theory and float32 disagree. Floating-point arithmetic is not real arithmetic, and the difference matters whenever a number is very large, very small, or very close to another.
  • §2.12 Hardware reality. Why memory layout matters. Two algorithms with identical mathematical content can differ by a factor of 10 in wall-clock speed depending on how they touch memory.
  • §2.13 How this chapter connects forward. A pointer to where each piece will reappear in later chapters.

If you have never seen linear algebra before, the right strategy is to read §2.2 and §2.3 carefully (those two sections, together with the examples above, will carry you a long way) and then to skim §2.4 through §2.13 for orientation, returning when later chapters call on a specific result.

What you should take away

Five points to carry forward from this section into the rest of the book.

  1. Linear algebra is the lingua franca of AI. Every architecture, every algorithm, every trained model in this textbook is, at the level of executed computation, a sequence of vector and matrix operations.
  2. Data are vectors; transformations are matrices. The first step in any AI system is to encode the input (text, image, sound, table) as a vector of numbers. The next step is to apply matrices that transform those numbers into useful predictions.
  3. The dot product is the most-used operation. Whether you are scoring a spam email, computing one entry of a matrix product, or measuring similarity between two embeddings, the dot product is doing the work.
  4. Modern hardware is built for matrix multiplication. GPUs, TPUs, and NPUs are physical implementations of fast matrix arithmetic, and the 100-times speed advantage shapes which algorithms become practical.
  5. Everything else is detail. Norms, ranks, eigenvalues, decompositions, derivatives (all the apparatus of the rest of this chapter) exist to analyse, accelerate, and stabilise the basic pattern of vectors transformed by matrices.
