- Represent vectors algebraically and geometrically and compute vector norms, sums, and scalar multiples
- Use the dot product to measure similarity, projection, and angle between vectors
- Perform matrix operations (addition, multiplication, transpose, inverse) and interpret them as linear transformations
- Explain the meaning of eigenvalues and eigenvectors and their role in PCA, PageRank, and spectral methods
- Describe how embeddings map discrete objects (words, users, items) into continuous vector spaces where similarity becomes geometric
When a neural network processes an image, it multiplies pixel values by weight matrices. When a search engine ranks pages, it decomposes a huge term–document matrix. When Netflix recommends a film, it computes similarity between vectors. Linear algebra is the language behind all of this.
This chapter builds that language from the ground up: vectors, the dot product, matrices, eigenvalues, and embeddings. Throughout, the focus is on geometric intuition — seeing what a matrix does to a vector — alongside the algebra.
2.1 Vectors
A vector is an ordered list of numbers: v = (v₁, v₂, …, vₙ). In 2D, you can draw it as an arrow from the origin to a point. Beyond 3D, you cannot draw it, but the maths works the same — and many AI systems routinely work in thousands or millions of dimensions.
Basic Operations
- Addition: u + v = (u₁ + v₁, u₂ + v₂, …). Geometrically, this is the tip-to-tail rule: place v's tail at u's tip and draw the arrow from the origin to the new tip. Equivalently, u + v is the diagonal of the parallelogram with sides u and v.
- Scalar multiplication: cv = (cv₁, cv₂, …). Stretches or shrinks the vector (and reverses its direction if c is negative).
These two operations define a vector space. The definition is abstract: any set of objects that supports addition and scalar multiplication (subject to a handful of axioms) qualifies as one — not just lists of numbers.
Norms
The norm measures a vector's length:
- L₂ (Euclidean): ‖v‖ = √(v₁² + v₂² + …). The most common.
- L₁: sum of absolute values. Encourages sparsity in model weights.
- L∞: maximum absolute value. Appears in adversarial robustness.
A unit vector has norm 1. Normalisation divides by the norm so vectors of different magnitudes can be compared fairly.
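A quick NumPy sketch of these three norms and of normalisation (the example vector is illustrative):

```python
import numpy as np

v = np.array([3.0, 4.0])

l2 = np.linalg.norm(v)                # Euclidean length: sqrt(9 + 16) = 5
l1 = np.linalg.norm(v, ord=1)         # sum of absolute values: 7
linf = np.linalg.norm(v, ord=np.inf)  # largest absolute entry: 4

unit = v / l2                         # normalisation: a unit vector in v's direction
```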
Vectors as Data
In AI, a vector usually represents a data point. An image with 224 × 224 pixels and 3 colour channels flattens to a vector of length 150,528. A document becomes a bag-of-words vector. A user becomes a vector of interaction features. Seeing data as vectors lets you apply distances, angles, and transformations to things that are not inherently mathematical.
Linear Independence and Bases
A set of vectors is linearly independent if none can be written as a combination of the others. The maximum number of independent vectors in a space is the dimension. A set of n independent vectors in n-dimensional space forms a basis: any vector in the space can be expressed as a unique combination of those basis vectors. The choice of basis is not unique — there are infinitely many bases for any given space. Choosing a good basis (one that reveals data structure) is a recurring theme — from PCA to autoencoders.
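Linear (in)dependence can be checked numerically via matrix rank; a small NumPy illustration (the vectors are ours):

```python
import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = v1 + 2 * v2               # dependent: a combination of v1 and v2

independent = np.column_stack([v1, v2])
dependent = np.column_stack([v1, v2, v3])

rank_ind = np.linalg.matrix_rank(independent)  # 2: both vectors are needed
rank_dep = np.linalg.matrix_rank(dependent)    # still 2: v3 adds nothing new
```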
2.2 The Dot Product
The dot product (also called the inner product or scalar product) of two vectors is u · v = Σᵢ uᵢvᵢ = u₁v₁ + u₂v₂ + … + uₙvₙ. It takes two vectors and returns a single number. Cheap to compute (n multiplications, n − 1 additions) and geometrically rich.
Geometric Meaning
u · v = ‖u‖ ‖v‖ cos θ, where θ is the angle between them.
- Positive: same general direction.
- Zero: orthogonal (perpendicular). Orthogonal vectors share no component: neither contains any part of the other, so they carry no redundant information.
- Negative: opposite directions.
Dot Products in ML
A linear classifier scores an input x by computing w · x + b. The weight vector w defines a dividing hyperplane. The dot product measures which side x falls on — and how far from the boundary it is (which relates to confidence). This picture underlies logistic regression, SVMs, and individual neurons in a neural network.
Cosine Similarity
cos θ = (u · v) / (‖u‖ ‖v‖). Ranges from −1 to +1. It ignores magnitude and compares only direction — ideal for items that differ in scale but not character (documents of different lengths, users with different activity levels). Cosine similarity is the standard metric in retrieval, recommendation, and embedding-based search.
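A minimal cosine-similarity sketch in NumPy, using toy "document" vectors to show that magnitude is ignored:

```python
import numpy as np

def cosine_similarity(u, v):
    """cos θ = (u · v) / (‖u‖ ‖v‖)."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Two documents with the same word proportions but different lengths:
short_doc = np.array([1.0, 2.0, 0.0])
long_doc = np.array([10.0, 20.0, 0.0])   # same direction, 10x the magnitude
other = np.array([0.0, 0.0, 5.0])        # no shared terms: orthogonal

sim_same = cosine_similarity(short_doc, long_doc)  # 1.0 despite the length gap
sim_diff = cosine_similarity(short_doc, other)     # 0.0: orthogonal vectors
```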
Projections
The projection of u onto v decomposes u into a component along v and a component perpendicular to it. The scalar projection is (u · v) / ‖v‖. The vector projection is [(u · v) / ‖v‖²] v. Projections are the geometric heart of least-squares regression — you are projecting the target vector onto the column space of the design matrix — and of Gram–Schmidt orthogonalisation and many dimensionality reduction methods.
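The decomposition above can be sketched directly in NumPy (the example vectors are illustrative):

```python
import numpy as np

def project(u, v):
    """Vector projection of u onto v: [(u · v) / ‖v‖²] v."""
    return (u @ v) / (v @ v) * v

u = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])

along = project(u, v)   # component of u along v
perp = u - along        # remaining component, orthogonal to v
```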
2.3 Matrices
A matrix is a rectangular grid of numbers with m rows and n columns. Every matrix defines a linear transformation — a function that maps vectors to vectors while preserving addition and scalar multiplication. Conversely, every linear transformation can be represented by a unique matrix. This two-way correspondence is one of the central insights of linear algebra.
Matrices in AI
- Neural network weights: a layer mapping 512-dimensional input to 256-dimensional output has a 256 × 512 weight matrix.
- Data matrices: m examples × n features.
- Images: height × width pixels.
- Attention scores: query–key similarity matrices.
Special Matrices
- Identity I: ones on the diagonal, zeros elsewhere. AI = IA = A.
- Diagonal: non-zero only on the diagonal. Scales each dimension independently.
- Symmetric: A = A^T^. Arises from pairwise similarities and covariances. All eigenvalues are real.
Transpose and Inverse
The transpose A^T^ swaps rows and columns. It appears constantly — for example, X^T^X (the Gram matrix) is central to linear regression.
The inverse A^−1^ is the unique matrix satisfying AA^−1^ = A^−1^A = I. A matrix that has an inverse is called invertible (or non-singular). A matrix with no inverse is singular — its rows or columns are linearly dependent, meaning the transformation collapses some dimension of the input.
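A quick NumPy illustration of an invertible and a singular matrix (the example matrices are ours):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
A_inv = np.linalg.inv(A)
check = A @ A_inv           # recovers the identity

# A singular matrix: the second row is twice the first, so the rows are
# linearly dependent and no inverse exists.
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])
det_S = np.linalg.det(S)    # 0 (up to floating-point error)
```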
Rank
The rank is the number of linearly independent rows (or columns). A rank-r matrix maps inputs onto an r-dimensional subspace. Low-rank matrices capture data with hidden low-dimensional structure. Low-rank factorisation powers collaborative filtering (Netflix Prize), latent semantic analysis, and LoRA (Hu et al., 2021) for fine-tuning large language models.
Determinant
The determinant det(A) measures how a transformation scales volumes. Zero determinant means the matrix is singular. The sign tells whether the transformation flips orientation. Determinants appear in multivariate Gaussian densities and likelihood formulas.
2.4 Matrix Operations
Matrix–Vector Multiplication
Ax gives an m-dimensional output whose i-th entry is the dot product of row i with x. The "column view" is often more revealing: Ax is a linear combination of the columns of A, with x providing the mixing weights.
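Both views give the same answer, which is easy to check in NumPy (the matrix and vector are illustrative):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # 3x2: maps 2D inputs to 3D outputs
x = np.array([10.0, 1.0])

row_view = A @ x                             # entry i = (row i) · x
col_view = x[0] * A[:, 0] + x[1] * A[:, 1]   # combination of A's columns
```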
Matrix–Matrix Multiplication
(AB)ᵢⱼ = dot product of row i of A with column j of B. Requires matching inner dimensions. Matrix multiplication is associative but not commutative — order matters.
Neural Network Layers
A single layer computes y = f(Wx + b) (Goodfellow et al., 2016). The matrix multiplication Wx is the computational bottleneck. GPUs and TPUs are built specifically to make this fast. The entire deep learning revolution can be seen as the consequence of making matrix multiplication fast enough for billions of parameters.
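A minimal sketch of such a layer in NumPy, assuming ReLU as the nonlinearity f (the text does not fix a particular f, and the random weights are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer(x, W, b):
    """One dense layer: y = f(Wx + b), with ReLU as f."""
    return relu(W @ x + b)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512))  # 512-dim input -> 256-dim output
b = np.zeros(256)
x = rng.standard_normal(512)

y = layer(x, W, b)   # shape (256,), all entries non-negative after ReLU
```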
Other Operations
- Hadamard product (A ⊙ B): element-wise multiplication. Same shape in, same shape out. Used in LSTM gates, attention, and normalisation.
- Outer product uv^T^: produces a rank-one matrix. Any matrix can be decomposed as a sum of outer products.
- Trace tr(A): sum of diagonal entries. Equals the sum of eigenvalues. The trace is invariant under cyclic permutations: tr(ABC) = tr(CAB) = tr(BCA). This property appears often when deriving gradients involving matrix expressions.
- Frobenius norm ‖A‖~F~: √(Σᵢⱼ aᵢⱼ²). Measures matrix "size," used for comparing weight matrices and regularisation.
- Broadcasting: not formal linear algebra, but essential in NumPy and PyTorch. Defines rules for element-wise operations on arrays of different shapes. Many subtle bugs come from misunderstanding broadcasting.
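A short illustration of broadcasting behaviour in NumPy (the arrays are illustrative):

```python
import numpy as np

X = np.arange(6.0).reshape(2, 3)   # a 2x3 "data matrix"
col_means = X.mean(axis=0)         # shape (3,)

# The (3,) row is stretched across both rows of X:
centred = X - col_means

# A (2, 1) column broadcasts across columns instead — a common source of
# bugs when a flat (2,) vector was intended.
row_scale = np.array([[1.0], [10.0]])
scaled = X * row_scale             # second row scaled by 10
```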
2.5 Eigenvalues & Eigenvectors
An eigenvector of matrix A is a direction that the transformation preserves: Av = λv. The scalar λ is the eigenvalue — the factor by which the vector is stretched (or compressed, or reversed). Every other vector, in general, gets rotated or sheared as well as scaled. What makes eigenvectors special is that they keep pointing the same way.
Finding Them
Solve det(A − λI) = 0 to get eigenvalues. This is a polynomial of degree n, so it has exactly n roots (counting multiplicity, and allowing complex numbers). Then solve (A − λI)v = 0 for each λ to get eigenvectors. In practice, iterative algorithms (QR, Lanczos) handle large matrices.
Eigendecomposition
If the eigenvectors form a basis: A = QΛQ^−1^. Not all matrices have this decomposition — the eigenvectors must span the space. But symmetric matrices always do, and in that case Q is orthogonal (Q^T^ = Q^−1^), giving A = QΛQ^T^. This is the spectral theorem: every symmetric matrix is just a diagonal scaling in the right coordinate system.
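The symmetric case can be checked numerically; a small NumPy sketch of A = QΛQ^T^ (the matrix is ours):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])   # symmetric, so the spectral theorem applies

# eigh is NumPy's routine for symmetric matrices: real eigenvalues,
# orthonormal eigenvectors (returned in ascending eigenvalue order).
eigvals, Q = np.linalg.eigh(A)
Lam = np.diag(eigvals)

reconstructed = Q @ Lam @ Q.T   # A = Q Λ Qᵀ
```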
SVD
The singular value decomposition works for any matrix, not just square ones. It factors A (m × n) as A = UΣV^T^, where U (m × m) and V (n × n) are orthogonal, and Σ is diagonal with non-negative entries called singular values. Truncating to the top k singular values gives the best rank-k approximation (Eckart–Young theorem).
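A sketch of truncated SVD in NumPy; the random matrix is built to have rank 2, so the rank-2 truncation recovers it exactly (illustrative data only):

```python
import numpy as np

rng = np.random.default_rng(0)
# A rank-2 matrix hidden in a 5x4 grid: product of 5x2 and 2x4 factors.
A = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the top k singular values: the best rank-k approximation
# (Eckart–Young theorem).
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
```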
Applications in AI
- PCA: eigendecompose the covariance matrix. Eigenvectors = directions of maximum variance. Eigenvalues = how much variance each captures. Project onto the top k for dimensionality reduction.
- PageRank: the principal eigenvector of the web graph's modified adjacency matrix.
- Spectral clustering: eigenvectors of a graph Laplacian partition data into clusters.
- Condition number: ratio of largest to smallest singular value. Determines how sensitive a computation is to numerical error.
- Training dynamics: the eigenvalues of the Hessian (the matrix of second derivatives of the loss) govern how fast gradient descent converges in each direction. Large eigenvalue ratios cause slow, oscillating training.
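The PCA recipe from the first bullet can be sketched end to end in NumPy (the synthetic data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points stretched along the direction (3, 1), plus a little noise:
base = rng.standard_normal((200, 1)) @ np.array([[3.0, 1.0]])
X = base + 0.1 * rng.standard_normal((200, 2))

Xc = X - X.mean(axis=0)              # centre the data
cov = Xc.T @ Xc / (len(Xc) - 1)      # covariance matrix

eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalue order
top = eigvecs[:, -1]                     # direction of maximum variance

projected = Xc @ top   # 1D coordinates: project onto the top component
```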
2.6 Embeddings
Real-world data — words, users, molecules — is often discrete, sparse, and high-dimensional. Embeddings map these entities into dense, low-dimensional vector spaces where standard operations (dot products, distances) become meaningful and where geometry captures semantic relationships.
Word2Vec
Mikolov et al. (2013) trained a shallow network to predict words from context (CBOW) or context from words (skip-gram). The learned vectors exhibited regular structure: vec("king") − vec("man") + vec("woman") ≈ vec("queen"). Embeddings could capture meaning, not just co-occurrence.
Embeddings Everywhere
- NLP: contextualised embeddings from BERT (Devlin et al., 2019) and GPT give different vectors for the same word depending on context.
- Vision: CNNs and ViTs learn image embeddings where similar images are nearby.
- Recommendations: users and items share an embedding space. Recommendations = nearest items to a user's vector.
- Drug discovery: molecules are embedded and properties are predicted from the vectors.
What Makes a Good Embedding?
The geometry must reflect the relationships that matter. Synonyms should be close. Similar users should be close. Similar molecules should be close. This is achieved by training the embedding as part of a larger model, using loss functions that pull similar entities together and push different ones apart (contrastive learning, triplet loss, cross-entropy).
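As one concrete instance, the triplet loss mentioned above can be sketched in NumPy (the margin value and toy vectors are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Push d(anchor, positive) below d(anchor, negative) by at least a margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])    # a similar entity: close to the anchor
n = np.array([-1.0, 0.0])   # a dissimilar entity: far from the anchor

loss = triplet_loss(a, p, n)   # 0 when the margin constraint is satisfied
```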
Dimensionality
Too few dimensions: cannot capture the richness. Too many: overfits and wastes memory. Typical ranges: words 100–1,000; images 512–2,048; recommendations 32–128.
Bias
Embeddings absorb biases from their training data (Bender et al., 2021). Word embeddings trained on web text encode gender and racial stereotypes. These biases propagate to downstream systems. Debiasing techniques exist but the problem is not fully solved. Any practitioner deploying embedding-based systems must understand and mitigate these risks. Chapter 16 returns to this topic.