2.14 Summary

A vector is a list of numbers; equivalently, an arrow from the origin in $\mathbb{R}^n$; equivalently, the encoding of an object as a point in a learned embedding space.

Norms measure size. The Euclidean ($\ell^2$) norm is rotation-invariant and differentiable away from the origin (its square is smooth everywhere), the canonical choice for distances and weight decay. The $\ell^1$ norm encourages sparsity. The $\ell^\infty$ norm appears in adversarial robustness. Cosine similarity ignores magnitude and compares only direction.
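A minimal NumPy sketch of these quantities (the vectors are arbitrary illustrations):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
y = np.array([1.0, 2.0, 2.0])

l2 = np.linalg.norm(x)             # Euclidean norm: sqrt(9 + 16) = 5
l1 = np.linalg.norm(x, 1)          # l1 norm: |3| + |-4| = 7
linf = np.linalg.norm(x, np.inf)   # l-infinity norm: max |x_i| = 4

# Cosine similarity: compare directions, magnitudes cancel.
cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
```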

A matrix is the encoding of a linear transformation in a chosen basis. Matrix multiplication composes transformations. The four fundamental subspaces (column space, null space, row space, left null space) capture what a linear map does and does not do; the rank–nullity theorem is the bookkeeping.
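A small sketch with a deliberately rank-deficient matrix, reading rank, nullity, and a null-space basis off the SVD:

```python
import numpy as np

# 3x4 map whose second row duplicates the first, so the rank drops to 2.
A = np.array([[1., 2., 0., 1.],
              [2., 4., 0., 2.],
              [0., 0., 1., 1.]])

rank = np.linalg.matrix_rank(A)   # dim(column space) = dim(row space) = 2
nullity = A.shape[1] - rank       # rank–nullity: rank + nullity = 4

# Right singular vectors with zero singular value span the null space.
_, s, Vt = np.linalg.svd(A)
null_basis = Vt[rank:].T
assert np.allclose(A @ null_basis, 0)
```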

Determinants measure signed volume scaling; the trace is both the sum of the diagonal entries and the sum of the eigenvalues. Both are similarity invariants.
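A quick numerical check of these identities (the random matrices are illustrative, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
eig = np.linalg.eigvals(A)

# Determinant = product of eigenvalues; trace = sum of eigenvalues.
assert np.isclose(np.linalg.det(A), np.prod(eig).real)
assert np.isclose(np.trace(A), np.sum(eig).real)

# Both survive a change of basis: B = P A P^{-1} is similar to A.
P = rng.standard_normal((4, 4))
B = P @ A @ np.linalg.inv(P)
assert np.isclose(np.linalg.det(A), np.linalg.det(B))
assert np.isclose(np.trace(A), np.trace(B))
```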

Eigenvalues and eigenvectors are the directions a linear map preserves and the factors by which it stretches them. The spectral theorem says real symmetric matrices are diagonalisable in an orthonormal basis, with real eigenvalues. Power iteration computes the dominant eigenvector by repeated multiplication and normalisation.
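A minimal power-iteration sketch; using the Rayleigh quotient for the eigenvalue is one standard choice:

```python
import numpy as np

def power_iteration(A, steps=200, seed=0):
    """Dominant eigenpair via repeated multiplication and normalisation."""
    v = np.random.default_rng(seed).standard_normal(A.shape[0])
    for _ in range(steps):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v   # Rayleigh quotient approximates the eigenvalue

# Symmetric, so the spectral theorem guarantees real eigenvalues
# and an orthonormal eigenbasis.
S = np.array([[2., 1.],
              [1., 3.]])
lam, v = power_iteration(S)
assert np.allclose(S @ v, lam * v, atol=1e-6)
```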

The singular value decomposition extends eigendecomposition to arbitrary rectangular matrices. Geometrically, every linear map is rotation–stretch–rotation. The Eckart–Young theorem says truncating to the largest $k$ singular values gives the best rank-$k$ approximation under Frobenius and spectral norms. PCA is the SVD of the centred data matrix.
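A sketch of truncated SVD and PCA on synthetic data; the Frobenius-error check is the Eckart–Young identity $\lVert \mathbf{X} - \mathbf{X}_k \rVert_F = \sqrt{\sum_{i>k} \sigma_i^2}$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
k = 5

# Best rank-k approximation: keep the k largest singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]

# Frobenius error equals the root-sum-square of the discarded values.
assert np.isclose(np.linalg.norm(X - X_k), np.sqrt(np.sum(s[k:] ** 2)))

# PCA: SVD of the centred data; rows of Vt are principal directions.
Xc = X - X.mean(axis=0)
_, _, Vt_pca = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt_pca[:k].T   # projection onto the top-k components
```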

Matrix calculus follows a small set of rules (the gradient of a linear function, of a quadratic form, of a squared residual) applied through the chain rule. Backpropagation is matrix calculus on the computation graph of a neural network.
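A sketch of one such rule, the squared residual $\nabla_{\mathbf{w}} \lVert \mathbf{X}\mathbf{w} - \mathbf{y} \rVert^2 = 2\mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y})$, verified against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
w = rng.standard_normal(3)

f = lambda w: np.sum((X @ w - y) ** 2)   # squared residual
grad = 2 * X.T @ (X @ w - y)             # its closed-form gradient

# Central finite difference in the first coordinate should agree.
eps = 1e-6
e0 = np.array([eps, 0., 0.])
fd = (f(w + e0) - f(w - e0)) / (2 * eps)
assert np.isclose(fd, grad[0], rtol=1e-4)
```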

Tensors generalise matrices to higher rank (more axes). Broadcasting and einsum let us write multi-axis operations cleanly. Mind the contraction order: it can change cost by orders of magnitude.
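A sketch of why contraction order matters; `optimize=True` lets einsum pick the order for us:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 2))    # tall and thin
B = rng.standard_normal((2, 200))    # short and wide
v = rng.standard_normal(200)

# Same product, two orders: (A @ B) materialises a 200 x 200 matrix;
# A @ (B @ v) only ever touches length-2 and length-200 vectors.
slow = (A @ B) @ v
fast = A @ (B @ v)
assert np.allclose(slow, fast)

# einsum names the axes and, with optimize=True, chooses a cheap order.
fast2 = np.einsum('ik,kj,j->i', A, B, v, optimize=True)
assert np.allclose(fast2, fast)
```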

Numerical stability is not optional. Avoid forming $\mathbf{X}^\top \mathbf{X}$ where you can use QR or SVD. Subtract the maximum before exponentiating in softmax. Track the condition number when you suspect a problem is ill-conditioned.
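A sketch of two of these habits, the max-shifted softmax and SVD-based least squares instead of the normal equations:

```python
import numpy as np

def softmax(z):
    """Shift by the max before exponentiating so exp never overflows."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

print(softmax(np.array([1000., 1001., 1002.])))  # naive exp(1000) overflows

# Least squares via lstsq (SVD-based) rather than solving the normal
# equations X^T X w = X^T y, which squares the condition number.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.standard_normal(100)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.cond(X))   # large values warn of ill-conditioning
```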

Hardware is built for matrix multiplication. Tensor cores reward mixed-precision computation in fp16, bf16, or fp8. Memory traffic, not arithmetic, is the bottleneck for most non-matmul kernels. Distributed training shards the same linear algebra across many devices.
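A back-of-envelope sketch of arithmetic intensity (FLOPs per byte moved), assuming fp16 at two bytes per element; the numbers are illustrative, not measured:

```python
n = 4096
bytes_per_elem = 2                        # fp16

# n x n matmul: ~2n^3 FLOPs against ~3n^2 elements of traffic.
matmul_intensity = (2 * n**3) / (3 * n**2 * bytes_per_elem)
print(f"matmul: {matmul_intensity:.0f} FLOPs/byte")         # ~1365: compute-bound

# Elementwise add: n^2 FLOPs against the same ~3n^2 elements of traffic.
elemwise_intensity = n**2 / (3 * n**2 * bytes_per_elem)
print(f"elementwise: {elemwise_intensity:.2f} FLOPs/byte")  # ~0.17: memory-bound
```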
