2.13 How this chapter connects forward
The mathematics of this chapter reappears throughout the book, often without commentary:
- Probability and statistics (Chapters 4 and 5). Covariance matrices are symmetric positive semi-definite; their eigendecompositions diagonalise multivariate Gaussians. The Mahalanobis distance is a quadratic form. The Fisher information matrix is the Hessian of the negative log-likelihood. The Cramér–Rao lower bound is a statement about its inverse. (A covariance and Mahalanobis sketch appears after this list.)
- Machine learning fundamentals (Chapters 6, 7, and 8). Linear regression is projection onto the column space of the design matrix (sketched in code after this list). Logistic regression passes a linear combination of the inputs through a sigmoid. Support vector machines maximise a margin defined by an $\ell^2$ norm; their dual is a quadratic program. Kernel methods replace explicit feature vectors with positive-definite Gram matrices.
- Neural networks (Chapters 9, 11–14). Every layer is a matrix multiplication or a generalised linear operator (convolution, attention) followed by a non-linearity. Backpropagation is the chain rule for matrix calculus; a one-layer sketch follows this list. Batch normalisation, layer normalisation, weight initialisation schemes, and gradient clipping all rest on the linear-algebra ideas above.
- Attention and Transformers (Chapter 13). Self-attention is $\mathrm{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d})\,\mathbf{V}$ (sketched in code after this list). The query and key matrices project tokens into a space where dot products measure relevance. Multi-head attention concatenates several such projections, a block-matrix decomposition.
- Generative models (Chapter 14). Variational autoencoders parametrise Gaussian families with mean and covariance matrices. Normalising flows build invertible transformations whose Jacobian determinants are tractable. Diffusion models apply repeated linear noise operators (see the diffusion sketch after this list).
- Modern systems (Chapter 15). LoRA fine-tuning adds a learned rank-$r$ correction $\mathbf{B}\mathbf{A}$ to a frozen weight matrix (see the final sketch after this list). Mixture of experts routes tokens through different block-diagonal weight matrices. Quantisation stores weights in coarser numerical representations.
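The sketches below make several of these connections concrete. Each is a minimal NumPy illustration with made-up data and hypothetical variable names, not the code developed in the later chapters. First, the covariance eigendecomposition and the Mahalanobis quadratic form:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # 500 samples, 3 features (made-up data)
mu = X.mean(axis=0)                      # sample mean
Sigma = np.cov(X, rowvar=False)          # symmetric positive semi-definite covariance

# Eigendecomposition Sigma = V diag(lam) V^T diagonalises the Gaussian model.
lam, V = np.linalg.eigh(Sigma)

# Mahalanobis distance of a point x: sqrt((x - mu)^T Sigma^{-1} (x - mu)),
# evaluated through the eigendecomposition instead of an explicit inverse.
x = rng.normal(size=3)
z = V.T @ (x - mu)                       # coordinates in the eigenbasis
d2 = np.sum(z**2 / lam)                  # the quadratic form, now diagonal
print(np.sqrt(d2))
```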
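Next, linear regression as projection: the least-squares fit leaves a residual orthogonal to the column space of the design matrix. The matrices here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 4))    # design matrix (random placeholder)
y = rng.normal(size=100)         # targets (random placeholder)

# Least-squares fit: solve the normal equations A^T A w = A^T y.
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# The fitted values are the orthogonal projection of y onto the column space of A,
# so the residual is orthogonal to every column of A (up to rounding error).
y_hat = A @ w
print(np.allclose(A.T @ (y - y_hat), 0.0, atol=1e-8))   # True
```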
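A one-layer network and its gradients by hand, assuming a ReLU non-linearity and a toy loss; this is a sketch of the chain-rule bookkeeping, not a framework's autograd.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 10))       # a batch of 32 inputs
W = rng.normal(size=(10, 5)) * 0.1  # weights
b = np.zeros(5)                     # biases

# Forward pass: a layer is a matrix multiplication followed by a non-linearity.
Z = X @ W + b
H = np.maximum(Z, 0.0)              # ReLU
loss = 0.5 * np.sum(H ** 2)         # a toy loss, just to have something to differentiate

# Backward pass: backpropagation is the chain rule applied to these matrix operations.
dH = H                              # d loss / d H
dZ = dH * (Z > 0)                   # through the ReLU
dW = X.T @ dZ                       # d loss / d W
db = dZ.sum(axis=0)                 # d loss / d b
print(dW.shape, db.shape)           # (10, 5) (5,)
```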
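Single-head self-attention exactly as in the formula above; the projection matrices are random stand-ins.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # pairwise relevance of tokens
    scores -= scores.max(axis=-1, keepdims=True)     # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 8))                          # 5 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8)
```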
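A diffusion-style forward noising process as repeated linear noise operators; the step count and retention factor are illustrative choices, not the schedule of any particular model.

```python
import numpy as np

rng = np.random.default_rng(4)
x0 = rng.uniform(-3.0, 3.0, size=512)   # a data vector (made-up)
x = x0.copy()

# Forward noising: each step applies a linear shrinkage and adds Gaussian noise.
alpha = 0.95                            # per-step retention factor (illustrative)
for _ in range(200):
    x = np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * rng.normal(size=x.shape)

# After many steps the original signal is essentially gone and the variance is about 1.
print(np.var(x), abs(np.corrcoef(x0, x)[0, 1]))
```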
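Finally, the LoRA idea of a rank-$r$ correction $\mathbf{B}\mathbf{A}$ added to a frozen weight matrix; the dimensions and initialisation scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
d_out, d_in, r = 64, 64, 4                  # r is much smaller than the weight dimensions

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight (stand-in)
A = rng.normal(size=(r, d_in)) * 0.01       # trainable down-projection
B = rng.normal(size=(d_out, r)) * 0.01      # trainable up-projection
# (In practice B is usually initialised to zero so training starts at the frozen model;
#  it is random here only so the rank of the correction is visible.)

# Effective weight during fine-tuning: the frozen matrix plus a rank-r correction B A.
W_eff = W + B @ A
print(np.linalg.matrix_rank(B @ A))         # at most r; here 4
```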
Once you internalise the picture in this chapter (vectors as embeddings, matrices as linear maps, eigenvalues as preserved directions, SVD as rotation–stretch–rotation, gradients as directions of steepest ascent), almost everything else is a variation on these themes plus carefully chosen non-linearities.