82 entries
Visualisations
Narrated animations and interactive pieces, organised by chapter. Each opens to a page with the visualisation, transcript, and related glossary, people and references.
Chapter 1: What Is AI?
- Seven waves of AI. From Dartmouth in 1956, through expert systems and statistical ML, to deep learning and large language models.
- The perceptron learning rule. A line in 2D moves to separate red and blue points, one mistake at a time (code sketch below).
- The Turing test. An interrogator chats with two hidden players, one human, one machine, and decides which is which.
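A minimal NumPy sketch of the perceptron's mistake-driven update; the two-blob toy data, random seed, and epoch count are illustrative assumptions, not taken from the animation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two linearly separable blobs labelled -1 and +1.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

w, b = np.zeros(2), 0.0
for _ in range(100):                  # passes over the data
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:    # misclassified point
            w += yi * xi              # nudge the line toward it
            b += yi
            mistakes += 1
    if mistakes == 0:                 # converged: every point on the right side
        break

print("weights", w, "bias", b)
```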
Chapter 2: Linear Algebra
- A matrix is a linear transformation. A 2×2 matrix morphs the plane through stretch, reflection, shear, and rotation.
- Eigenvectors are the directions a matrix doesn't rotate. While most vectors twist when a matrix is applied, eigenvectors only stretch along their own axis.
- SVD finds the best low-rank approximation. Keep the top singular values and a matrix becomes a sum of rank-one terms (code sketch below).
- The dot product as projection. Drop a perpendicular from a onto b: when b has unit length, the dot product is the signed length of the shadow.
- The projection of one vector onto another. Drop a perpendicular from one vector onto another to find its shadow; its signed length is the dot product divided by the other vector's length.
- Vector norms shape the unit ball. L1 makes a diamond, L2 a circle, L-infinity a square. Each defines distance differently.
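A minimal NumPy sketch of truncated SVD as low-rank approximation; the 8×6 random matrix and the ranks compared are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in (1, 2, 4, 6):
    # Keep the top-k singular values: a sum of k rank-one terms.
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    err = np.linalg.norm(A - A_k)    # Frobenius error of the approximation
    print(f"rank {k}: error {err:.3f}")
```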
Chapter 3: Calculus
- A polynomial that hugs the curve. Order zero is a constant, order one is a tangent line, order two is a parabola. Each adds a derivative.
- Gradient descent on a quadratic bowl. A ball rolls down a quadratic surface as the learning rate changes (code sketch below).
- Newton's method finds roots through tangent lines. Approximate a curve by its tangent at a guess and use the tangent's root as the next guess.
- Partial derivatives slice a surface. Hold one variable fixed and take the slope along the other. Two partials make the gradient.
- The chain rule on a computation graph. Gradients multiply backward along the edges of a small graph from output to input.
- The gradient as a vector field. Tiny arrows on a contour plot point uphill, the steepest ascent direction at every location.
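A minimal sketch of gradient descent on a quadratic bowl; the bowl f(x, y) = x² + 10y², the starting point, and the learning rate are illustrative assumptions.

```python
import numpy as np

# Quadratic bowl f(x, y) = x^2 + 10*y^2, with gradient (2x, 20y).
def grad(p):
    return np.array([2 * p[0], 20 * p[1]])

p = np.array([4.0, 1.0])      # starting point
lr = 0.08                     # try 0.11 to watch the steep axis oscillate
for _ in range(50):
    p = p - lr * grad(p)      # step downhill against the gradient
print("final point", np.round(p, 4))   # approaches the minimum at (0, 0)
```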
Chapter 4: Probability
- A zoo of distributions. Bernoulli, Gaussian, exponential and beta side by side, each shaped by its own parameters.
- Bayesian update, prior to posterior. A prior meets the likelihood and the posterior emerges between them (code sketch below).
- Correlation measures linear association. Pearson's r runs from minus one through zero to plus one. Visualise scatterplots at each value.
- Joint, marginal, and conditional distributions. A joint distribution lives over two axes. Marginalise to one axis, condition on a slice.
- Sums of any distribution become Gaussian. Roll one die, then two, then ten. The distribution of the average converges to a bell curve.
- The 68-95-99.7 rule. A Gaussian's tails fall off so fast that three standard deviations cover virtually all the probability.
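A minimal sketch of a conjugate Bayesian update, using a Beta prior on a coin's heads probability; the prior parameters and the flip data are illustrative assumptions.

```python
# Beta-Binomial conjugate update: prior Beta(a, b), data = coin flips.
a, b = 2, 2                          # prior gently centred on a fair coin
flips = [1, 1, 0, 1, 1, 1, 0, 1]     # 1 = heads (made-up data)

heads = sum(flips)
tails = len(flips) - heads
a_post, b_post = a + heads, b + tails   # posterior is Beta(a + heads, b + tails)

print("prior mean    ", a / (a + b))
print("posterior mean", a_post / (a_post + b_post))   # pulled toward the data
```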
Chapter 5: Statistics
- Bootstrap: resample with replacement. Build a sampling distribution from one dataset by drawing thousands of resamples (code sketch below).
- Confidence intervals catch the true mean. Repeated samples produce intervals; about ninety-five percent of them cover the unknown population mean.
- Maximum likelihood: peak of the likelihood curve. Sweep the parameter, plot the likelihood, take the maximum.
- Reject when the test statistic falls in the tail. Under the null hypothesis, the statistic has a known distribution. Extreme values lie in a small tail.
- The sampling distribution emerges. Repeated samples from a non-normal population produce approximately Gaussian sample means.
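A minimal NumPy sketch of the percentile bootstrap for a mean; the exponential data, sample size, and resample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=50)       # the one observed dataset

# Resample with replacement, thousands of times, recording the mean each time.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5000)
])

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean {data.mean():.2f}, 95% bootstrap interval ({lo:.2f}, {hi:.2f})")
```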
Chapter 6: ML Fundamentals
- K-fold cross-validation. Split the training set into k folds, train on k minus one, validate on the remaining fold, rotate (code sketch below).
- Lasso vs Ridge: regularisation paths. As the penalty grows, Lasso sets coefficients to zero one by one; Ridge shrinks all together.
- Learning curves diagnose under- and over-fitting. Plot training and validation error against dataset size. The shapes reveal what's wrong.
- Overfitting and early stopping. Training loss keeps falling. Validation loss bottoms out, then rises. The gap is overfitting.
- The bias-variance tradeoff. Underfitting is high bias, overfitting is high variance. The best model balances the two.
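A minimal NumPy sketch of k-fold cross-validation around a least-squares fit; the synthetic regression data and k = 5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
folds = np.array_split(rng.permutation(len(X)), k)   # shuffled index folds
errors = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit on k-1 folds, evaluate on the held-out fold, rotate.
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
print("per-fold MSE:", np.round(errors, 4))
```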
Chapter 7: Supervised Learning
- An ensemble of decision trees votes. Each tree sees a different bootstrap sample and a different random feature subset, then they vote.
- K-nearest neighbours and Voronoi tessellation. K=1 carves the plane into Voronoi cells around training points; the boundary follows the cells.
- Logistic regression finds a boundary. A separating line learns its place by minimising cross-entropy on labelled points (code sketch below).
- Maximum margin: the widest gap that separates the classes. Among all separating lines, the SVM picks the one with the largest cushion on either side.
- The logistic curve maps any number to a probability. A linear score, squashed by a sigmoid, becomes a probability between zero and one.
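A minimal NumPy sketch of logistic regression trained by gradient descent on the cross-entropy loss; the two-blob data, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # squashes any score into (0, 1)

w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of the mean cross-entropy
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print(f"boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```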
Chapter 8: Unsupervised Learning
- A dendrogram emerges from agglomerative clustering. Start with each point its own cluster, merge the closest pair, repeat.
- Gaussian mixture models fit clusters via EM. Soft assignments and re-fitted Gaussians alternate until they settle.
- k-means clustering, iteration by iteration. The alternating assign-and-update loop settles three centres onto the clusters (code sketch below).
- Principal component analysis. Find the axis along which the data spreads most, then the next perpendicular axis, then the next.
- t-SNE unfolds high-dimensional clusters into 2D. Pairwise similarities in many dimensions become 2D positions that preserve neighbourhoods.
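A minimal NumPy sketch of the k-means assign-and-update loop; the three synthetic blobs and the random initialisation from data points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in [(-2, 0), (2, 0), (0, 3)]])

k = 3
centres = X[rng.choice(len(X), k, replace=False)]   # initialise at random points
for _ in range(20):
    # Assign: each point joins its nearest centre.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update: each centre moves to the mean of its points.
    new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centres, centres):           # assignments have settled
        break
    centres = new_centres
print(np.round(centres, 2))
```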
Chapter 9: Neural Networks
- A single hidden layer can fit any continuous function. Add hidden units one by one and watch the approximation tighten.
- Gradients flow backward through the layers. Forward pass produces a loss. Reverse pass propagates the gradient through every layer, multiplying local Jacobians.
- One neuron, forward pass and backward pass. Forward pass produces a value; backprop sends gradients back through the same graph (code sketch below).
- Sigmoid, tanh, ReLU, GELU side by side. Each passes a real number through a nonlinear curve. The choice shapes how gradients flow.
- The XOR problem broke single-layer perceptrons. No single line separates XOR. A second layer fixes it instantly.
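A minimal sketch of one sigmoid neuron's forward pass and hand-written backward pass; the input, weights, target, and squared-error loss are illustrative assumptions.

```python
import numpy as np

x = np.array([0.5, -1.0])     # input
w = np.array([0.8, 0.3])      # weights
b = 0.1                       # bias
target = 1.0

# Forward pass.
z = w @ x + b
a = 1 / (1 + np.exp(-z))            # sigmoid activation
loss = 0.5 * (a - target) ** 2      # squared-error loss

# Backward pass: chain rule, one local derivative per edge of the graph.
dloss_da = a - target
da_dz = a * (1 - a)                 # derivative of the sigmoid
grad_w = dloss_da * da_dz * x       # dz/dw = x
grad_b = dloss_da * da_dz           # dz/db = 1

print("loss", round(float(loss), 4))
print("grad_w", np.round(grad_w, 4), "grad_b", round(float(grad_b), 4))
```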
Chapter 10: Training & Optimisation
- Adam: per-parameter adaptive learning rates. Adam keeps a moving average of the gradient and the squared gradient; their ratio scales each parameter's step (code sketch below).
- Batch normalisation centres and rescales activations. Subtract the batch mean, divide by the batch standard deviation, scale and shift back.
- Dropout zeros out a random subset of activations each forward pass. Half of the neurons are silenced randomly, forcing the network to spread information across many paths.
- Learning rate schedules. Warmup, then cosine decay: the learning rate's path through training matters as much as its peak.
- Momentum lets the ball coast through narrow valleys. Plain SGD oscillates across the walls; momentum smooths the path along the valley floor.
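A minimal NumPy sketch of the Adam update rule applied to a toy quadratic objective; the decay rates shown are the commonly quoted defaults, while the objective and learning rate are illustrative assumptions.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # moving average of the gradient
    v = b2 * v + (1 - b2) * grad ** 2      # moving average of the squared gradient
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter step size
    return theta, m, v

# Minimise f(theta) = sum(theta^2); its gradient is 2 * theta.
theta = np.array([3.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(np.round(theta, 6))    # close to the minimum at (0, 0)
```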
Chapter 11: CNNs
- 2D convolution, kernel slides over input. A 3×3 kernel sweeps a 9×9 input, filling in a feature map cell by cell (code sketch below).
- From LeNet to ResNet: depth grows, accuracy follows. LeNet-5, AlexNet, VGG, GoogLeNet, ResNet. Each year deeper, with new tricks.
- Max pooling and average pooling. A two by two patch reduces to one number. Max takes the largest, average takes the mean.
- Stacking convolutions grows the receptive field. A pixel in layer three sees a much bigger patch of the input than a pixel in layer one.
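A minimal sketch of a 2D convolution written with explicit loops (strictly, the cross-correlation that deep-learning layers compute); the random 9×9 input and the edge-detecting kernel are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(9, 9))            # 9x9 input, as in the animation
kernel = np.array([[1, 0, -1],             # 3x3 vertical-edge kernel
                   [1, 0, -1],
                   [1, 0, -1]])

kh, kw = kernel.shape
out = np.zeros((9 - kh + 1, 9 - kw + 1))   # 7x7 feature map, no padding
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = image[i:i + kh, j:j + kw]
        out[i, j] = np.sum(patch * kernel)  # one feature-map cell per position
print(out.shape)
```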
Chapter 12: Sequence Models
- An RNN is the same cell, applied at every time step. The recurrence unrolled across time becomes a deep feed-forward network with shared weights.
- Attention as alignment in seq2seq. An encoder produces hidden states; the decoder weights them dynamically per output token.
- LSTM cell and the constant-error carousel. The cell state's additive path, held open by the forget gate, keeps gradients alive over long sequences.
- Multiply many small derivatives and the gradient vanishes. Repeated multiplication of fractions less than one drives the gradient toward zero in deep networks (code sketch below).
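A minimal sketch of the vanishing-gradient effect: one local derivative per layer, each below one, multiplied together. The layer count and the range of the derivatives are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
local_derivs = rng.uniform(0.1, 0.9, size=50)   # 50 layers, each derivative < 1

# Backprop multiplies one local derivative per layer; fractions below one
# shrink the product exponentially with depth.
grad = 1.0
for d in local_derivs:
    grad *= d
print(f"gradient after 50 layers: {grad:.2e}")   # vanishingly small
```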
Chapter 13: Attention & Transformers
- Causal masking forces a transformer to look only at the past. An upper-triangular mask sets future attention scores to minus infinity.
- Multiple attention heads in parallel. Each head learns a different similarity pattern. Their outputs concatenate and project to one tensor.
- Self-attention as Q–K–V dot products. Query, key and value vectors produce an attention matrix over four tokens (code sketch below).
- Sinusoidal positional encodings. Sines and cosines of many frequencies tag each position with a unique fingerprint.
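A minimal NumPy sketch of single-head self-attention over four tokens, with the causal mask included; the model width and the random projection matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                                   # four tokens, width 8
X = rng.normal(size=(T, d))                   # token representations
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                 # 4x4 scaled dot-product scores
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf                        # causal: no attending to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
out = weights @ V                             # weighted mix of value vectors
print(np.round(weights, 2))                   # lower-triangular attention matrix
```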
Chapter 14: Generative Models
- Diffusion sampling, from noise to image. Start at pure Gaussian noise, denoise step by step, and structure emerges.
- GAN training is a two-player game. The generator tries to fool the discriminator. The discriminator tries not to be fooled. Equilibrium is realistic samples.
- Latent space interpolation in a VAE. Walk a straight line between two latent codes and the decoded image morphs smoothly.
- Score matching learns the gradient of log density. The score points uphill on the data density; matching it lets you sample by noisy gradient ascent.
- The forward diffusion process: adding noise step by step. An image gradually corrupted by Gaussian noise becomes pure static (code sketch below).
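A minimal NumPy sketch of the forward (noising) diffusion process, using the closed-form jump to step t; the linear beta schedule and the 8×8 stand-in image are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(0, 1, size=(8, 8))           # stand-in "image"

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bar = np.cumprod(1 - betas)             # cumulative signal retention

def q_sample(x0, t):
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise

for t in (0, 250, 999):
    xt = q_sample(x0, t)
    print(f"t={t:4d}  remaining signal weight {np.sqrt(alpha_bar[t]):.3f}")
```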
Chapter 15: Modern AI
- Chain-of-thought prompting. Asking a model to think step by step before answering improves accuracy on multi-step problems.
- Few-shot examples teach the model in the prompt. A handful of input-output pairs in the prompt steer a frozen model to a new task.
- Inside a transformer block. Multi-head attention, a feed-forward network, residual connections and layer norm: the building block of every modern LLM.
- Mixture of experts: a router selects k specialists. Each token activates only a few experts; the network grows in capacity without growing in compute per token (code sketch below).
- Scaling laws: compute, data, and parameters jointly determine loss. Plot loss against compute on a log-log scale and you get a clean line.
- Test-time compute scaling. Accuracy as a function of inference budget for three model strengths.
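A minimal NumPy sketch of top-k routing in a mixture-of-experts layer for a single token; the expert count, k = 2, and the random "experts" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
token = rng.normal(size=d)                       # one token's hidden state
W_router = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy expert weights

logits = token @ W_router                        # router scores every expert
top_k = np.argsort(logits)[-k:]                  # keep the k best specialists
gates = np.exp(logits[top_k])
gates /= gates.sum()                             # softmax over the chosen k

# Only the selected experts run; the others cost nothing for this token.
output = sum(g * (token @ experts[i]) for g, i in zip(gates, top_k))
print("experts used:", top_k, "gate weights:", np.round(gates, 2))
```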
Chapter 16: Ethics & Safety
- An adversarial example: a tiny perturbation flips the prediction. Add an imperceptible pattern to a panda image; the network now sees a gibbon with high confidence.
- Demographic parity vs equalised odds. Different fairness criteria pull a classifier in different directions, and they cannot all hold at once (code sketch below).
- Mesa-optimisation: an objective hidden inside a learned model. The base optimiser trains a model that is itself an optimiser, with its own learned objective.
- The fairness-accuracy frontier. Push fairness up and accuracy often drops. The Pareto frontier shows the best available trade-offs.
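A minimal NumPy sketch of the quantities behind the two criteria for a toy classifier: per-group positive rates (demographic parity) and per-group true and false positive rates (equalised odds). The simulated data and the classifier's group-dependent threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)          # protected attribute: 0 or 1
y = rng.integers(0, 2, n)              # true outcome
# A toy classifier that is slightly harsher on group 1.
y_hat = (rng.random(n) < 0.6 * y + 0.2 - 0.05 * group).astype(int)

for g in (0, 1):
    sel = group == g
    pos_rate = y_hat[sel].mean()                  # P(y_hat = 1 | group)
    tpr = y_hat[sel & (y == 1)].mean()            # P(y_hat = 1 | y = 1, group)
    fpr = y_hat[sel & (y == 0)].mean()            # P(y_hat = 1 | y = 0, group)
    print(f"group {g}: positive rate {pos_rate:.3f}, TPR {tpr:.3f}, FPR {fpr:.3f}")

# Demographic parity compares the positive rates across groups;
# equalised odds compares TPR and FPR across groups.
```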
Chapter 17: Applications
- AlphaFold predicts protein structure from sequence. A chain of amino acids, drawn from twenty types, folds into a unique three-dimensional ribbon, predicted by attention.
- AlphaGo's Monte Carlo tree search. MCTS expands promising moves, simulates rollouts, and backs up scores.
- Protein folding, sequence to structure. An extended chain of residues collapses into a compact 3D structure.
- Two-tower recommendation: users and items in shared embedding space. User tower and item tower learn embeddings; relevance is the dot product (code sketch below).
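A minimal NumPy sketch of two-tower retrieval scoring: user and item embeddings in a shared space, ranked by dot product. In a real system each tower is a trained neural network; the random embeddings, dimensionality, and catalogue size here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
user_emb = rng.normal(size=d)            # output of the user tower for one user
item_emb = rng.normal(size=(1000, d))    # outputs of the item tower for the catalogue

scores = item_emb @ user_emb             # relevance = dot product in the shared space
top_items = np.argsort(scores)[-5:][::-1]
print("top recommended item ids:", top_items)
```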
This site is currently in Beta. Contact: Chris Paton
AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).