82 entries
Visualisations
Narrated animations and interactive pieces, organised by chapter. Each opens to a page with the visualisation, transcript, and related glossary, people and references.
Chapter 1: What Is AI?
- Seven waves of AI. From Dartmouth in 1956, through expert systems and statistical ML, to deep learning and large language models.
- The perceptron learning rule. A line in 2D moves to separate red and blue points, one mistake at a time (code sketch below).
- The Turing test. An interrogator chats with two hidden players, one human, one machine, and decides which is which.
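A minimal NumPy sketch of the perceptron's mistake-driven update; the two-blob toy data, random seed, and epoch count are illustrative assumptions, not taken from the animation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two linearly separable blobs labelled -1 and +1.
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

w, b = np.zeros(2), 0.0
for _ in range(100):                  # passes over the data
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:    # misclassified point
            w += yi * xi              # nudge the line toward it
            b += yi
            mistakes += 1
    if mistakes == 0:                 # converged: every point on the right side
        break

print("weights", w, "bias", b)
```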
Chapter 2: Linear Algebra
- A matrix is a linear transformation. A 2×2 matrix morphs the plane through stretch, reflection, shear, and rotation.
- Eigenvectors are the directions a matrix doesn't rotate. While most vectors twist when a matrix is applied, eigenvectors only stretch along their own axis.
- SVD finds the best low-rank approximation. Keep the top singular values and a matrix becomes a sum of rank-one terms (code sketch below).
- The dot product as projection. Drop a perpendicular from a onto b: when b has unit length, the dot product is the signed length of the shadow.
- The projection of one vector onto another. Drop a perpendicular from one vector onto another to find its shadow; its signed length is the dot product divided by the other vector's length.
- Vector norms shape the unit ball. L1 makes a diamond, L2 a circle, L-infinity a square. Each defines distance differently.
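A minimal NumPy sketch of truncated SVD as low-rank approximation; the 8×6 random matrix and the ranks compared are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in (1, 2, 4, 6):
    # Keep the top-k singular values: a sum of k rank-one terms.
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    err = np.linalg.norm(A - A_k)    # Frobenius error of the approximation
    print(f"rank {k}: error {err:.3f}")
```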
Chapter 3: Calculus
- A polynomial that hugs the curve. Order zero is a constant, order one is a tangent line, order two is a parabola. Each adds a derivative.
- Gradient descent on a quadratic bowl. A ball rolls down a quadratic surface as the learning rate changes (code sketch below).
- Newton's method finds roots through tangent lines. Approximate a curve by its tangent at a guess and use the tangent's root as the next guess.
- Partial derivatives slice a surface. Hold one variable fixed and take the slope along the other. Two partials make the gradient.
- The chain rule on a computation graph. Gradients multiply backward along the edges of a small graph from output to input.
- The gradient as a vector field. Tiny arrows on a contour plot point uphill, the steepest ascent direction at every location.
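A minimal sketch of gradient descent on a quadratic bowl; the bowl f(x, y) = x² + 10y², the starting point, and the learning rate are illustrative assumptions.

```python
import numpy as np

# Quadratic bowl f(x, y) = x^2 + 10*y^2, with gradient (2x, 20y).
def grad(p):
    return np.array([2 * p[0], 20 * p[1]])

p = np.array([4.0, 1.0])      # starting point
lr = 0.08                     # try 0.11 to watch the steep axis oscillate
for _ in range(50):
    p = p - lr * grad(p)      # step downhill against the gradient
print("final point", np.round(p, 4))   # approaches the minimum at (0, 0)
```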
Chapter 4: Probability
- A zoo of distributions. Bernoulli, Gaussian, exponential and beta side by side, each shaped by its own parameters.
- Bayesian update, prior to posterior. A prior meets the likelihood and the posterior emerges between them (code sketch below).
- Correlation measures linear association. Pearson's r runs from minus one through zero to plus one. Visualise scatterplots at each value.
- Joint, marginal, and conditional distributions. A joint distribution lives over two axes. Marginalise to one axis, condition on a slice.
- Sums of any distribution become Gaussian. Roll one die, then two, then ten. The distribution of the average converges to a bell curve.
- The 68-95-99.7 rule. A Gaussian's tails fall off so fast that three standard deviations cover virtually all the probability.
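A minimal sketch of a conjugate Bayesian update, using a Beta prior on a coin's heads probability; the prior parameters and the flip data are illustrative assumptions.

```python
# Beta-Binomial conjugate update: prior Beta(a, b), data = coin flips.
a, b = 2, 2                          # prior gently centred on a fair coin
flips = [1, 1, 0, 1, 1, 1, 0, 1]     # 1 = heads (made-up data)

heads = sum(flips)
tails = len(flips) - heads
a_post, b_post = a + heads, b + tails   # posterior is Beta(a + heads, b + tails)

print("prior mean    ", a / (a + b))
print("posterior mean", a_post / (a_post + b_post))   # pulled toward the data
```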
Chapter 5: Statistics
- Bootstrap: resample with replacement. Build a sampling distribution from one dataset by drawing thousands of resamples (code sketch below).
- Confidence intervals catch the true mean. Repeated samples produce intervals; about ninety-five percent of them cover the unknown population mean.
- Maximum likelihood: peak of the likelihood curve. Sweep the parameter, plot the likelihood, take the maximum.
- Reject when the test statistic falls in the tail. Under the null hypothesis, the statistic has a known distribution. Extreme values lie in a small tail.
- The sampling distribution emerges. Repeated samples from a non-normal population produce approximately Gaussian sample means.
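A minimal NumPy sketch of the percentile bootstrap for a mean; the exponential data, sample size, and resample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=50)       # the one observed dataset

# Resample with replacement, thousands of times, recording the mean each time.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5000)
])

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean {data.mean():.2f}, 95% bootstrap interval ({lo:.2f}, {hi:.2f})")
```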
Chapter 6: ML Fundamentals
- K-fold cross-validation. Split the training set into k folds, train on k minus one, validate on the remaining fold, rotate (code sketch below).
- Lasso vs Ridge: regularisation paths. As the penalty grows, Lasso sets coefficients to zero one by one; Ridge shrinks all together.
- Learning curves diagnose under- and over-fitting. Plot training and validation error against dataset size. The shapes reveal what's wrong.
- Overfitting and early stopping. Training loss keeps falling. Validation loss bottoms out, then rises. The gap is overfitting.
- The bias-variance tradeoff. Underfitting is high bias, overfitting is high variance. The best model balances the two.
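A minimal NumPy sketch of k-fold cross-validation around a least-squares fit; the synthetic regression data and k = 5 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

k = 5
folds = np.array_split(rng.permutation(len(X)), k)   # shuffled index folds
errors = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Fit on k-1 folds, evaluate on the held-out fold, rotate.
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
print("per-fold MSE:", np.round(errors, 4))
```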
Chapter 7: Supervised Learning
- An ensemble of decision trees votes. Each tree sees a different bootstrap sample and a different random feature subset, then they vote.
- K-nearest neighbours and Voronoi tessellation. K=1 carves the plane into Voronoi cells around training points; the boundary follows the cells.
- Logistic regression finds a boundary. A separating line learns its place by minimising cross-entropy on labelled points (code sketch below).
- Maximum margin: the widest gap that separates the classes. Among all separating lines, the SVM picks the one with the largest cushion on either side.
- The logistic curve maps any number to a probability. A linear score, squashed by a sigmoid, becomes a probability between zero and one.
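A minimal NumPy sketch of logistic regression trained by gradient descent on the cross-entropy loss; the two-blob data, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(+1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))      # squashes any score into (0, 1)

w, b = np.zeros(2), 0.0
for _ in range(2000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of the mean cross-entropy
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

print(f"boundary: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")
```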
Chapter 8: Unsupervised Learning
- A dendrogram emerges from agglomerative clustering. Start with each point its own cluster, merge the closest pair, repeat.
- Gaussian mixture models fit clusters via EM. Soft assignments and re-fitted Gaussians alternate until they settle.
- k-means clustering, iteration by iteration. The alternating assign-and-update loop settles three centres onto the clusters (code sketch below).
- Principal component analysis. Find the axis along which the data spreads most, then the next perpendicular axis, then the next.
- t-SNE unfolds high-dimensional clusters into 2D. Pairwise similarities in many dimensions become 2D positions that preserve neighbourhoods.
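A minimal NumPy sketch of the k-means assign-and-update loop; the three synthetic blobs and the random initialisation from data points are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in [(-2, 0), (2, 0), (0, 3)]])

k = 3
centres = X[rng.choice(len(X), k, replace=False)]   # initialise at random points
for _ in range(20):
    # Assign: each point joins its nearest centre.
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update: each centre moves to the mean of its points.
    new_centres = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centres, centres):           # assignments have settled
        break
    centres = new_centres
print(np.round(centres, 2))
```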
Chapter 9: Neural Networks
- A single hidden layer can fit any continuous function. Add hidden units one by one and watch the approximation tighten.
- Gradients flow backward through the layers. Forward pass produces a loss. Reverse pass propagates the gradient through every layer, multiplying local Jacobians.
- One neuron, forward pass and backward pass. Forward pass produces a value; backprop sends gradients back through the same graph (code sketch below).
- Sigmoid, tanh, ReLU, GELU side by side. Each passes a real number through a nonlinear curve. The choice shapes how gradients flow.
- The XOR problem broke single-layer perceptrons. No single line separates XOR. A second layer fixes it instantly.
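A minimal sketch of one sigmoid neuron's forward pass and hand-written backward pass; the input, weights, target, and squared-error loss are illustrative assumptions.

```python
import numpy as np

x = np.array([0.5, -1.0])     # input
w = np.array([0.8, 0.3])      # weights
b = 0.1                       # bias
target = 1.0

# Forward pass.
z = w @ x + b
a = 1 / (1 + np.exp(-z))            # sigmoid activation
loss = 0.5 * (a - target) ** 2      # squared-error loss

# Backward pass: chain rule, one local derivative per edge of the graph.
dloss_da = a - target
da_dz = a * (1 - a)                 # derivative of the sigmoid
grad_w = dloss_da * da_dz * x       # dz/dw = x
grad_b = dloss_da * da_dz           # dz/db = 1

print("loss", round(float(loss), 4))
print("grad_w", np.round(grad_w, 4), "grad_b", round(float(grad_b), 4))
```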
Chapter 10: Training & Optimisation
- Adam: per-parameter adaptive learning rates. Adam keeps a moving average of the gradient and the squared gradient; their ratio scales each parameter's step (code sketch below).
- Batch normalisation centres and rescales activations. Subtract the batch mean, divide by the batch standard deviation, scale and shift back.
- Dropout zeros out a random subset of activations each forward pass. Half of the neurons are silenced randomly, forcing the network to spread information across many paths.
- Learning rate schedules. Warmup, then cosine decay: the learning rate's path through training matters as much as its peak.
- Momentum lets the ball coast through narrow valleys. Plain SGD oscillates across the walls; momentum smooths the path along the valley floor.
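A minimal NumPy sketch of the Adam update rule applied to a toy quadratic objective; the decay rates shown are the commonly quoted defaults, while the objective and learning rate are illustrative assumptions.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # moving average of the gradient
    v = b2 * v + (1 - b2) * grad ** 2      # moving average of the squared gradient
    m_hat = m / (1 - b1 ** t)              # bias correction for early steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # per-parameter step size
    return theta, m, v

# Minimise f(theta) = sum(theta^2); its gradient is 2 * theta.
theta = np.array([3.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(np.round(theta, 6))    # close to the minimum at (0, 0)
```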
Chapter 11: CNNs
- 2D convolution, kernel slides over input. A 3×3 kernel sweeps a 9×9 input, filling in a feature map cell by cell (code sketch below).
- From LeNet to ResNet: depth grows, accuracy follows. LeNet-5, AlexNet, VGG, GoogLeNet, ResNet. Each year deeper, with new tricks.
- Max pooling and average pooling. A two by two patch reduces to one number. Max takes the largest, average takes the mean.
- Stacking convolutions grows the receptive field. A pixel in layer three sees a much bigger patch of the input than a pixel in layer one.
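A minimal sketch of a 2D convolution written with explicit loops (strictly, the cross-correlation that deep-learning layers compute); the random 9×9 input and the edge-detecting kernel are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(size=(9, 9))            # 9x9 input, as in the animation
kernel = np.array([[1, 0, -1],             # 3x3 vertical-edge kernel
                   [1, 0, -1],
                   [1, 0, -1]])

kh, kw = kernel.shape
out = np.zeros((9 - kh + 1, 9 - kw + 1))   # 7x7 feature map, no padding
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = image[i:i + kh, j:j + kw]
        out[i, j] = np.sum(patch * kernel)  # one feature-map cell per position
print(out.shape)
```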
Chapter 12: Sequence Models
- An RNN is the same cell, applied at every time step. The recurrence unrolled across time becomes a deep feed-forward network with shared weights.
- Attention as alignment in seq2seq. An encoder produces hidden states; the decoder weights them dynamically per output token.
- LSTM cell and the constant-error carousel. The cell state's additive path, held open by the forget gate, keeps gradients alive over long sequences.
- Multiply many small derivatives and the gradient vanishes. Repeated multiplication of fractions less than one drives the gradient toward zero in deep networks (code sketch below).
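A minimal sketch of the vanishing-gradient effect: one local derivative per layer, each below one, multiplied together. The layer count and the range of the derivatives are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
local_derivs = rng.uniform(0.1, 0.9, size=50)   # 50 layers, each derivative < 1

# Backprop multiplies one local derivative per layer; fractions below one
# shrink the product exponentially with depth.
grad = 1.0
for d in local_derivs:
    grad *= d
print(f"gradient after 50 layers: {grad:.2e}")   # vanishingly small
```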
Chapter 13: Attention & Transformers
- Causal masking forces a transformer to look only at the past. An upper-triangular mask sets future attention scores to minus infinity.
- Multiple attention heads in parallel. Each head learns a different similarity pattern. Their outputs concatenate and project to one tensor.
- Self-attention as Q–K–V dot products. Query, key and value vectors produce an attention matrix over four tokens (code sketch below).
- Sinusoidal positional encodings. Sines and cosines of many frequencies tag each position with a unique fingerprint.
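A minimal NumPy sketch of single-head self-attention over four tokens, with the causal mask included; the model width and the random projection matrices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 4, 8                                   # four tokens, width 8
X = rng.normal(size=(T, d))                   # token representations
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                 # 4x4 scaled dot-product scores
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf                        # causal: no attending to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
out = weights @ V                             # weighted mix of value vectors
print(np.round(weights, 2))                   # lower-triangular attention matrix
```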
Chapter 14: Generative Models
- Diffusion sampling, from noise to image. Start at pure Gaussian noise, denoise step by step, and structure emerges.
- GAN training is a two-player game. The generator tries to fool the discriminator. The discriminator tries not to be fooled. Equilibrium is realistic samples.
- Latent space interpolation in a VAE. Walk a straight line between two latent codes and the decoded image morphs smoothly.
- Score matching learns the gradient of log density. The score points uphill on the data density; matching it lets you sample by noisy gradient ascent.
- The forward diffusion process: adding noise step by step. An image gradually corrupted by Gaussian noise becomes pure static (code sketch below).
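A minimal NumPy sketch of the forward (noising) diffusion process, using the closed-form jump to step t; the linear beta schedule and the 8×8 stand-in image are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.uniform(0, 1, size=(8, 8))           # stand-in "image"

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear noise schedule
alpha_bar = np.cumprod(1 - betas)             # cumulative signal retention

def q_sample(x0, t):
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * noise

for t in (0, 250, 999):
    xt = q_sample(x0, t)
    print(f"t={t:4d}  remaining signal weight {np.sqrt(alpha_bar[t]):.3f}")
```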
Chapter 15: Modern AI
- Chain-of-thought prompting. Asking a model to think step by step before answering improves accuracy on multi-step problems.
- Few-shot examples teach the model in the prompt. A handful of input-output pairs in the prompt steer a frozen model to a new task.
- Inside a transformer block. Multi-head attention, a feed-forward network, residual connections and layer norm: the building block of every modern LLM.
- Mixture of experts: a router selects k specialists. Each token activates only a few experts; the network grows in capacity without growing in compute per token (code sketch below).
- Scaling laws: compute, data, and parameters jointly determine loss. Plot loss against compute on a log-log scale and you get a clean line.
- Test-time compute scaling. Accuracy as a function of inference budget for three model strengths.
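A minimal NumPy sketch of top-k routing in a mixture-of-experts layer for a single token; the expert count, k = 2, and the random "experts" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, k = 16, 8, 2
token = rng.normal(size=d)                       # one token's hidden state
W_router = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy expert weights

logits = token @ W_router                        # router scores every expert
top_k = np.argsort(logits)[-k:]                  # keep the k best specialists
gates = np.exp(logits[top_k])
gates /= gates.sum()                             # softmax over the chosen k

# Only the selected experts run; the others cost nothing for this token.
output = sum(g * (token @ experts[i]) for g, i in zip(gates, top_k))
print("experts used:", top_k, "gate weights:", np.round(gates, 2))
```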
Chapter 16: Ethics & Safety
- An adversarial example: a tiny perturbation flips the prediction. Add an imperceptible pattern to a panda image; the network now sees a gibbon with high confidence.
- Demographic parity vs equalised odds. Different fairness criteria pull a classifier in different directions, and they cannot all hold at once (code sketch below).
- Mesa-optimisation: an objective hidden inside a learned model. The base optimiser trains a model that is itself an optimiser, with its own learned objective.
- The fairness-accuracy frontier. Push fairness up and accuracy often drops. The Pareto frontier shows the best available trade-offs.
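A minimal NumPy sketch of the quantities behind the two criteria for a toy classifier: per-group positive rates (demographic parity) and per-group true and false positive rates (equalised odds). The simulated data and the classifier's group-dependent threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = rng.integers(0, 2, n)          # protected attribute: 0 or 1
y = rng.integers(0, 2, n)              # true outcome
# A toy classifier that is slightly harsher on group 1.
y_hat = (rng.random(n) < 0.6 * y + 0.2 - 0.05 * group).astype(int)

for g in (0, 1):
    sel = group == g
    pos_rate = y_hat[sel].mean()                  # P(y_hat = 1 | group)
    tpr = y_hat[sel & (y == 1)].mean()            # P(y_hat = 1 | y = 1, group)
    fpr = y_hat[sel & (y == 0)].mean()            # P(y_hat = 1 | y = 0, group)
    print(f"group {g}: positive rate {pos_rate:.3f}, TPR {tpr:.3f}, FPR {fpr:.3f}")

# Demographic parity compares the positive rates across groups;
# equalised odds compares TPR and FPR across groups.
```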
Chapter 17: Applications
- AlphaFold predicts protein structure from sequence. A chain of amino acids, drawn from twenty types, folds into a unique three-dimensional ribbon, predicted by attention.
- AlphaGo's Monte Carlo tree search. MCTS expands promising moves, simulates rollouts, and backs up scores.
- Protein folding, sequence to structure. An extended chain of residues collapses into a compact 3D structure.
- Two-tower recommendation: users and items in shared embedding space. User tower and item tower learn embeddings; relevance is the dot product (code sketch below).
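A minimal NumPy sketch of two-tower retrieval scoring: user and item embeddings in a shared space, ranked by dot product. In a real system each tower is a trained neural network; the random embeddings, dimensionality, and catalogue size here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
user_emb = rng.normal(size=d)            # output of the user tower for one user
item_emb = rng.normal(size=(1000, d))    # outputs of the item tower for the catalogue

scores = item_emb @ user_emb             # relevance = dot product in the shared space
top_items = np.argsort(scores)[-5:][::-1]
print("top recommended item ids:", top_items)
```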
This site is currently in Beta. Contact: Chris Paton
AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).