3.1 Why calculus is the engine of AI
Every modern artificial intelligence system, from the simplest spam filter to the largest language model, is at heart a single mathematical object: a function. The function takes some input (an image, a sentence, a sound clip, a row of patient data) and returns an output (perhaps a label, a probability, a sequence of words, or a numerical prediction). What makes one function clever and another useless is not its overall shape but the millions or billions of small numbers, called parameters, that sit inside it. Choose the parameters well and the function recognises tumours on a scan; choose them badly and it is hardly better than guessing. Training an AI system is, in the end, the patient business of finding good parameters.
The trouble is that we cannot find them by hand. A modern image classifier may have a hundred million parameters; a frontier language model has hundreds of billions. There is no possibility of reasoning about each one individually, nor of trying combinations at random: the number of possible settings is astronomically larger than the number of atoms in the observable universe. We need a principled procedure that nudges the parameters, all together, in a direction that makes the function behave better. That procedure is the heart of the chapter you are about to read, and it rests entirely on calculus.
Here is the basic picture. Suppose we have a way of scoring how badly the model is doing on its training examples, a single number, called the loss, that grows when the predictions are wrong and shrinks when they are right. The loss depends on the parameters: change the parameters and the loss changes. We can therefore think of the loss as a landscape, with valleys where the model performs well and hills where it performs badly. The training problem is to walk downhill, from wherever we happen to start, into the lowest valley we can find.
Calculus is what tells us, at any given point on the landscape, which way is downhill. The mathematical object that captures this idea is called the gradient. A gradient is an arrow: it points in the direction in which the loss increases fastest. To improve the model, we step in the opposite direction. Repeat this thousands or millions of times and the model gradually settles into a low place on the landscape, a setting of the parameters that fits the data well. The procedure has a name: gradient descent. Without calculus there are no gradients; without gradients there is no gradient descent; without gradient descent there is no modern AI.
Chapter 2 gave us linear algebra: vectors, matrices, dot products, eigenvalues, the language in which inputs, parameters, and intermediate quantities are most naturally described. Linear algebra is the static skeleton of a model, what the parts are and how they fit together. Calculus is the moving spirit, how those parts respond when we wiggle them, and which way to wiggle them next. The two together are the mathematical engine of training. Section 3.2 onwards develops derivatives systematically, building patiently from limits to gradients to the chain rule and finally to the algorithm called backpropagation. By the end of the chapter you will be able to read the body of a PyTorch training loop and know exactly what each line is doing.
The training problem
Almost every supervised learning task ("supervised" meaning we are given examples paired with correct answers) can be cast in the same form. We are handed a collection of $N$ training pairs $\{(\mathbf{x}_i, y_i)\}_{i=1}^N$, where $\mathbf{x}_i$ is the input (perhaps an image as a long vector of pixel intensities, perhaps the laboratory results for a single patient) and $y_i$ is the correct answer for that input (perhaps the label "malignant", perhaps the value of the patient's blood pressure six months later). Our job is to find a predictor, a function $f(\mathbf{x}; \boldsymbol{\theta})$ that maps inputs to outputs, whose predictions on the training inputs are close to the known correct answers.
The predictor is parameterised: $\boldsymbol{\theta}$ is a vector containing all of its adjustable knobs. For a single straight line the vector $\boldsymbol{\theta}$ has just two entries (slope and intercept). For a small neural network it might have a few thousand entries. For a large language model the vector has hundreds of billions of entries. The set of allowable $\boldsymbol{\theta}$ is enormous, and the great majority of choices give a useless predictor. Training is the search for a good one.
To turn "close to $y_i$" into something we can optimise, we choose a loss function $\mathcal{L}(\boldsymbol{\theta})$. The loss is a single non-negative number that summarises, across the entire training set, how badly the predictor is doing. A common choice for regression problems is the average squared error,
$$\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^N \big(f(\mathbf{x}_i; \boldsymbol{\theta}) - y_i\big)^2,$$
which grows quadratically in the gap between prediction and truth. For classification problems we typically use cross-entropy. The exact form does not matter for our present purpose. What matters is that $\mathcal{L}$ is a single number, that it depends on the parameters, and that small values mean a good model.
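In code, the average squared error is a few lines. The sketch below (plain Python; the straight-line predictor and the two data points are chosen purely for illustration) computes $\mathcal{L}$ for one setting of the parameters:

```python
def mse_loss(predict, xs, ys):
    """Average squared error of `predict` over the training pairs (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# A toy predictor with parameters theta = (w, b): f(x; w, b) = w*x + b.
w, b = 1.0, 0.0
loss = mse_loss(lambda x: w * x + b, [1.0, 2.0], [2.0, 5.0])
print(loss)  # ((1 - 2)^2 + (2 - 5)^2) / 2 = 5.0
```

Changing `w` and `b` changes `loss`; training is the search for the values that make it smallest.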
The training problem is now sharply stated: choose $\boldsymbol{\theta}$ to minimise $\mathcal{L}(\boldsymbol{\theta})$. For straight-line fits we can solve this with a pencil and a piece of paper. For anything more interesting, even a modest neural network with a single hidden layer, there is no closed-form solution. We can only iteratively improve $\boldsymbol{\theta}$, starting from a random initial guess and refining it step by step.
The natural way to refine it is to step in the direction of steepest descent, which calculus tells us is the negative gradient $-\nabla \mathcal{L}$. The update rule is
$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \nabla \mathcal{L}(\boldsymbol{\theta}_t),$$
where $\eta$, the learning rate, controls how large a step we take. Calculus gives us $\nabla \mathcal{L}$; the rest of the chapter is concerned with computing it efficiently and using it wisely.
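The update rule is a one-liner once the gradient is in hand. A minimal sketch in plain Python, treating $\boldsymbol{\theta}$ as a list of floats (the gradient values here are illustrative placeholders, not computed from any particular loss):

```python
def gradient_step(theta, grad, lr):
    """One gradient-descent update: theta_{t+1} = theta_t - lr * grad, elementwise."""
    return [t - lr * g for t, g in zip(theta, grad)]

# Illustrative values: two parameters, a made-up gradient, learning rate 0.1.
theta = [0.0, 0.0]
grad = [-12.0, -7.0]
theta = gradient_step(theta, grad, lr=0.1)
print(theta)  # approximately [1.2, 0.7]
```

Everything difficult lives in producing `grad`; that is what the rest of the chapter is about.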
A concrete example: fitting a line to two points
Abstractions become much friendlier once we have run a tiny example by hand. Take a training set with just two points, $(x_1, y_1) = (1, 2)$ and $(x_2, y_2) = (2, 5)$. Our hypothesis is the simplest possible parameterised function, a straight line
$$f(x; w, b) = wx + b,$$
with two parameters: the slope $w$ and the intercept $b$. Our loss is half the sum of squared errors,
$$\mathcal{L}(w, b) = \tfrac{1}{2}\big((w + b - 2)^2 + (2w + b - 5)^2\big).$$
The factor of one-half is a customary convenience: it cancels the 2 produced by differentiating the square, leaving cleaner formulae.
To find the best $(w, b)$ we compute the partial derivatives: calculus applied one parameter at a time. Differentiating with respect to $w$, treating $b$ as a constant, gives
$$\frac{\partial \mathcal{L}}{\partial w} = (w + b - 2) + 2(2w + b - 5) = 5w + 3b - 12.$$
Differentiating with respect to $b$, treating $w$ as a constant, gives
$$\frac{\partial \mathcal{L}}{\partial b} = (w + b - 2) + (2w + b - 5) = 3w + 2b - 7.$$
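Formulae like these are easy to mistype, so it is good practice to check an analytic gradient against a finite-difference approximation. A sketch in plain Python (the test point $(w, b) = (0.7, -0.3)$ is arbitrary):

```python
def loss(w, b):
    """Half the sum of squared errors for the two training points (1, 2) and (2, 5)."""
    return 0.5 * ((w + b - 2) ** 2 + (2 * w + b - 5) ** 2)

def grad(w, b):
    """The analytic partial derivatives derived above."""
    return (5 * w + 3 * b - 12, 3 * w + 2 * b - 7)

# Sanity check: compare against central finite differences.
h = 1e-6
w, b = 0.7, -0.3
num_dw = (loss(w + h, b) - loss(w - h, b)) / (2 * h)
num_db = (loss(w, b + h) - loss(w, b - h)) / (2 * h)
print(grad(w, b))      # analytic partials
print(num_dw, num_db)  # numerical partials; the two should agree closely
```

Section 3.13 returns to why $h$ cannot be made arbitrarily small in floating-point arithmetic.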
A minimum of a smooth function lies where the gradient is zero: both partial derivatives must vanish. So we set
$$5w + 3b = 12, \qquad 3w + 2b = 7.$$
Multiplying the first equation by 2 gives $10w + 6b = 24$; multiplying the second by 3 gives $9w + 6b = 21$. Subtracting the second from the first leaves $w = 3$. Back-substituting into $3w + 2b = 7$ gives $9 + 2b = 7$, so $b = -1$. The least-squares line is therefore $y = 3x - 1$, which passes through both training points exactly: $3 \cdot 1 - 1 = 2$ and $3 \cdot 2 - 1 = 5$.
Two lessons follow. First, calculus did the real work: by demanding $\nabla \mathcal{L} = \mathbf{0}$ we converted a vague request ("find a good line") into a pair of linear equations that any schoolchild can solve. Second, the closed-form trick worked only because the loss was quadratic and the model linear. Replace the line with a neural network, composing many linear transformations with non-linear activations, and the equation $\nabla \mathcal{L} = \mathbf{0}$ becomes a system of millions of coupled non-linear equations with no algebraic solution.
In that case we fall back on iteration. Start from any initial guess, say $(w_0, b_0) = (0, 0)$, where the gradient evaluates to $(\partial \mathcal{L}/\partial w, \partial \mathcal{L}/\partial b) = (-12, -7)$. Take a small step in the negative gradient direction: with learning rate $\eta = 0.1$ we get $(w_1, b_1) = (1.2, 0.7)$. Recompute the gradient at the new point. Step again. Repeat. After a few hundred iterations the parameters converge to $(3, -1)$, the same answer the closed-form solve gave us. Gradient descent is patient where the algebra is impossible.
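The loop just described can be written out directly. A sketch in plain Python, using the partial derivatives derived above (2,000 iterations is comfortably more than enough at this learning rate):

```python
def grad(w, b):
    # Partial derivatives of the two-point loss, as derived in the text.
    return 5 * w + 3 * b - 12, 3 * w + 2 * b - 7

w, b = 0.0, 0.0  # initial guess
eta = 0.1        # learning rate
for _ in range(2000):
    dw, db = grad(w, b)
    w, b = w - eta * dw, b - eta * db

print(round(w, 4), round(b, 4))  # → 3.0 -1.0
```

The same three lines inside the loop, scaled up to billions of parameters and with the gradient supplied by backpropagation, are the core of every training run.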
What calculus we'll need
A road map of the chapter:
- Section 3.2: limits and continuity. The foundations on which derivatives are built. We say what it means for a function to be smooth enough to differentiate.
- Section 3.3: derivatives in one variable. The familiar slope-of-the-tangent picture, plus the rules (sum, product, quotient, chain) that let us compute derivatives mechanically.
- Section 3.4: partial derivatives and gradients. What changes when the input is a vector instead of a number. The gradient is introduced as the natural generalisation of the slope.
- Section 3.5: chain rule for vector-valued functions. The single most important identity in machine learning. The chain rule explains how derivatives propagate through compositions of functions, which is what every neural network is.
- Sections 3.6 to 3.8: computational graphs and automatic differentiation. A computational graph is a picture of the calculation a network performs; automatic differentiation (whose reverse mode, applied to neural networks, is called backpropagation) is a mechanical way of applying the chain rule to that picture to get every gradient at once.
- Section 3.9: gradient descent. The optimiser proper, including the variants (momentum, RMSProp, Adam) that practitioners actually use.
- Section 3.10: Hessians and Newton's method. Second derivatives and what they tell us about curvature, including a faster but more expensive optimisation method.
- Sections 3.11 and 3.12: variational calculus and useful gradient identities. A small zoo of tricks and a glimpse of optimisation over function spaces.
- Section 3.13: numerical issues. Why floating-point arithmetic occasionally betrays us, and how to keep training numerically stable.
- Section 3.14: a hand-traced training run. A complete, by-hand training of a tiny network, so that every previous piece of theory becomes concrete.
Why calculus is so much more than "derivatives"
Many students, traumatised by school, think of calculus as a collection of rote procedures for computing slopes. That impression badly underestimates the subject. Calculus also gives us:
- Convexity. A function is convex if its graph curves upward like a bowl. When the loss is convex, gradient descent is guaranteed to find the global minimum: we can never get stuck in a poor local valley, because no such valleys exist. Linear and logistic regression have convex losses; deep neural networks do not. Knowing whether your problem is convex tells you what kind of guarantees you can expect from your optimiser, and most of the deep-learning literature can be read as a pragmatic study of how to train non-convex models well.
- Taylor expansions. Locally, almost any smooth function looks like a polynomial. The Taylor expansion makes this precise: it approximates a function near a point by its value, its gradient, its Hessian, and so on. Newton's method uses the quadratic part of the Taylor expansion to take a smarter step than vanilla gradient descent. The natural-gradient methods used in reinforcement learning use a slightly different but related idea. Even ordinary stochastic gradient descent can be analysed as following the gradient of a Taylor approximation.
- Integration. Differentiation has a partner, integration, which we shall need the moment we move from calculus into probability theory in Chapter 4. Probability densities must integrate to one; expected values are integrals; the partition functions of statistical models are integrals. Bayesian inference is, in the end, an exercise in computing or approximating high-dimensional integrals, and modern variational methods replace exact integration with optimisation problems that are themselves solved by gradient descent.
- Calculus of variations. This is the calculus not of single numbers or single vectors but of entire functions. We ask: what function makes a given quantity smallest? The Euler–Lagrange equation, central to physics, also appears in machine learning when we derive evidence lower bounds for variational autoencoders or optimal-policy equations in reinforcement learning. It is calculus, scaled up to infinite dimensions.
The full power of calculus is invisible if you think of it only as a machine for computing slopes. Every time you read a paper that proves a convergence rate, derives a regularisation penalty, justifies a loss function, bounds a gap between model and data, or analyses a learning algorithm, calculus is the language in which the argument is conducted. The chapter ahead will try to give you a working command of that language.
What you should take away
- Every modern AI system is a parameterised function, and training that system means choosing parameters that make a loss function small.
- For non-trivial models the loss has no closed-form minimiser; we must iterate, and the natural direction to iterate in is the negative of the gradient.
- The gradient, an arrow pointing in the direction of steepest increase, is a calculus object, and gradient descent is the algorithm that exploits it.
- Linear algebra (Chapter 2) supplies the static structure of the model; calculus (this chapter) supplies the dynamics of training. They are partners, and neither is sufficient on its own.
- Calculus is much more than slope-finding: convexity, Taylor expansions, integration, and the calculus of variations all reappear throughout AI, and a fluent practitioner needs to recognise them in the wild.