- Compute derivatives of common functions and interpret them as instantaneous rates of change or slopes
- Extend the derivative to multivariable functions through the gradient and describe its geometric meaning
- Apply the chain rule to compose functions and derive the algorithm used in backpropagation
- Implement gradient descent and explain how the learning rate governs convergence and stability
- Interpret integrals as accumulated quantities and recognise their role in probability and expected values
Linear algebra gives AI its vocabulary — vectors, matrices, transformations. Calculus gives it the ability to learn. At the heart of nearly every ML algorithm is an optimisation problem: find the parameters that minimise a loss function. Solving it requires computing derivatives, following gradients, and propagating error signals through chains of functions. That is calculus.
This chapter covers derivatives, gradients, the chain rule (the engine of backpropagation), gradient descent, and integrals. By the end, you will understand the calculus behind every neural network in this book.
3.1 Derivatives
The derivative of f at x measures how fast f(x) changes as x moves:
f′(x) = lim_{h→0} [f(x + h) − f(x)] / h
Geometrically, it is the slope of the tangent line. Positive means increasing; negative means decreasing; zero means a stationary point — a local extremum or, occasionally, an inflection point.
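The limit definition can be checked numerically: shrink h and the difference quotient approaches the true derivative. A minimal sketch (the function and names here are illustrative, not from the chapter):

```python
def derivative(f, x, h=1e-6):
    # Difference quotient from the definition; exact only in the limit h -> 0
    return (f(x + h) - f(x)) / h

f = lambda t: t ** 3            # f(x) = x^3, so f'(x) = 3x^2
approx = derivative(f, 2.0)
print(approx)                   # close to the exact value 3 * 2^2 = 12
```

Making h smaller improves the approximation up to a point; below roughly 1e-8 floating-point cancellation starts to dominate.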
Key Rules
- Power rule: d/dx [x^n^] = nx^n−1^
- Sum rule: derivative of a sum = sum of derivatives
- Product rule: (fg)′ = f′g + fg′
- Quotient rule: (f/g)′ = (f′g − fg′) / g²
- Exponential: d/dx [e^x^] = e^x^ (appears in softmax and probability distributions)
- Logarithm: d/dx [ln x] = 1/x (appears in cross-entropy and log-likelihoods)
Derivatives in ML
The function you differentiate is usually a loss function. For example, the mean squared error: L(w) = (1/n) Σ (yi − w^T^xi)², where w is the weight vector, xi is the i-th input, and yi is the i-th target. The derivative of L with respect to each weight tells you which direction to adjust. Moving in the direction that reduces the loss is the essence of gradient-based learning.
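The MSE gradient above has the closed form ∇L = −(2/n) Σ (yi − w^T^xi) xi, which can be verified against a finite-difference estimate. A sketch with hypothetical toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 examples, 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def loss(w):
    # Mean squared error L(w) = (1/n) sum (y_i - w.x_i)^2
    return np.mean((y - X @ w) ** 2)

def grad(w):
    # Analytic gradient: -(2/n) sum (y_i - w.x_i) x_i
    return -2 / len(y) * X.T @ (y - X @ w)

w = np.zeros(3)
# Finite-difference check of the first component:
eps = 1e-6
e0 = np.array([eps, 0.0, 0.0])
numeric = (loss(w + e0) - loss(w - e0)) / (2 * eps)
print(numeric, grad(w)[0])            # the two values agree closely
```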
Second Derivatives and Curvature
The second derivative f″(x) measures how quickly the slope itself changes. If f″ > 0, the function curves upward (a local minimum sits at f′ = 0). If f″ < 0, it curves downward (a local maximum). In higher dimensions, the Hessian matrix of second partial derivatives encodes curvature in every direction. Second-order methods (Newton's method) use the Hessian for faster convergence.
Activation Function Derivatives
Some derivatives are cheap to compute, which matters when you need billions of them:
- ReLU: derivative is 1 for x > 0 and 0 for x < 0 (undefined at x = 0, where implementations conventionally use 0). The simplest possible derivative.
- Sigmoid: σ′(x) = σ(x)(1 − σ(x)). Computed from the function value itself.
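Both derivatives above are a line or two of code — a sketch of why they are cheap (function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    # Takes the *function value* s = sigmoid(x), not x itself:
    # sigma'(x) = s * (1 - s), so no extra exp once s is cached
    return s * (1 - s)

def relu_grad(x):
    # 1 where x > 0, else 0 (x = 0 conventionally assigned 0)
    return (x > 0).astype(float)

x = np.array([-2.0, 0.5, 3.0])
s = sigmoid(x)
print(sigmoid_grad(s))
print(relu_grad(x))      # [0., 1., 1.]
```

Caching σ(x) during the forward pass and reusing it in the backward pass is exactly what deep learning frameworks do.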
Locality
The derivative is local — it describes what happens in a tiny neighbourhood. You can walk downhill step by step, but you have no guarantee of reaching the global lowest point. That is both the power and the limitation of gradient-based optimisation.
3.2 Gradients
When a function depends on many variables (as every ML loss does), you use partial derivatives. The partial derivative ∂f/∂xi measures the rate of change along one axis, holding the others fixed.
The gradient collects all partial derivatives into a vector:
∇f = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂xₙ)
Geometric Meaning
The gradient points in the direction of steepest increase. Its magnitude is the rate of increase in that direction. The negative gradient points downhill — exactly where you want to go when minimising a loss.
To build intuition, picture a two-variable loss L(w₁, w₂) as a bowl-shaped surface over the weight plane. The gradient ∇L is a 2D vector lying in the weight plane, pointing uphill. Contour lines (curves of constant L) are always perpendicular to the gradient. Walking along the gradient climbs fastest; walking along a contour changes nothing.
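The bowl picture can be checked numerically: a step along the gradient raises the loss the most, while a step along a contour direction barely changes it. A sketch with an illustrative loss L(w₁, w₂) = w₁² + 2w₂²:

```python
import numpy as np

def L(w):
    # Bowl-shaped loss over the weight plane
    return w[0] ** 2 + 2 * w[1] ** 2

def grad_L(w):
    return np.array([2 * w[0], 4 * w[1]])

w = np.array([1.0, 0.5])
g = grad_L(w)

# Rotate g by 90 degrees to get a contour (constant-L) direction
eps = 1e-4
contour_dir = np.array([-g[1], g[0]]) / np.linalg.norm(g)
uphill = L(w + eps * g / np.linalg.norm(g)) - L(w)
along = L(w + eps * contour_dir) - L(w)
print(uphill, along)   # uphill is about eps * ||g||; along is nearly 0
```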
Gradients in Neural Networks
For a network with millions of parameters, the gradient of the loss ∇θL is a million-dimensional vector. Each component says how the loss changes if you nudge that parameter. Computing this vector efficiently is the job of the chain rule and backpropagation.
Directional Derivatives and the Jacobian
The directional derivative in direction u (a unit vector) is Duf = ∇f · u. The maximum value is ‖∇f‖, reached when u = ∇f/‖∇f‖. This confirms: the gradient is the steepest direction.
For vector-valued functions f: ℝ^n^ → ℝ^m^, the analogue is the Jacobian — an m × n matrix whose (i, j) entry is ∂fi/∂xj. Each row is the gradient of one output component.
3.3 The Chain Rule
The chain rule is the single most important calculus result for ML. It tells you how to differentiate a composition of functions.
Single Variable
If y = f(g(x)), then dy/dx = f′(g(x)) · g′(x). Differentiate the outer function, evaluate at the inner, multiply by the derivative of the inner.
For deeper chains: if y = f(g(h(x))), then dy/dx = f′ · g′ · h′, each evaluated at the right point.
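The three-factor product can be verified against a finite difference. A sketch with an illustrative chain h(x) = x², g(u) = sin u, f(v) = e^v^:

```python
import math

def y(x):
    # y = f(g(h(x))) = exp(sin(x^2))
    return math.exp(math.sin(x ** 2))

def dy_dx(x):
    # Chain rule: f'(g(h(x))) * g'(h(x)) * h'(x)
    h, hp = x ** 2, 2 * x
    g, gp = math.sin(h), math.cos(h)
    fp = math.exp(g)
    return fp * gp * hp

x = 0.7
eps = 1e-6
numeric = (y(x + eps) - y(x - eps)) / (2 * eps)
print(dy_dx(x), numeric)  # the two values agree closely
```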
Why It Matters for Neural Networks
A neural network is a deeply nested composition of functions. Input → multiply by W₁ → activation → multiply by W₂ → activation → … → loss. Computing ∂L/∂W₁ requires differentiating through every subsequent layer. The chain rule provides the machinery.
Multivariate Chain Rule
For f ∘ g where g: ℝ^n^ → ℝ^m^ and f: ℝ^m^ → ℝ:
∇x(f ∘ g) = Jg^T^ ∇f, with ∇f evaluated at g(x)
The Jacobian transpose times the gradient of the outer function. This is computed layer by layer, starting from the output and working backward.
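The identity can be checked on a small example. A sketch with an illustrative g: ℝ² → ℝ² and f: ℝ² → ℝ:

```python
import numpy as np

def g(x):
    # g(x) = (x1 * x2, x1 + x2)
    return np.array([x[0] * x[1], x[0] + x[1]])

def Jg(x):
    # Jacobian of g: rows are outputs, columns are inputs
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

def f(u):
    return u[0] ** 2 + u[1] ** 2

def grad_f(u):
    return np.array([2 * u[0], 2 * u[1]])

x = np.array([0.5, -1.5])
analytic = Jg(x).T @ grad_f(g(x))     # Jacobian transpose times outer gradient

# Finite-difference gradient of the composition, one axis at a time
eps = 1e-6
numeric = np.array([
    (f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
    for e in np.eye(2)
])
print(analytic, numeric)              # the two gradients match
```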
Backpropagation
Backpropagation [Rumelhart, 1986] is just the chain rule applied systematically. The forward pass computes outputs layer by layer, saving intermediate results. The backward pass propagates the gradient from output to input, multiplying by local Jacobians at each layer. Weight gradients are computed as a byproduct. The whole thing costs roughly twice a forward pass and scales to billions of parameters.
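A minimal backprop sketch for a one-hidden-layer network with a squared loss (all sizes and values are illustrative), ending with a finite-difference check on a single weight:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)            # input
t = 1.0                           # target
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))

# Forward pass, saving intermediates for the backward pass
z1 = W1 @ x                       # pre-activation
a1 = np.maximum(z1, 0)            # ReLU
y = (W2 @ a1)[0]                  # scalar output
loss = 0.5 * (y - t) ** 2

# Backward pass: multiply by local Jacobians from output to input
dy = y - t                        # dL/dy
dW2 = dy * a1[None, :]            # dL/dW2
da1 = dy * W2[0]                  # dL/da1
dz1 = da1 * (z1 > 0)              # through the ReLU
dW1 = np.outer(dz1, x)            # dL/dW1

# Nudge one weight and redo the forward pass to confirm the gradient
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
yp = (W2 @ np.maximum(W1p @ x, 0))[0]
numeric = (0.5 * (yp - t) ** 2 - loss) / eps
print(dW1[0, 0], numeric)         # the two values agree closely
```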
Vanishing and Exploding Gradients
The chain rule multiplies many factors together. If these factors are consistently < 1, gradients shrink exponentially (vanish). If > 1, they grow exponentially (explode). Vanishing gradients mean early layers cannot learn. Exploding gradients cause instability.
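The compounding is easy to see numerically — 100 layers of per-layer factors only slightly away from 1:

```python
depth = 100
print(0.9 ** depth)   # about 2.7e-5: the gradient signal all but disappears
print(1.1 ** depth)   # about 1.4e4: the gradient blows up
```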
Fixes:
- Residual connections [He, 2016]: add the input to the output, creating a direct gradient path.
- Normalisation layers [Ioffe, 2015; Ba, 2016]: stabilise activation and gradient distributions.
- Careful initialisation [Glorot, 2010; He, 2015]: set initial weight scales to preserve gradient magnitudes.
- Gradient clipping: cap the gradient norm to prevent explosions.
These techniques made it possible to train networks with hundreds of layers.
3.4 Gradient Descent
Gradient descent turns calculus into a practical training algorithm. Start with initial parameters θ and repeat:
θnew = θold − η ∇θL
Each step moves downhill on the loss surface. η is the learning rate. You repeat until the parameters converge (the gradient becomes negligibly small) or until a computational budget is exhausted.
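The update rule is a few lines of code. A sketch on the illustrative one-dimensional loss f(x) = (x − 3)², whose minimum is at x = 3:

```python
def gradient_descent(x0, eta=0.1, steps=100):
    # f(x) = (x - 3)^2, so f'(x) = 2 * (x - 3)
    x = x0
    for _ in range(steps):
        grad = 2 * (x - 3)
        x = x - eta * grad       # theta_new = theta_old - eta * gradient
    return x

print(gradient_descent(x0=0.0))  # converges very close to 3.0
```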
The Learning Rate
Too high: the model overshoots and oscillates or diverges. Too low: convergence crawls and may get stuck. Start with a range (10^−1^ to 10^−4^) and observe. Learning rate schedules — reducing η over time — are widely used and often essential.
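The three regimes can be demonstrated on f(x) = x², where the update x ← x − η·2x multiplies x by (1 − 2η) each step (an illustrative example):

```python
def run(eta, steps=30, x=1.0):
    # Each step: x <- (1 - 2 * eta) * x
    for _ in range(steps):
        x -= eta * 2 * x
    return x

print(run(0.4))    # factor 0.2: fast convergence toward 0
print(run(0.01))   # factor 0.98: a slow crawl, still far from 0
print(run(1.1))    # factor -1.2: oscillates with growing amplitude — divergence
```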
Batch, Stochastic, and Mini-Batch
- Batch gradient descent: gradient from the full training set. Accurate but expensive.
- Stochastic gradient descent (SGD): gradient from one random example. Noisy but cheap — and in expectation it points in the right direction (SGD is an unbiased estimator of the true gradient).
- Mini-batch SGD [Robbins, 1951]: gradient from a small random subset (32–4,096 examples). The standard for deep learning. The noise can actually help, pushing the optimiser away from sharp minima toward flatter ones that generalise better.
Training is measured in epochs (one pass through the full dataset). Typical runs: tens to hundreds of epochs.
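The epoch/mini-batch loop can be sketched on least-squares regression (batch size, learning rate, and data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=1000)   # slightly noisy targets

w = np.zeros(5)
eta, batch = 0.1, 32
for epoch in range(20):                      # one epoch = one full pass
    perm = rng.permutation(len(y))           # reshuffle every epoch
    for i in range(0, len(y), batch):
        idx = perm[i:i + batch]
        Xb, yb = X[idx], y[idx]
        grad = -2 / len(idx) * Xb.T @ (yb - Xb @ w)   # mini-batch MSE gradient
        w -= eta * grad

print(np.max(np.abs(w - w_true)))            # small: w has converged near w_true
```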
Adaptive Optimisers
- Momentum: accumulate past gradients to dampen oscillations and speed up consistent directions. Like a ball rolling downhill with inertia.
- RMSProp: give each parameter its own adaptive learning rate, based on the size of recent gradients.
- Adam [Kingma, 2014]: combines momentum with per-parameter scaling. Includes bias correction for the first steps. The default optimiser for most deep learning.
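A single-parameter Adam sketch (using the defaults from the Adam paper) makes the three ingredients concrete — momentum, per-parameter scaling, bias correction. The loss f(x) = x² is an illustrative choice:

```python
def adam(grad_fn, x, eta=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = b1 * m + (1 - b1) * g          # momentum: running mean of gradients
        v = b2 * v + (1 - b2) * g ** 2     # per-parameter scale: mean of g^2
        m_hat = m / (1 - b1 ** t)          # bias correction for early steps
        v_hat = v / (1 - b2 ** t)
        x -= eta * m_hat / (v_hat ** 0.5 + eps)
    return x

print(adam(lambda x: 2 * x, x=5.0))        # ends close to the minimum at 0
```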
Limitations
Gradient descent finds local minima, not necessarily the global one. But for large neural networks, most local minima are nearly as good as the global one. A bigger problem is saddle points — where the gradient is zero but the point is neither a minimum nor a maximum. These become more common in high dimensions and can slow convergence. Understanding loss surface geometry remains an active research area.
3.5 Integrals
Derivatives dominate day-to-day training, but integrals are equally fundamental to the theory. They appear whenever you need to accumulate, average, or marginalise.
The Definite Integral
∫a^b^ f(x) dx is the signed area under f between a and b. The fundamental theorem of calculus connects differentiation and integration: if F′ = f, then ∫a^b^ f(x) dx = F(b) − F(a).
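The fundamental theorem can be checked numerically: sum thin trapezoids under the curve and compare with F(b) − F(a). A sketch for ∫0^1^ x² dx = 1/3:

```python
def trapezoid(f, a, b, n=10_000):
    # Trapezoidal rule: endpoints weighted 1/2, interior points weighted 1
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

numeric = trapezoid(lambda x: x ** 2, 0.0, 1.0)
print(numeric, 1 / 3)    # agree to many decimal places
```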
Integrals in Probability
A continuous distribution has a PDF p(x) with ∫ p(x) dx = 1. Probabilities are areas:
P(a ≤ X ≤ b) = ∫a^b^ p(x) dx
The expected value is E[X] = ∫ x p(x) dx. The variance is Var(X) = ∫ (x − E[X])² p(x) dx. These define the key quantities of statistical reasoning.
Marginalisation
A joint distribution over two continuous variables satisfies ∫∫ p(x, y) dx dy = 1. To get the distribution of X alone, integrate out Y: p(x) = ∫ p(x, y) dy. This is central to Bayesian inference, where you often integrate over latent variables or model parameters. In n dimensions, the integral becomes an n-fold integral, and the cost of evaluating it exactly can be prohibitive — motivating the approximate methods below.
Intractable Integrals
Many important integrals in ML have no closed form — posterior distributions in Bayesian networks, partition functions in energy models, the ELBO in VAEs. Two families of approximation:
- Monte Carlo: approximate E[f(X)] = ∫ f(x) p(x) dx by drawing N samples from p and averaging: (1/N) Σᵢ f(xᵢ). By the law of large numbers, this converges to the true integral. The central limit theorem tells us the error shrinks as 1/√N, regardless of dimension — this dimension-independence is Monte Carlo's key advantage over deterministic methods (which suffer from the curse of dimensionality).
- Variational methods: replace the intractable distribution with a simpler one and optimise to minimise the gap.
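The Monte Carlo estimator is a one-liner. A sketch estimating E[X²] = 1 for X ~ N(0, 1), showing the estimate tightening as N grows:

```python
import numpy as np

rng = np.random.default_rng(0)
estimates = {}
for n in (100, 10_000, 1_000_000):
    samples = rng.normal(size=n)           # draw N samples from p
    estimates[n] = np.mean(samples ** 2)   # (1/N) sum f(x_i)
    print(n, estimates[n])                 # tightens around the true value 1
```

The error at each N is on the order of √(2/N) here, consistent with the 1/√N rate from the central limit theorem.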
Both are foundational to modern probabilistic ML.
Information-Theoretic Integrals
Entropy, KL divergence, and cross-entropy are all defined by integrals:
- Entropy: H(X) = −∫ p(x) ln p(x) dx
- KL divergence: DKL(p ‖ q) = ∫ p(x) ln [p(x)/q(x)] dx
- Cross-entropy: H(p, q) = −∫ p(x) ln q(x) dx
These quantities drive the loss functions of many generative models. Integration is not a side topic — it is part of the theoretical fabric of AI.