Chapter Three

Calculus

Learning Objectives
  1. Compute derivatives of common functions and interpret them as instantaneous rates of change or slopes
  2. Extend the derivative to multivariable functions through the gradient and describe its geometric meaning
  3. State and apply the multivariable chain rule using Jacobians, and explain how it underpins backpropagation
  4. Trace forward and backward passes through a computational graph, and implement a tiny reverse-mode autograd engine
  5. Implement gradient descent and explain how the learning rate, momentum, and curvature govern convergence
  6. Derive the gradients of the standard loss functions used in deep learning (MSE, softmax + cross-entropy, regularised quadratics)
  7. Diagnose autograd pitfalls (in-place ops, detach, computational-graph leaks) and verify gradients numerically

Linear algebra gives AI its vocabulary: vectors, matrices, transformations. Calculus gives it the ability to learn. At the heart of nearly every modern machine-learning system is a single computational pattern: a scalar loss is computed by composing many simple functions, then a gradient is taken with respect to the parameters (often billions of them) so that the parameters can be nudged downhill. The composing happens on the forward pass. The differentiating happens on the backward pass. The nudging is gradient descent. That entire loop is calculus, executed at industrial scale.
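
The loop just described can be sketched in a few lines of plain Python. This is only an illustration under assumed toy conditions: the data, the one-parameter model `w * x`, the learning rate, and the helper name `loss_and_grad` are all made up for this example, and the gradient is derived by hand rather than by an autograd engine.

```python
# Toy data, assumed for illustration: targets generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def loss_and_grad(w):
    """Return the mean-squared error of the model w * x and its derivative dL/dw."""
    n = len(xs)
    # Forward pass: compose prediction -> residual -> square -> mean.
    residuals = [w * x - y for x, y in zip(xs, ys)]
    loss = sum(r * r for r in residuals) / n
    # Backward pass: chain rule by hand, dL/dw = (2/n) * sum(residual * x).
    grad = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
    return loss, grad

w = 0.0               # initial parameter (arbitrary starting point)
learning_rate = 0.05  # hypothetical step size chosen for this toy problem

for step in range(100):
    loss, grad = loss_and_grad(w)
    w -= learning_rate * grad  # the nudge: one gradient-descent step downhill

final_loss, _ = loss_and_grad(w)
print(w, final_loss)
```

Real systems differ in scale rather than in shape: the forward pass composes many more functions, and the backward pass is produced automatically instead of being written out by hand.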

The aim of this chapter is to take a reader who is comfortable with first-year calculus (limits, derivatives, the single-variable chain rule, basic integrals) and bring them to a place where the equations on the next two hundred pages of this book read fluently. By the end you will:

  • Know exactly what backpropagation is, and why it is just the chain rule traversed in reverse on a graph.
  • Be able to derive the gradients you will see again and again: $\nabla_x (a^\top x)$, $\nabla_x (x^\top A x)$, the gradient of softmax composed with cross-entropy, the gradient of mean-squared error.
  • Have built a tiny autograd engine in Python from scratch, in fewer than two hundred lines of code.
  • Understand when forward-mode automatic differentiation is appropriate and when reverse-mode is, and why deep learning lives almost entirely in the latter.
  • Recognise the standard pitfalls (vanishing and exploding gradients, in-place operations, broken graphs) and know how to diagnose them.
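
As a preview of the gradients named in the second bullet, the results those derivations arrive at are standard. They are stated here under assumed conventions that the text above does not fix: $A$ is a square matrix that need not be symmetric, $z$ is a vector of logits with $p = \mathrm{softmax}(z)$, $y$ is a one-hot target in the cross-entropy case, and mean-squared error is averaged over the $n$ entries of the prediction $\hat{y}$.

$$\nabla_x (a^\top x) = a, \qquad \nabla_x (x^\top A x) = (A + A^\top)\, x$$

$$\nabla_z \Big( -\sum_k y_k \log p_k \Big) = p - y$$

$$\nabla_{\hat{y}} \, \frac{1}{n} \lVert \hat{y} - y \rVert^2 = \frac{2}{n} (\hat{y} - y)$$

When $A$ is symmetric the second result reduces to $2Ax$. Other normalisations of mean-squared error (for example a leading factor of $\tfrac{1}{2}$) change only the constant.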

Pen and paper are still the right tools for several sections of this chapter. There is no substitute for working a small backpropagation example by hand at least once.

