Visualisation

Gradients flow backward through the layers

Last reviewed 5 May 2026

The forward pass produces a loss. The backward pass propagates the gradient through every layer, multiplying local Jacobians.

From Chapter 9: Neural Networks

Glossary: backpropagation, chain rule

Transcript

A neural network is a stack of layers. Forward pass: input flows left to right, each layer transforms it, the last layer produces a loss.
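
A minimal sketch of that forward pass in NumPy. The two fully connected layers, the ReLU activation, and the mean squared error loss are illustrative assumptions, not the exact network in the visualisation.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 3))            # batch of 4 inputs, 3 features each (illustrative shapes)
    t = rng.normal(size=(4, 2))            # targets
    W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
    W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

    # Forward pass: the input flows layer by layer to a scalar loss.
    h_pre = x @ W1 + b1
    h = np.maximum(h_pre, 0)               # ReLU
    y = h @ W2 + b2
    loss = 0.5 * np.mean((y - t) ** 2)     # mean squared error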

To train, we need the gradient of the loss with respect to every weight.

Backpropagation is the chain rule applied recursively, layer by layer.
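
Written out for the two-layer sketch above, with hidden activation h = f1(x; W1) and output y = f2(h; W2) (notation assumed for illustration), the chain rule reads, in LaTeX notation:

    \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial y}\,\frac{\partial y}{\partial h}\,\frac{\partial h}{\partial W_1}

Backpropagation evaluates this product from the left, reusing each partial product as the upstream gradient handed to the layer below.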

Start at the output. Compute the loss gradient with respect to the final layer's input. Call this the upstream gradient.
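
In the sketch above, with the mean squared error loss, that starting upstream gradient is:

    # d/dy [ 0.5 * mean((y - t)^2) ] = (y - t) / number_of_elements
    upstream = (y - t) / y.size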

At each layer, the upstream gradient combines with the layer's local Jacobian to produce two things: the gradient with respect to the layer's parameters, used for the update, and the gradient with respect to the layer's input, which becomes the upstream gradient for the layer below.
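
For the fully connected output layer in the sketch, y = h @ W2 + b2, the backward step produces exactly those two things:

    # Gradient with respect to the layer's parameters, used for the update.
    dW2 = h.T @ upstream
    db2 = upstream.sum(axis=0)

    # Gradient with respect to the layer's input: this becomes the
    # upstream gradient for the layer below.
    dh = upstream @ W2.T

    # ReLU's local derivative: 1 where the pre-activation was positive, else 0.
    dh_pre = dh * (h_pre > 0)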

Watch the gradient flow: the small numbers in the upstream gradient, multiplied by each layer's local derivatives, push backward toward the input.

After the gradient has visited every layer, every parameter has its gradient and the update can be applied.
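
Completing the sweep in the sketch: the first layer's gradients, then a plain gradient descent update for every parameter (the learning rate is an arbitrary choice for illustration):

    dW1 = x.T @ dh_pre
    db1 = dh_pre.sum(axis=0)

    lr = 0.1
    for param, grad in [(W1, dW1), (b1, db1), (W2, dW2), (b2, db2)]:
        param -= lr * grad                 # in-place SGD update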

The forward pass costs one matrix multiply per fully connected layer. The backward pass costs two: one for the parameter gradient, one for the input gradient.

Backpropagation is roughly twice as expensive as the forward pass, but it computes derivatives for millions of weights in a single sweep. This efficiency, more than anything else, is what made deep learning practical.
