Visualisation

Adam: per-parameter adaptive learning rates

Last reviewed 5 May 2026

Adam keeps a moving average of the gradient and of the squared gradient; the ratio of the two sets each parameter's step size at every update.

From Chapter 10: Training & Optimisation


Transcript

Stochastic gradient descent uses one learning rate for every parameter. The same number multiplies every gradient.
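
A minimal sketch of that update (the names params, grads, and lr are my own, not from the transcript):

    import numpy as np

    def sgd_step(params, grads, lr=0.01):
        # One scalar learning rate scales every component of the gradient vector.
        return params - lr * grads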

Adam adapts. It gives each parameter its own effective step size.

Two running averages. The first moment, an exponential moving average of the gradient. The second moment, an exponential moving average of the squared gradient.
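
A rough sketch of those two updates for a single step, with m and v as my own labels for the two averages and beta1, beta2 for their decay rates:

    def update_moments(m, v, g, beta1=0.9, beta2=0.999):
        # First moment: exponential moving average of the gradient.
        m = beta1 * m + (1 - beta1) * g
        # Second moment: exponential moving average of the squared gradient.
        v = beta2 * v + (1 - beta2) * g * g
        return m, v

Both updates are elementwise, so every parameter carries its own pair of averages.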

The first moment is like momentum: it smooths out noise across consecutive steps. The second moment tracks the recent magnitude of the gradient, how large it has typically been over the last few steps.

Each step, Adam divides the first moment by the square root of the second. Parameters with large recent gradients get a small effective step. Parameters with small recent gradients get a large effective step.
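
Continuing the same sketch, the step itself is an elementwise division (eps is the small stability constant mentioned under hyperparameters below):

    def adam_update(params, m, v, lr=1e-3, eps=1e-8):
        # Large recent gradients give a large v and therefore a small effective
        # step; small recent gradients give a large effective step.
        return params - lr * m / (v ** 0.5 + eps)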

Watch a 2D loss surface, badly conditioned. SGD bounces along the steep walls. Adam takes balanced steps in both directions, finding the valley quickly.
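
A small self-contained experiment in that spirit, assuming a toy bowl that is 100 times steeper in x than in y; the surface, learning rates, and step counts are illustrative choices, not taken from the visualisation:

    import numpy as np

    def loss(p):
        # Badly conditioned quadratic: steep along x, shallow along y.
        return 0.5 * (100.0 * p[0] ** 2 + p[1] ** 2)

    def grad(p):
        return np.array([100.0 * p[0], p[1]])

    def run_sgd(p, lr=0.019, steps=100):
        # One learning rate: kept small enough not to diverge along x, so y crawls.
        for _ in range(steps):
            p = p - lr * grad(p)
        return p

    def run_adam(p, lr=0.1, b1=0.9, b2=0.999, eps=1e-8, steps=100):
        m, v = np.zeros_like(p), np.zeros_like(p)
        for t in range(1, steps + 1):
            g = grad(p)
            m = b1 * m + (1 - b1) * g
            v = b2 * v + (1 - b2) * g * g
            m_hat = m / (1 - b1 ** t)   # bias correction, explained below
            v_hat = v / (1 - b2 ** t)
            p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
        return p

    start = np.array([1.0, 1.0])
    print("SGD  loss:", loss(run_sgd(start)))
    print("Adam loss:", loss(run_adam(start)))

The exact numbers depend on the choices above, but the qualitative picture is the one described here: SGD's single learning rate is held hostage by the steep direction, so progress along the shallow one is slow, while Adam rescales each axis by its own gradient history.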

Adam also applies a bias correction at the start: the moving averages are initialised at zero, so for the first few steps they underestimate the true gradient statistics until they have warmed up.
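
A sketch of that correction (t counts steps starting at 1; beta1 and beta2 are the decay rates again):

    def bias_correct(m, v, t, beta1=0.9, beta2=0.999):
        # The averages start at zero, so early on they underestimate the true
        # statistics; dividing by (1 - beta**t) compensates, and the factor
        # fades towards 1 as t grows.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        return m_hat, v_hat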

Hyperparameters: a global learning rate, the two decay rates for the moments, and a small epsilon for numerical stability. The defaults work surprisingly often.
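
The widely used defaults are a learning rate of 0.001, decay rates of 0.9 and 0.999, and an epsilon of 1e-8; PyTorch's torch.optim.Adam, for instance, ships with those values. A brief usage sketch, with a placeholder model purely for illustration:

    import torch

    model = torch.nn.Linear(10, 1)   # placeholder model
    optimiser = torch.optim.Adam(
        model.parameters(),
        lr=1e-3,             # global learning rate
        betas=(0.9, 0.999),  # decay rates for the first and second moments
        eps=1e-8,            # numerical-stability constant
    )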

Adam is the workhorse optimiser for transformers, language models, and most deep learning today. The intuition is one line: parameter-wise gradient normalisation with momentum.
