Further reading
- Goodfellow, Bengio, and Courville, Deep Learning (MIT Press, 2016). Chapters 6--8 cover MLPs, optimisation, and regularisation in detail.
- Bishop, Pattern Recognition and Machine Learning (Springer, 2006). Chapter 5 derives backpropagation cleanly.
- Russell and Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2020). Chapters 21--22 cover learning from a unified perspective.
- Karpathy's micrograd and makemore video lectures: the clearest introduction to autograd from scratch.
- He et al. (2016), Ioffe and Szegedy (2015), Vaswani et al. (2017): three landmark papers whose ideas (residual connections, batch normalisation, attention) underlie almost every modern deep network.
- Loshchilov and Hutter (2019), Decoupled Weight Decay Regularization (the AdamW paper). The single best reference for understanding why an L2 penalty in the loss and decoupled weight decay are not interchangeable under Adam (see the sketch after this list).
- Smith (2017), Cyclical Learning Rates for Training Neural Networks. The learning-rate finder, in print.
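A minimal sketch of the coupled-versus-decoupled distinction the AdamW paper makes, assuming PyTorch; the model, data, and coefficients below are placeholders, not a prescription:

```python
import torch

# Toy model and batch, only to make the two regularisation styles concrete.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

# (a) Coupled: an L2 penalty added to the loss (equivalently, Adam's
#     weight_decay argument). Its gradient passes through Adam's adaptive
#     step sizes, so the effective decay varies per parameter.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = torch.nn.functional.mse_loss(model(x), y)
loss = loss + 1e-2 * sum(p.pow(2).sum() for p in model.parameters())
opt.zero_grad()
loss.backward()
opt.step()

# (b) Decoupled (AdamW): the decay is applied directly to the weights,
#     outside the adaptive update, as Loshchilov and Hutter advocate.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```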
The neural-network toolkit assembled here is the basis for the rest of the book. Whatever flavour of model you encounter, the same primitives apply: a parameterised function, a loss, a gradient, an optimiser, and a way to keep training stable and well-regularised. The remaining chapters add domain-specific structure (convolutions for images, attention for sequences, sampling for generation), but the spine is the one we have built.
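For completeness, a minimal sketch of the loop those primitives form, again assuming PyTorch; the model, data, and hyperparameters are placeholders:

```python
import torch

# Placeholder parameterised function, loss, and optimiser.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
loss_fn = torch.nn.MSELoss()
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in for a real data loader
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)                      # parameterised function + loss
    loss.backward()                                  # gradient via autograd
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # one stability measure
    optimiser.step()                                 # optimiser update; decoupled weight decay handles regularisation
```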