Further reading

  • Goodfellow, Bengio, and Courville, Deep Learning (MIT Press, 2016). Chapters 6--8 cover MLPs, optimisation, and regularisation in detail.
  • Bishop, Pattern Recognition and Machine Learning (Springer, 2006). Chapter 5 derives backpropagation cleanly.
  • Russell and Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2020). Chapters 21--22 cover learning from a unified perspective.
  • Karpathy's micrograd and makemore video lectures: the clearest introduction to building an autograd engine and small neural networks from scratch.
  • He et al. (2016), Ioffe and Szegedy (2015), Vaswani et al. (2017): three landmark papers whose ideas (residual connections, batch normalisation, attention) underlie almost every modern deep network.
  • Loshchilov and Hutter (2019), Decoupled Weight Decay Regularization (the AdamW paper). The single best reference for understanding why L2 regularisation and weight decay are not interchangeable under Adam; a short sketch of the distinction follows this list.
  • Smith (2017), Cyclical Learning Rates for Training Neural Networks. The learning-rate finder, in print.
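
To make the AdamW point concrete, here is a minimal PyTorch sketch; the parameter and hyperparameter values are placeholders chosen only for illustration. In Adam, the weight_decay argument folds an L2 penalty into the gradient, so the adaptive per-parameter scaling also rescales the decay; in AdamW the decay is applied directly to the weights, outside the adaptive update.

    import torch

    # A single placeholder parameter; any model's parameters would do.
    w = torch.nn.Parameter(torch.randn(10, 10))

    # Coupled: the L2 penalty enters the gradient, so Adam's adaptive scaling
    # also rescales the decay -- parameters with large gradient histories are
    # effectively decayed less.
    adam_l2 = torch.optim.Adam([w], lr=1e-3, weight_decay=1e-2)

    # Decoupled (AdamW): the decay is applied directly to the weights, outside
    # the adaptive step, so every parameter shrinks at the same relative rate.
    adamw = torch.optim.AdamW([w], lr=1e-3, weight_decay=1e-2)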

The neural-network toolkit assembled here is the basis for the rest of the book. Whatever flavour of model you encounter, the same primitives apply: a parameterised function, a loss, a gradient, an optimiser, and a way to keep training stable and well-regularised. The remaining chapters add domain-specific structure (convolutions for images, attention for sequences, sampling for generation), but the spine is the one we have built.
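
To make that spine concrete, here is a minimal training-loop sketch in PyTorch. The architecture, data, and hyperparameters are placeholders; only the shape of the loop is the point.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder model and data -- any parameterised function and dataset fit here.
    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 3))
    loss_fn = nn.CrossEntropyLoss()                       # the loss
    optimiser = torch.optim.AdamW(model.parameters(),
                                  lr=3e-4, weight_decay=1e-2)  # the optimiser
    X, y = torch.randn(256, 20), torch.randint(0, 3, (256,))
    loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

    for epoch in range(5):
        for xb, yb in loader:
            logits = model(xb)                 # the parameterised function
            loss = loss_fn(logits, yb)         # the loss
            optimiser.zero_grad()
            loss.backward()                    # the gradient
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # keep training stable
            optimiser.step()                   # the optimiser's update

Dropout and weight decay stand in for "well-regularised", gradient clipping for "stable"; the later chapters swap the model for richer architectures while this loop stays essentially unchanged.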
