Further reading
- Goodfellow, Bengio, and Courville, Deep Learning (MIT Press, 2016). Chapters 6--8 cover MLPs, optimisation, and regularisation in detail.
- Bishop, Pattern Recognition and Machine Learning (Springer, 2006). Chapter 5 derives backpropagation cleanly.
- Russell and Norvig, Artificial Intelligence: A Modern Approach (4th ed., 2020). Chapters 21--22 cover learning from a unified perspective.
- Karpathy's micrograd and makemore video lectures: the clearest introduction to autograd from scratch.
- He et al. (2016), Ioffe and Szegedy (2015), Vaswani et al. (2017): three landmark papers whose ideas (residual connections, batch normalisation, attention) underlie almost every modern deep network.
- Loshchilov and Hutter (2019), Decoupled Weight Decay Regularization (the AdamW paper). The single best reference for understanding why an L2 penalty in the loss and decoupled weight decay are not interchangeable under Adam (see the sketch after this list).
- Smith (2017), Cyclical Learning Rates for Training Neural Networks. The learning-rate finder, in print.
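A minimal sketch of the coupled-versus-decoupled distinction the AdamW paper makes, assuming PyTorch; the model, data, and coefficients below are placeholders, not a prescription:

```python
import torch

# Toy model and batch, only to make the two regularisation styles concrete.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

# (a) Coupled: an L2 penalty added to the loss (equivalently, Adam's
#     weight_decay argument). Its gradient passes through Adam's adaptive
#     step sizes, so the effective decay varies per parameter.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = torch.nn.functional.mse_loss(model(x), y)
loss = loss + 1e-2 * sum(p.pow(2).sum() for p in model.parameters())
opt.zero_grad()
loss.backward()
opt.step()

# (b) Decoupled (AdamW): the decay is applied directly to the weights,
#     outside the adaptive update, as Loshchilov and Hutter advocate.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
```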
The neural-network toolkit assembled here is the basis for the rest of the book. Whatever flavour of model you encounter, the same primitives apply: a parameterised function, a loss, a gradient, an optimiser, and a way to keep training stable and well-regularised. The remaining chapters add domain-specific structure (convolutions for images, attention for sequences, sampling for generation), but the spine is the one we have built.
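For completeness, a minimal sketch of the loop those primitives form, again assuming PyTorch; the model, data, and hyperparameters are placeholders:

```python
import torch

# Placeholder parameterised function, loss, and optimiser.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
)
loss_fn = torch.nn.MSELoss()
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in for a real data loader
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)                      # parameterised function + loss
    loss.backward()                                  # gradient via autograd
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # one stability measure
    optimiser.step()                                 # optimiser update; decoupled weight decay handles regularisation
```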