Chapter Nine

Neural Networks

Learning Objectives
  1. Explain the perceptron model, its update rule, and the historical XOR limitation
  2. Assemble multilayer feed-forward networks and compute forward propagation
  3. Compare activation functions (sigmoid, tanh, ReLU, GELU) and their gradient properties
  4. Derive the backpropagation algorithm as a recursive application of the chain rule
  5. Recognise common network architectures (MLP, CNN, RNN, Transformer) and describe the universal approximation theorem

In 2012, a neural network learned to recognise cats in frames taken from YouTube videos. No one had told it what a cat looked like. The system had learned on its own, from raw pixels, by adjusting millions of numerical dials until the right patterns emerged. That moment captured something deep about how these models work: they discover structure in data that humans never explicitly describe.

Neural networks are loosely inspired by biological brains. They consist of layers of simple processing units that transform raw inputs into predictions. The field's history swings between hype and disillusionment. Rosenblatt's perceptron promised thinking machines in the 1950s. Minsky and Papert's 1969 critique of the perceptron stalled research for more than a decade. Backpropagation revived it in the 1980s, kernel methods overshadowed it in the 1990s, and the deep learning explosion of the 2010s made neural networks dominant. Today they power vision, language, speech, robotics, and game playing.
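To make "layers of simple processing units" concrete before the formal treatment, here is a minimal sketch in plain Python. The function names `unit` and `layer`, the weights, and the choice of ReLU as the nonlinearity are all assumptions made for illustration; nothing here is learned, and it is not the chapter's NumPy implementation.

```python
def unit(inputs, weights, bias):
    """One processing unit: a weighted sum of its inputs, then a ReLU."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return max(0.0, total)


def layer(inputs, weight_rows, biases):
    """A layer is a list of units that all read the same inputs."""
    return [unit(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hand-picked (not learned) numbers, chosen so the arithmetic is easy to check.
x = [1.0, 2.0]

# Hidden layer with two units.
#   unit 1: 0.5*1 + (-1.0)*2 + 0.0 = -1.5, which ReLU clips to 0.0
#   unit 2: 1.0*1 +   1.0*2 - 1.0 =  2.0
hidden = layer(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, -1.0])

# Output layer with one unit: 1.0*0.0 + 0.5*2.0 + 0.25 = 1.25
output = layer(hidden, [[1.0, 0.5]], [0.25])

print(hidden, output)
```

Stacking calls to `layer` in this way is all that "multilayer" means in this sketch. Learning would consist of adjusting the weights and biases rather than fixing them by hand.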

This chapter is long because the material is foundational. We move from the single neuron through the multilayer perceptron to a full implementation in NumPy that classifies handwritten digits. Along the way we derive backpropagation rigorously, work through several numerical examples by hand, build a tiny automatic differentiation engine, examine why initialisation and normalisation are essential rather than ornamental, and compare activation functions and regularisers. By the end you will have written a network that learns from raw pixels and you will understand exactly why the gradients that drove its learning have the form they do.

A note on what we will not do. We will not justify every heuristic by appeal to a theorem; many tricks of the trade (warmup, gradient clipping thresholds, the precise value of dropout probability) emerged from empirical practice and have only partial theoretical backing. We will state the theory where it is decisive, the empirical evidence where it is overwhelming, and we will say so plainly when something works for reasons we still do not fully understand. Deep learning is, at this stage of its history, partly a science and partly a craft. The chapter teaches both.
