Chapter Nine

Neural Networks

Learning Objectives
  1. Explain the perceptron model, its update rule, and the historical XOR limitation
  2. Assemble multilayer feed-forward networks and compute forward propagation
  3. Compare activation functions (sigmoid, tanh, ReLU, GELU) and their gradient properties
  4. Derive the backpropagation algorithm as a recursive application of the chain rule
  5. Recognise common network architectures (MLP, CNN, RNN, Transformer) and describe the universal approximation theorem

In 2012, a neural network identified a cat in a YouTube video. No one had told it what a cat looked like. The system had learned on its own, from raw pixels, by adjusting millions of numerical dials until the right patterns emerged. That moment captured something deep about how these models work: they discover structure in data that humans never explicitly describe.

Neural networks are loosely inspired by biological brains. They consist of layers of simple processing units that transform raw inputs into predictions. The field's history swings between hype and disillusionment. Rosenblatt's perceptron [Rosenblatt, 1958] promised thinking machines in the 1950s. Minsky and Papert's critique [Minsky, 1969] froze research for a decade. Backpropagation [Rumelhart, 1986] revived it in the 1980s, kernel methods overshadowed it in the 1990s, and the deep learning explosion of the 2010s made neural networks dominant [Krizhevsky, 2012; Goodfellow, 2016]. Today they power vision, language, speech, and game playing.

This chapter moves from the single neuron to the multilayer network. You will learn the mathematical foundations that make learning possible: activation functions, backpropagation, gradient descent, and key architectures. You will also meet the universal approximation theorem, which explains why depth and nonlinearity together can model virtually any function.

9.1   The Perceptron

A single neuron can draw a line between two classes.

The perceptron, introduced by Rosenblatt in 1958 [Rosenblatt, 1958], is the simplest neural network. It takes a vector of inputs x = (x_1, ..., x_d), computes a weighted sum z = w^T x + b, and applies a step function. If z >= 0, the output is 1; otherwise, 0. Geometrically, this carves a hyperplane through d-dimensional space. Points on one side are positive, points on the other negative. The perceptron is a linear classifier, limited to problems where a flat surface separates the two classes.

The perceptron learning algorithm fixes mistakes one at a time. For each misclassified example, you update the weight vector: w <- w + eta(y_i - ŷ_i)x_i, where eta is the learning rate. The perceptron convergence theorem guarantees convergence in a finite number of updates, provided the data is linearly separable. The bound on updates depends on two quantities:

  • The margin -- the minimum distance from any training point to the decision boundary
  • The radius -- the extent of the data

This margin-based reasoning foreshadows support vector machines.
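The update rule above fits in a few lines of NumPy. The following is an illustrative sketch, not production code; the AND dataset is an assumed toy example, chosen because it is linearly separable and so the convergence theorem applies:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron learning: update the weights on each misclassified example."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0   # step activation
            if y_hat != yi:
                w = w + eta * (yi - y_hat) * xi   # w <- w + eta(y_i - ŷ_i)x_i
                b = b + eta * (yi - y_hat)
                errors += 1
        if errors == 0:       # a full clean pass: every point classified correctly
            break
    return w, b

# AND is linearly separable, so convergence is guaranteed
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
preds = (X @ w + b >= 0).astype(int)
```

The loop terminates as soon as one epoch passes with no mistakes, which is exactly the condition the convergence theorem guarantees will eventually hold for separable data.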

Rosenblatt's perceptron generated enormous excitement. The New York Times declared in 1958 that the Navy had built a machine that could "walk, talk, see, write, reproduce itself and be conscious of its existence." The backlash was equally dramatic. In 1969, Minsky and Papert published Perceptrons [Minsky, 1969], proving that a single-layer perceptron cannot compute XOR or any function that is not linearly separable. This result triggered the first "AI winter." What they also acknowledged -- though it was often overlooked -- was that multilayer networks could compute XOR. The problem was that no one knew how to train them.

You can view the perceptron through the lens of optimisation. It performs stochastic gradient descent on the perceptron loss: L(w) = -Σ_{i in M} y_i(w^T x_i + b), where M is the set of misclassified examples. This loss is piecewise linear; although M changes as w changes, each term can be written as the hinge max(0, -y_i(w^T x_i + b)), so the loss is convex. The convergence theorem guarantees termination when a solution exists. Replace the hard step with a smooth sigmoid and the perceptron loss with cross-entropy, and you get logistic regression (Section 7.2). The perceptron is the historical seed from which both classical statistical models and modern deep networks grew.

Despite its limitations, the perceptron remains essential for building intuition. It introduces the core vocabulary of neural computation:

  • Weights and biases
  • Activation functions
  • Learning rules

It crystallises the key insight that learning is optimisation: adjusting parameters to reduce error. The perceptron also highlights the trade-off between capacity and learnability. A model that is too simple cannot represent complex functions. A model that is too complex may be hard to train or prone to overfitting. The history of neural networks is largely the story of navigating this trade-off.

9.2   Multilayer Networks

Stack layers, and the limitations vanish.

A multilayer network -- also called a multilayer perceptron (MLP) -- has an input layer, one or more hidden layers, and an output layer. Every unit in one layer connects to every unit in the next (fully connected). The math proceeds by forward propagation:

  • Input x enters the first hidden layer.
  • That layer computes h_1 = sigma(W_1 x + b_1), where W_1 is the weight matrix, b_1 the bias vector, and sigma a nonlinear activation.
  • Each subsequent layer transforms the previous output: h_l = sigma(W_l h_{l-1} + b_l).
  • The final layer produces the prediction, using sigmoid for binary classification, softmax for multi-class, or identity for regression.
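The steps above can be sketched directly in NumPy. The layer sizes and initial weight scale here are illustrative assumptions, not values from the text:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def forward(x, params):
    """Forward propagation: h_l = sigma(W_l h_{l-1} + b_l), layer by layer."""
    h = x
    for W, b in params[:-1]:
        h = relu(W @ h + b)          # hidden layers: nonlinear activation
    W, b = params[-1]
    return softmax(W @ h + b)        # output layer: softmax for multi-class

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                 # input dim 4, two hidden layers, 3 classes
params = [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
probs = forward(rng.normal(size=4), params)   # a valid probability vector
```

Each `(W, b)` pair is one fully connected layer; the output sums to one because of the softmax.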

The hidden layers give the network its power. Each one turns the input into a more abstract form. In an image classifier, the first layer might detect edges. The second might combine them into textures and shapes. Deeper layers might spot object parts and whole objects. This layered feature learning is the key gain of deep networks. The network finds its own features from raw data, with no hand-crafting.

The network's capacity -- the richness of functions it can represent -- grows with both width (units per layer) and depth (number of layers). These contribute differently:

  • Width adds more basis functions, enabling finer-grained approximation within a single layer.
  • Depth enables composition, representing complex mappings as a sequence of simpler transformations.

Theoretical results show that certain functions can be represented exponentially more efficiently with deep networks than with shallow ones. This provides formal justification for the empirical superiority of depth in vision and language.

Training involves several choices. The loss function measures error: MSE for regression, cross-entropy for classification. The optimiser sets how weights change given gradients. SGD [Robbins, 1951] is the baseline; Adam [Kingma, 2014] is the modern default, adjusting the learning rate per weight. Weight initialisation matters. All-zero weights give a symmetric network that cannot learn. Weights that are too large cause saturated outputs. Xavier initialisation [Glorot, 2010] and He initialisation [He, 2015] set the initial scale to keep activations stable across layers.
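To see why initial scale matters, a small sketch of He initialisation (std = sqrt(2/fan_in), intended for ReLU layers) shows the mean-square activation staying near the input variance instead of collapsing or exploding. The layer width and sample count here are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 256
x = rng.normal(size=(1000, fan_in))          # unit-variance inputs

# He initialisation: std = sqrt(2 / fan_in), designed for ReLU layers
W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_in))
h = np.maximum(0.0, x @ W)                   # one ReLU layer

# the mean-square activation stays near the input variance (about 1.0),
# so stacking many such layers neither collapses to zero nor blows up
scale = (h ** 2).mean()
```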

Regularisation prevents the network from memorising training data. The main techniques are:

  • Dropout [Srivastava, 2014] -- randomly zeroes a fraction of hidden units during training, forcing redundant representations
  • Batch normalisation [Ioffe, 2015] -- normalises activations within each mini-batch to zero mean and unit variance, stabilising training
  • Early stopping -- halts training when validation loss begins to rise
  • Weight decay (L2 regularisation) -- penalises large weights
  • Data augmentation -- expands the training set with transformed copies

Together, these techniques control generalisation in deep networks.
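As one concrete example, dropout is commonly implemented in the "inverted" form: survivors are scaled up at training time so no correction is needed at test time. A minimal sketch, with an assumed drop probability of 0.5:

```python
import numpy as np

def dropout(h, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p during training
    and scale survivors by 1/(1-p), so no rescaling is needed at test time."""
    if not training or p == 0.0:
        return h
    mask = (rng.random(h.shape) >= p) / (1.0 - p)
    return h * mask

rng = np.random.default_rng(0)
h = np.ones(10000)
out = dropout(h, p=0.5, rng=rng)
# units are either dropped (0.0) or scaled (2.0); the mean stays near 1.0
```

Because the expected activation is unchanged, the same forward pass works at test time with `training=False`.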

9.3   Activation Functions

Remove the nonlinearity, and depth is useless.

Without a nonlinear step, stacking layers does nothing. Any chain of linear functions is itself linear. A hundred-layer linear network is no more powerful than a single layer. The activation function is what unlocks the network's power.

The sigmoid σ(z) = 1/(1 + e^(−z)) was the early default. It maps the real line to (0, 1). Its derivative is σ′ = σ(1 − σ). The tanh maps to (−1, 1), centring outputs and helping gradient flow. Both suffer from vanishing gradients. When |z| is large, the slope drops to near zero. Learning stops for saturated units. In deep networks, where gradients multiply across many layers, this gets worse fast.

The rectified linear unit (ReLU) changed everything. Defined as ReLU(z) = max(0, z), it was popularised by Nair and Hinton in 2010 [Nair, 2010], though studied decades earlier. ReLU offers three advantages:

  • It is computationally trivial -- just a threshold comparison.
  • It does not saturate for positive inputs (gradient is exactly 1).
  • It induces sparsity -- negative pre-activations output exactly zero.

These properties make ReLU networks faster to train and less prone to vanishing gradients. The main drawback is the dying ReLU problem. If a unit's bias shifts sufficiently negative, it outputs zero for all training inputs. Its gradient is also zero, so it never recovers.

Several variants fix the dying ReLU problem while keeping its benefits:

  • Leaky ReLU: LeakyReLU(z) = max(alpha·z, z), with alpha typically 0.01, allowing a small gradient for negative inputs
  • Parametric ReLU (PReLU): treats alpha as a learnable parameter
  • Exponential linear unit (ELU): uses an exponential curve for negative inputs -- ELU(z) = z if z > 0, and alpha(e^z − 1) otherwise -- producing smooth outputs that push mean activation towards zero
  • Scaled ELU (SELU) [Klambauer, 2017]: calibrated so activations self-normalise under certain constraints, eliminating the need for batch normalisation

GELU [Hendrycks, 2016] is defined as GELU(z) = z · Φ(z), where Φ is the standard normal CDF. It applies smooth gating, weighing each input by its Gaussian likelihood. It blends ReLU's sparsity with sigmoid's smoothness. GELU is the default in Transformers (BERT, GPT). Swish (z · σ(z)), found through automated search [Ramachandran, 2017], has similar traits. In practice, the best choice depends on the task, though theory steers you away from vanishing gradients.
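The gradient properties above are easy to check numerically. A small sketch, using the common tanh approximation of GELU (the exact form uses the Gaussian CDF Φ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)               # sigma' = sigma(1 - sigma)

def relu(z):
    return np.maximum(0.0, z)

def gelu(z):
    # tanh approximation of z * Phi(z), widely used in practice
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
sig_grad = d_sigmoid(z)   # near zero in both tails: vanishing gradients
relu_out = relu(z)        # exact zeros for negative inputs: sparsity
gelu_out = gelu(z)        # smooth gate: ~0 for very negative z, ~z for large z
```

The sigmoid gradient peaks at 0.25 (at z = 0) and is essentially zero at |z| = 10, while ReLU passes large positive inputs through with gradient exactly 1.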

For the output layer, the task dictates the activation:

  • Sigmoid for binary classification
  • Softmax for multi-class classification
  • Identity (no activation) for regression
  • tanh for pixel values normalised to [-1, 1] in generative models
  • Sigmoid again for any output constrained to [0, 1]

The distinction matters. Hidden-layer activations are chosen for gradient flow and representational power. Output-layer activations are chosen for probabilistic or domain-specific correctness. Conflating the two is a common beginner mistake.

9.4   Backpropagation

Backpropagation is the algorithm that makes deep learning possible.

It computes the gradient of the loss with respect to every weight, letting you train models with millions or billions of parameters. The idea was discovered several times: by Bryson and Ho (1969), by Werbos [Werbos, 1974], and most famously by Rumelhart, Hinton, and Williams [Rumelhart, 1986]. At its core, backprop is the chain rule applied to the network's computation graph.

Consider a network with L layers. The forward pass computes activations layer by layer:

  • Pre-activation: z_l = W_l h_{l-1} + b_l
  • Post-activation: h_l = sigma(z_l)
  • Input: h_0 = x

The loss L is computed from the output h_L and the true target y. The backward pass propagates the gradient from output to input. At each layer, you compute partial derivatives dL/dW_l and dL/db_l. The key recursive relationship is the error signal at layer l: delta_l = dL/dz_l = (W_{l+1}^T delta_{l+1}) ⊙ sigma'(z_l), where ⊙ denotes element-wise multiplication. The weight gradient follows: dL/dW_l = delta_l h_{l-1}^T.
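The recursion can be sketched for a two-layer network with a sigmoid hidden layer, identity output, and squared-error loss. The sizes are illustrative assumptions; the backward pass follows the error-signal recursion exactly:

```python
import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    """One forward/backward pass for a two-layer network: sigmoid hidden
    layer, identity output, squared-error loss L = 0.5 * ||h2 - y||^2."""
    # forward pass
    z1 = W1 @ x + b1
    h1 = 1.0 / (1.0 + np.exp(-z1))
    z2 = W2 @ h1 + b2
    h2 = z2                               # identity output (regression)
    loss = 0.5 * np.sum((h2 - y) ** 2)
    # backward pass: delta_l = (W_{l+1}^T delta_{l+1}) * sigma'(z_l)
    delta2 = h2 - y                       # dL/dz2 for identity + squared error
    delta1 = (W2.T @ delta2) * h1 * (1.0 - h1)
    grads = {"W2": np.outer(delta2, h1), "b2": delta2,
             "W1": np.outer(delta1, x), "b1": delta1}
    return loss, grads

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
loss, grads = forward_backward(x, y, W1, b1, W2, b2)
```

Perturbing a single weight by a small epsilon and re-running the forward pass gives a finite-difference estimate that should match the analytic gradient; this is the standard sanity check for any backprop implementation.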

The beauty is speed. A brute-force method — nudge each weight, measure the change — would need O(P) forward passes for a network with P parameters. Backprop gets the full gradient in one backward pass, costing about twice a forward pass. It reuses work: derivatives from deeper layers feed into shallower ones (dynamic programming). Modern frameworks (PyTorch, TensorFlow, JAX) do this via automatic differentiation. They build a graph during the forward pass and traverse it in reverse — no manual math needed.

Several numerical issues arise in practice:

  • Vanishing gradients: gradients shrink exponentially through many layers, especially with saturating activations (sigmoid, tanh)
  • Exploding gradients: the converse -- gradients grow exponentially, causing overflow and instability
  • Gradient clipping: rescaling the gradient vector when its norm exceeds a threshold, a simple fix for explosions
  • Residual connections (skip connections): adding a layer's input directly to its output, creating a "gradient highway" that enables training of networks with hundreds or thousands of layers

In practice, you pair backprop with SGD or a variant. Instead of using the full training set, SGD estimates the gradient from a random mini-batch (32–512 examples). The noise helps — it lets the algorithm escape shallow traps and saddle points. Adam tracks running averages of each weight's gradient, adapting the learning rate to the loss surface. Learning rate schedules (step decay, cosine annealing, warmup) further improve results.
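A minimal sketch of mini-batch SGD on a toy linear-regression problem; the synthetic data, learning rate, batch size, and step count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic regression data: y = X @ w_true + small noise
w_true = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(2)
lr, batch_size = 0.1, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)   # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ w - yb) / batch_size         # gradient of 0.5 * MSE
    w -= lr * grad                                   # SGD step
```

Each step sees only 32 of the 1000 examples, so the gradient is a noisy estimate, yet the iterates still settle near the true weights.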

Backprop solved the central critique from Minsky and Papert: you could not train multilayer models. It turned neural networks from curiosities into practical tools. Today it runs trillions of times per day on the world's GPUs. It trains models that write text, recognise faces, fold proteins, and steer vehicles. The concept is just the chain rule, applied with care. The impact has been vast.

9.5   Network Architectures

Different data shapes demand different network shapes.

The fully connected architecture from Section 9.2 is the most general topology. But generality has a cost. Parameter count grows quadratically with layer width. Every feature connects to every hidden unit, ignoring spatial, temporal, or relational structure. For images, sequences, and graphs, this is wasteful and even counterproductive.

Convolutional neural networks (CNNs) [LeCun, 1998] exploit the spatial structure of images. They replace fully connected layers with convolutional layers. A small learnable filter (kernel) slides across the input to produce a feature map. This introduces two inductive biases:

  • Translation equivariance: a pattern is detected regardless of where it appears
  • Local connectivity: each output depends only on a small input region

Pooling layers downsample the feature maps, adding translation invariance and reducing spatial dimensions. The standard CNN architecture alternates convolutional and pooling layers, followed by dense layers for classification. AlexNet's victory in the 2012 ImageNet competition [Krizhevsky, 2012] established this approach as the default in computer vision.
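The sliding-filter operation can be sketched with explicit loops (frameworks implement the same cross-correlation convention, just heavily optimised). The 1×2 difference kernel below is an assumed example of a hand-crafted edge detector, standing in for what a CNN would learn:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution in the cross-correlation convention used by
    most deep-learning frameworks: slide the kernel over the image."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# a hand-crafted vertical-edge detector: responds where intensity jumps
image = np.zeros((5, 5))
image[:, 2:] = 1.0                  # dark left half, bright right half
kernel = np.array([[-1.0, 1.0]])    # 1x2 horizontal difference filter
edges = conv2d(image, kernel)       # peaks in the column where the edge sits
```

The same small kernel is applied at every position, which is exactly the weight sharing that gives translation equivariance.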

Recurrent neural networks (RNNs) handle sequential data. They maintain a hidden state updated at each time step: h_t = sigma(W_hh h_{t-1} + W_xh x_t + b). This recurrence lets information persist across time steps, modelling temporal dependencies in text, speech, and time series. Vanilla RNNs suffer from severe vanishing and exploding gradients over long sequences. Long short-term memory (LSTM) networks [Hochreiter, 1997] address this with a gating mechanism that controls information flow through a memory cell, enabling long-range dependencies. Gated recurrent units (GRUs) [Cho, 2014] are a simplified variant with fewer parameters and comparable performance.

The transformer [Vaswani, 2017] dispenses with recurrence entirely. Instead, self-attention mechanisms model dependencies between all positions in a sequence simultaneously. This parallelism makes training dramatically faster than sequential RNNs. The attention mechanism computes a weighted sum of value vectors. The weights come from the compatibility (dot product) between query and key vectors, scaled by 1/sqrt(d_k) to prevent softmax saturation. Transformers now dominate natural language processing (BERT [Devlin, 2019], GPT, T5 [Raffel, 2019]) and are spreading to vision (ViT [Dosovitskiy, 2020]), audio, and multimodal domains.
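Scaled dot-product attention is short enough to sketch directly; the sequence length and d_k below are illustrative assumptions:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key compatibility
    scores -= scores.max(axis=-1, keepdims=True)    # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to one
    return weights @ V, weights                     # weighted sum of values

rng = np.random.default_rng(0)
n, d_k = 4, 8                          # 4 positions, illustrative dimension
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, weights = attention(Q, K, V)
```

Every output position mixes information from all four input positions in a single matrix multiply, which is why the computation parallelises so well compared with a step-by-step recurrence.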

Residual networks (ResNets) [He, 2016] added skip connections: h_{l+1} = sigma(W_l h_l + b_l) + h_l. This simple change enabled training of networks with over 100 layers. Gradients flow directly through skip connections without attenuation. The residual pattern now appears everywhere: in CNNs, in transformers (every sublayer uses it), and in diffusion models. Other notable innovations include:

  • DenseNet [Huang, 2017]: each layer receives outputs from all preceding layers
  • MobileNet [Howard, 2017]: depthwise separable convolutions for efficiency
  • Neural architecture search (NAS): using reinforcement learning or evolution to discover optimal architectures automatically
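The residual pattern itself is one line. A sketch with ReLU as the assumed activation; note that with zero weights the block reduces exactly to the identity, which is the degenerate case of the "gradient highway" property:

```python
import numpy as np

def residual_block(h, W, b):
    """h_{l+1} = sigma(W h + b) + h, with ReLU as the assumed sigma.
    The added identity term gives gradients a direct path backwards."""
    return np.maximum(0.0, W @ h + b) + h

rng = np.random.default_rng(0)
h = rng.normal(size=5)
# with zero weights the layer contributes nothing: the block is the identity
out_identity = residual_block(h, np.zeros((5, 5)), np.zeros(5))
out = residual_block(h, rng.normal(size=(5, 5)), np.zeros(5))
```

Because a freshly initialised block starts close to the identity, stacking many of them does not destroy the signal, which is part of why very deep ResNets train at all.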

Architecture choice is often the most consequential decision in deep learning. Domain-specific designs encode inductive biases that improve sample efficiency and generalisation: CNNs for images, transformers for sequences, graph neural networks for molecules and social networks. Yet the transformer's successful invasion of domains far beyond NLP suggests that sufficient scale can sometimes compensate for missing domain structure. This tension lies at the heart of research into foundation models.

9.6   Universal Approximation Theorem

You can build any continuous function from a single hidden layer -- in theory.

Cybenko [Cybenko, 1989] and Hornik et al. [Hornik, 1989] proved it independently. The theorem: a network with one hidden layer of finite width and a sigmoidal activation can approximate any continuous function on a bounded region to any desired accuracy. For any continuous f and any ε > 0, there exists a single-hidden-layer network g such that the maximum gap |f(x) − g(x)| < ε everywhere on the domain.

The intuition is geometric. Each hidden unit defines a hyperplane in input space via its weights and bias. The activation function "turns on" as the input crosses that hyperplane. Combine many units with different orientations and thresholds, and you build a patchwork of overlapping responses. Summed with appropriate output weights, this patchwork approximates any target surface. For sigmoid activations, each unit contributes a smooth "step." For ReLU, each contributes a "hinge," building piecewise linear approximations. More units yield finer resolution, much like more terms in a Fourier series yield better approximation of a periodic function.
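The hinge construction can be made concrete: a one-hidden-layer ReLU network with one unit per knot reproduces the piecewise-linear interpolant of a target function. A sketch approximating f(x) = x² on [0, 1]; the knot placement is an illustrative choice:

```python
import numpy as np

def relu_approx(x, knots, f):
    """One-hidden-layer ReLU network reproducing the piecewise-linear
    interpolant of f: one hidden unit 'switches on' at each knot."""
    ys = f(knots)
    slopes = np.diff(ys) / np.diff(knots)   # slope on each segment
    out = np.full_like(x, ys[0])
    prev_slope = 0.0
    for k, s in zip(knots[:-1], slopes):
        out += (s - prev_slope) * np.maximum(0.0, x - k)  # one hinge per knot
        prev_slope = s
    return out

x = np.linspace(0.0, 1.0, 101)
knots = np.linspace(0.0, 1.0, 9)            # 9 knots -> 8 linear segments
approx = relu_approx(x, knots, np.square)   # target f(x) = x^2
err = np.max(np.abs(approx - x ** 2))       # worst-case gap on the grid
```

For a twice-differentiable target like this one, halving the knot spacing roughly quarters the worst-case error, mirroring the "more units yield finer resolution" claim above.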

You must understand what the theorem does not guarantee. It is a pure existence result. It says nothing about:

  • How to find the right weights
  • How many units you need (width may grow exponentially with input dimension)
  • Whether gradient-based training will converge
  • Whether the network will generalise to unseen data
  • Whether training is computationally tractable

Universal approximation tells you the network is expressive enough. It does not tell you it is learnable or efficient.

Later work extended the original result in important directions. Leshno et al. [Leshno, 1993] dropped the boundedness and sigmoidality assumptions on the activation: any continuous non-polynomial activation suffices, encompassing ReLU and its variants. Lu et al. (2017) proved a width-bounded version: networks of width d + 4, for input dimension d, can approximate any Lebesgue-integrable function, provided depth is allowed to grow. This highlights a fundamental width-versus-depth trade-off. A single wide layer suffices in principle. But deeper, narrower networks can represent certain functions exponentially more efficiently.

This trade-off has real consequences. Deep networks consistently outperform shallow ones on complex tasks, even with the same total parameter count. Telgarsky [Telgarsky, 2016] and Eldan and Shamir (2016) formalised this. They exhibited functions computable by depth-k networks with polynomial width that require exponential width at depth k − 1. These depth separation results explain why depth matters in practice. It lets the network compose features hierarchically, building complex representations from simpler ones far more efficiently than a flat, wide architecture.

The theorem holds a unique place in deep learning. It tells you the network is rich enough — in theory — to solve any regression or classification task on continuous data. But going from theory to practice requires everything else: backprop, SGD, regularisation, good architecture, and massive compute. The theorem is not a recipe for success. It is a licence to search. The needle is in the haystack. Finding it takes skill, experience, and a lot of GPUs.