Add hidden units one by one and watch the approximation tighten.
From Chapter 9: Neural Networks
Glossary: universal approximation theorem
Transcript
The universal approximation theorem. A neural network with one hidden layer of sufficient width can approximate any continuous function on a compact domain, to any desired accuracy.
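In symbols, the standard (Cybenko/Hornik) form of the approximator is a weighted sum of shifted, scaled activations; the notation below is the usual textbook one, not from the transcript. Here g is the target function, K the compact domain, N the number of hidden units, and sigma a suitable non-polynomial (for example sigmoidal) activation:

```latex
f_N(x) \;=\; \sum_{i=1}^{N} v_i\,\sigma\!\left(w_i^{\top}x + b_i\right) + c,
\qquad
\sup_{x \in K}\,\bigl|\,f_N(x) - g(x)\,\bigr| \;<\; \varepsilon .
```

The theorem guarantees that for every ε > 0 some width N and some choice of weights achieve this bound.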
A target curve, say a clean sine wave. Try to fit it.
One hidden unit, one bend. The fit is a single sigmoid step. Most of the target is missed.
Two units. Two bends, two sigmoids combined. A rough approximation.
Five units. The shape begins to emerge. Peaks appear in roughly the right places.
Twenty units. The fit is visually indistinguishable from the target. Each unit contributes a small bend, and their weighted sum traces the curve.
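A minimal sketch of that progression, assuming a plain NumPy implementation with the widths from the demo (1, 2, 5, 20); the learning rate, step count, and initialisation are illustrative choices, not values from the transcript:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 200).reshape(-1, 1)   # sample points on the domain
y = np.sin(x)                                        # the target: a clean sine wave

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(width, steps=30_000, lr=0.1):
    """Train f(x) = V @ sigmoid(W x + b) + c by full-batch gradient descent."""
    W = rng.normal(scale=1.5, size=(1, width))    # slope of each unit's bend
    b = rng.uniform(-2.0, 2.0, size=(1, width))   # where each bend sits
    V = rng.normal(scale=0.1, size=(width, 1))    # how much each bend contributes
    c = np.zeros((1, 1))
    n = len(x)
    for _ in range(steps):
        h = sigmoid(x @ W + b)        # hidden activations
        err = h @ V + c - y           # residual against the target
        # Gradients of the mean squared error
        dV = h.T @ err / n
        dc = err.mean(axis=0, keepdims=True)
        dz = (err @ V.T) * h * (1.0 - h) / n
        dW = x.T @ dz
        db = dz.sum(axis=0, keepdims=True)
        W -= lr * dW; b -= lr * db; V -= lr * dV; c -= lr * dc
    return float(np.mean((sigmoid(x @ W + b) @ V + c - y) ** 2))

# The error typically tightens as units are added, mirroring the demo.
for width in (1, 2, 5, 20):
    print(f"{width:>2} hidden units -> mean squared error {fit(width):.4f}")
```

Plotting the network output against y at each width reproduces the picture from the demo: each added unit contributes one more bend to the weighted sum.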
The theorem says: with enough units, any continuous function. The proof is a density argument: the set of such networks is dense, under the uniform norm, in the space of continuous functions on the domain. It does not say how many units are needed, nor how to find the weights.
In practice, very wide single hidden layers are inefficient. For many families of target functions, deep, narrow networks reach the same expressive power with exponentially fewer parameters.
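A quick parameter count makes the contrast concrete. The layer sizes below are illustrative choices, not figures from the transcript, and counting parameters alone says nothing about whether the two networks fit a given target equally well:

```python
def n_params(sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(m * n + n for m, n in zip(sizes, sizes[1:]))

print(n_params([1, 100_000, 1]))         # one very wide hidden layer: 300,001 parameters
print(n_params([1, 64, 64, 64, 64, 1]))  # four narrow hidden layers: 12,673 parameters
```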
The theorem matters because it answers an existence question. Yes, neural networks are expressive enough. The harder question is optimisation. Can gradient descent actually find the right weights, given enough data, in reasonable time? That is what the rest of deep learning is about.