Glossary

Universal Approximation Theorem

The Universal Approximation Theorem is the theoretical bedrock for the expressive power of neural networks. Proved independently by Cybenko (1989) and Hornik, Stinchcombe, and White (1989), it states that a feedforward network with a single hidden layer containing a finite number of units, equipped with a non-constant, bounded, and continuous activation function, can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy, provided the hidden layer is sufficiently wide. Leshno et al. (1993) later showed the boundedness requirement can be dropped, so any non-polynomial activation suffices—including ReLU.
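The flavour of the theorem can be demonstrated numerically. The sketch below (a minimal NumPy illustration, not a proof; all parameter choices are illustrative) fixes random hidden weights for a single-hidden-layer ReLU network, fits only the linear output layer by least squares against $\sin(2\pi x)$ on $[0,1]$, and shows the approximation error shrinking as the hidden layer widens:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 1000)[:, None]   # compact domain [0, 1]
target = np.sin(2 * np.pi * x[:, 0])       # continuous function to approximate

# One wide bank of random ReLU hidden units; narrower networks use a prefix
# of its columns, so the feature sets are nested and the fitted error is
# monotonically non-increasing in width.
W = rng.normal(0.0, 10.0, size=(1, 500))
B = rng.normal(0.0, 10.0, size=500)
hidden_all = np.maximum(x @ W + B, 0.0)

def rms_error(width):
    """Least-squares fit of the output weights for the first `width` units."""
    H = hidden_all[:, :width]
    coef, *_ = np.linalg.lstsq(H, target, rcond=None)
    return float(np.sqrt(np.mean((H @ coef - target) ** 2)))

errors = [rms_error(w) for w in (5, 50, 500)]
print(errors)  # error decreases as the hidden layer widens
```

Only the output layer is trained here, which sidesteps the optimisation caveats discussed below: the point is purely that a wide enough hidden layer contains a good approximation, exactly as the theorem asserts.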

What the theorem does not guarantee is equally important. It is a pure existence result: it says that some network exists, but not how to find it, how many units are needed (which may be exponentially large in the input dimension), or whether gradient-based training will converge. It provides no guarantees about generalisation. In short, universal approximation tells us neural networks are expressive enough, but not that they are learnable or efficient.

Subsequent work has refined the theorem. Depth separation results show that certain functions can be represented by networks of depth $k$ with polynomial width but require exponential width at depth $k-1$, providing theoretical support for the empirical superiority of depth. Width-bounded universal approximation theorems show that ReLU networks of width $n+4$ (where $n$ is the input dimension) suffice if depth is allowed to grow. Together, these results help explain why deep networks are so powerful: depth enables efficient hierarchical composition of features.
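The depth-separation phenomenon can be made concrete with the classic tent-map construction (a hedged sketch of Telgarsky-style separation, not taken from this glossary): one application of the tent map costs two ReLU units, and each further composition doubles the number of linear pieces, so a network of depth $k$ realises $2^k$ pieces with only $O(k)$ units, while a single hidden layer of $m$ ReLU units can produce at most $m+1$ pieces.

```python
import numpy as np

def tent(x):
    # The tent map on [0, 1], written with two ReLU units:
    # tent(x) = 2*relu(x) - 4*relu(x - 0.5)
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

x = np.linspace(0.0, 1.0, 100001)  # grid containing all dyadic breakpoints
y = x.copy()
pieces_by_depth = []
for depth in range(1, 5):
    y = tent(y)                          # one more layer of 2 ReLU units
    slope_signs = np.sign(np.diff(y))    # +1 on rising pieces, -1 on falling
    pieces = int(np.sum(np.diff(slope_signs) != 0)) + 1
    pieces_by_depth.append(pieces)

print(pieces_by_depth)  # [2, 4, 8, 16]: pieces double with each extra layer
```

Matching the $2^k$ pieces at depth one would therefore require a hidden layer with exponentially many units, which is the shape of the depth separation results mentioned above.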

Related terms: Neural Network, Deep Learning, Multilayer Perceptron

Discussed in:

Also defined in: Textbook of AI