9.5 The universal approximation theorem
How expressive is a multilayer perceptron? Could there be functions that, no matter how cleverly we choose its weights, an MLP simply cannot represent? The answer, given by a pair of results from 1989, is reassuring: a single-hidden-layer network with enough neurons can approximate any continuous function to any desired precision. The technical name is the universal approximation theorem, and it is the theoretical justification for using neural networks at all. In plain words it says that the family of functions an MLP can express is rich enough, in principle, to represent anything we might want it to compute, from the price of a house given its features to the position of a robot's arm given its joint angles.
That is the headline. The fine print, however, demands attention. "Enough neurons" can mean astronomically many, so many, in some cases, that the network would not fit on any conceivable computer. Worse, the theorem is purely an existence result: it asserts that the right weights are out there somewhere, but it gives no algorithm for finding them. It says nothing about whether stochastic gradient descent will converge to those weights, nor about whether a network that fits the training data will generalise to new examples. Universal approximation is therefore best understood as a foundation rather than a guarantee: it tells us neural networks are not in principle the wrong tool, but it does not tell us they will work in practice. The rest of this chapter, and much of Chapter 10, is concerned with the gap between in-principle and in-practice.
This section closes the question of "what can networks represent?" before §9.6 turns to "how do we train them?".
What "approximate" actually means
Before we can state precisely what universal approximation guarantees, we need to be precise about what we mean by "$\hat f$ is close to $f$". There are several reasonable notions of closeness for functions. Two functions might be close on average, in the sense that the integral of their squared difference is small; or they might be close in probability, in the sense that the probability of their disagreeing by more than a small amount is small. The notion used in the universal approximation theorem is the strictest of the natural choices: closeness uniformly, in the worst case across the entire input set.
The supremum norm, written $\|f - \hat f\|_\infty$, makes this precise. For two functions $f$ and $\hat f$ on a domain $K$, the supremum norm is
$$\|f - \hat f\|_\infty = \sup_{\mathbf{x} \in K} \, |f(\mathbf{x}) - \hat f(\mathbf{x})|.$$
In words: take the absolute pointwise difference $|f(\mathbf{x}) - \hat f(\mathbf{x})|$ at every point $\mathbf{x}$ in $K$, and the supremum norm is the largest such difference (or the smallest upper bound on those differences, if no actual largest exists). If $\|f - \hat f\|_\infty \le 0.01$, then no matter what input the user supplies, even the worst-case adversarial input, the network's output is within $0.01$ of the truth. There is no place in $K$ where the network is allowed to make a large error in exchange for being very accurate elsewhere; every input must be handled to the agreed tolerance.
A small worked example clarifies the idea. Take $f(x) = \sin(x)$ on $K = [0, \pi]$ and the cubic Taylor approximation $\hat f(x) = x - x^3/6$. The pointwise error $|f(x) - \hat f(x)|$ is zero at $x = 0$ and grows as $x$ increases. To find the supremum norm we look for the largest value of this error on $[0, \pi]$. By inspection (or by differentiating the error and solving), the largest error occurs at $x = \pi$, giving $|\sin(\pi) - (\pi - \pi^3/6)| = |0 - (\pi - \pi^3/6)| \approx 2.026$. So $\|f - \hat f\|_\infty \approx 2.026$ on this domain, a poor approximation, dominated by the worst point. If instead we restricted ourselves to $K = [0, 1]$, the worst-case error would drop to about $0.008$ at $x = 1$, and the cubic Taylor polynomial would count as an excellent approximation.
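For readers who prefer to check such numbers directly, the short NumPy sketch below estimates the supremum norm by taking the maximum over a fine grid; this is only a numerical stand-in for the true supremum, and the grid size is an arbitrary choice.

```python
import numpy as np

def sup_norm_on_grid(f, f_hat, a, b, num=10_000):
    """Approximate ||f - f_hat||_inf on [a, b] by the maximum over a fine grid."""
    xs = np.linspace(a, b, num)
    return np.max(np.abs(f(xs) - f_hat(xs)))

def taylor_cubic(x):
    return x - x**3 / 6   # cubic Taylor approximation of sin(x)

print(sup_norm_on_grid(np.sin, taylor_cubic, 0.0, np.pi))  # ~2.026, worst case at x = pi
print(sup_norm_on_grid(np.sin, taylor_cubic, 0.0, 1.0))    # ~0.008, worst case at x = 1
```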
This worst-case framing is much stronger than, say, low average error. A function that is almost perfect everywhere but spikes badly at a single point can still have low average error; it cannot have low supremum-norm error. For a network to satisfy $\|f - \hat f\|_\infty \le \epsilon$ it must be uniformly accurate. That is the standard the universal approximation theorem promises to meet.
Cybenko (1989) and Hornik et al. (1989): the formal statement
We can now state the theorem precisely.
Theorem (Cybenko 1989, paraphrased). Let $\sigma$ be a continuous sigmoidal activation, meaning $\sigma(z) \to 0$ as $z \to -\infty$ and $\sigma(z) \to 1$ as $z \to \infty$. Let $K \subset \mathbb{R}^n$ be a compact set and let $f \in C(K)$ be a continuous function on $K$. Then for every $\epsilon > 0$ there exist a positive integer $N$, real numbers $c_i, b_i$ and vectors $\mathbf{w}_i \in \mathbb{R}^n$ for $i = 1, \ldots, N$, such that the function
$$\hat f(\mathbf{x}) = \sum_{i=1}^{N} c_i \, \sigma\!\bigl(\mathbf{w}_i^\top \mathbf{x} + b_i\bigr)$$
satisfies $\|f - \hat f\|_\infty \le \epsilon$.
Read carefully, this is the formal version of the headline. The function $\hat f$ is exactly a one-hidden-layer MLP: each hidden neuron $i$ takes the affine combination $\mathbf{w}_i^\top \mathbf{x} + b_i$, passes it through the sigmoidal activation $\sigma$, and then the outputs are weighted by $c_i$ and summed. There is one hidden layer, $N$ neurons in it, and a single linear output. The theorem says that for any continuous $f$ on a bounded domain, and any tolerance $\epsilon$, suitable values of $N$, $\mathbf{w}_i$, $b_i$ and $c_i$ exist that make this network as close to $f$ as we wish in the supremum norm.
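To make the form of $\hat f$ concrete, here is a minimal NumPy sketch of the family of functions the theorem quantifies over: one hidden layer of $N$ sigmoid units followed by a linear readout. The particular parameter values below are random placeholders; the theorem only asserts that good values exist.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shallow_mlp(x, W, b, c):
    """One-hidden-layer network f_hat(x) = sum_i c_i * sigmoid(w_i . x + b_i).

    x: input vector of shape (n,)
    W: hidden weights of shape (N, n); b: hidden biases of shape (N,)
    c: output weights of shape (N,)
    """
    return c @ sigmoid(W @ x + b)

# Example with arbitrary (untrained) parameters: N = 3 hidden units, n = 2 inputs.
rng = np.random.default_rng(0)
W, b, c = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=3)
print(shallow_mlp(np.array([0.5, -1.0]), W, b, c))
```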
The quantifier structure is essential and worth memorising: for every continuous $f$ and every tolerance $\epsilon > 0$, there exist an integer $N$ and parameters making the approximation hold. The order matters. The theorem does not say there is a single network that approximates every $f$, nor that a single $N$ works for every $\epsilon$. As $\epsilon$ shrinks, $N$ may grow without bound; as $f$ becomes more complicated, $N$ may grow too.
Hornik, Stinchcombe and White (1989) proved a closely related result at almost the same time, showing that the specific sigmoid shape is not essential; subsequent refinements relaxed the requirement to any non-constant, bounded, continuous activation. Leshno, Lin, Pinkus and Schocken (1993) gave the cleanest extension of all. They showed that a continuous activation function gives universal approximation if and only if it is non-polynomial. Sigmoid, tanh, ReLU, GELU, SiLU, and softplus all qualify; a polynomial activation of degree $d$ would not, because a one-hidden-layer network built from it can only produce polynomials of degree at most $d$, which span a finite-dimensional subspace of $C(K)$ and therefore cannot approximate arbitrary continuous functions. This non-polynomial criterion is the reason every modern activation function we have met is some non-polynomial squashing or rectifying nonlinearity; it is not a stylistic choice but a theoretical requirement.
Several details of the theorem are easy to miss on a first reading. First, $f$ must be continuous: the theorem says nothing about discontinuous targets such as a step function. Second, $K$ must be compact, which in $\mathbb{R}^n$ means closed and bounded. The theorem cannot guarantee uniform approximation on all of $\mathbb{R}^n$; for example, no network of the form above can uniformly approximate $f(x) = x^2$ for arbitrarily large $|x|$, because a finite sum of bounded sigmoids is itself bounded while $f$ is not. Third, $N$ depends on $\epsilon$ and on $f$: smaller error and more wiggly target functions both require more hidden units. Fourth, the theorem is purely about the existence of suitable parameters; it gives no algorithm for finding them, no bound on how long any algorithm would take, and no information about the geometry of the loss landscape that gradient descent might explore. It is, in the technical sense, a statement about expressive capacity rather than about learning.
An intuition for why this is true
The full proof of Cybenko's theorem is functional-analytic and uses the Hahn–Banach theorem. We give instead an informal one-dimensional sketch that conveys why the result should be plausible.
A foundational fact from real analysis is that any continuous function on a compact set is uniformly continuous. In ordinary language, there is a single notion of "small change in input" that guarantees "small change in output" everywhere on $K$ at once, not merely near each point separately. As a consequence, $f$ can be approximated by a piecewise constant function, a sort of staircase that holds a fixed value over each of a finite number of small bins. The more bins, the closer the staircase fits the continuous target.
Now consider a single sigmoid neuron with one input. The function $\sigma(wx + b)$ is approximately $0$ when $wx + b$ is very negative, approximately $1$ when $wx + b$ is very positive, and transitions between the two near $x = -b/w$. By making $w$ large we make the transition sharp; the result looks like a smoothed step function whose location is controlled by $b$ and whose direction is controlled by the sign of $w$. By taking two such steps at nearby locations and subtracting one from the other (that is, giving them output weights of opposite sign), we can construct a smoothed bump, a function that is approximately a constant height over a narrow interval and approximately zero elsewhere.
The construction now writes itself. To approximate $f$ on $[0, 1]$ with $N$ bins, place a smoothed bump of width $1/N$ at each bin centre $x_i = (i - 1/2)/N$, and weight the $i$-th bump by the value $f(x_i)$. The sum of these weighted bumps is a smoothed staircase whose height in bin $i$ is approximately $f(x_i)$. Because $f$ is uniformly continuous, the staircase is within some $\epsilon(N)$ of $f$ everywhere, where $\epsilon(N) \to 0$ as $N \to \infty$ and the bumps are made sharper. Each bump uses two hidden sigmoid neurons, so a network with $2N$ hidden neurons can approximate $f$ to within $\epsilon(N)$ in supremum norm.
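This construction can be carried out directly in code. The sketch below, a toy illustration rather than part of the formal proof, builds the $2N$-neuron sigmoid network just described for an arbitrary continuous target on $[0, 1]$ and reports the grid-estimated supremum-norm error as $N$ grows. The steepness schedule, the example target, and the small adjustment of the outermost step positions are implementation choices, not part of the theorem.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))  # clipped to avoid overflow warnings

def bump_network(f, n_bins):
    """Staircase approximation of f on [0, 1] built from 2*n_bins sigmoid neurons.

    Bump i is a 'step up' at its bin's left edge minus a 'step down' at its right
    edge, scaled by f at the bin centre. The steepness grows with the number of
    bins so each transition stays sharp relative to the bin width.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    heights = f(centres)
    steepness = 50.0 * n_bins

    up_edges = edges[:-1].copy()
    down_edges = edges[1:].copy()
    up_edges[0] = -1.0     # push the first 'step up' left of [0, 1] so it is always on
    down_edges[-1] = 2.0   # push the last 'step down' right of [0, 1] so it never turns off

    def f_hat(x):
        x = np.asarray(x, dtype=float)[..., None]           # broadcast over bins
        bumps = sigmoid(steepness * (x - up_edges)) - sigmoid(steepness * (x - down_edges))
        return np.sum(heights * bumps, axis=-1)

    return f_hat

def target(x):
    return np.sin(2 * np.pi * x) + 0.5 * x                  # an arbitrary continuous target

xs = np.linspace(0.0, 1.0, 5_000)
for n_bins in (10, 50, 250):
    err = np.max(np.abs(target(xs) - bump_network(target, n_bins)(xs)))
    print(f"N = {n_bins:4d} bumps ({2 * n_bins} neurons): sup-norm error ~ {err:.3f}")
```

The printed error shrinks as the number of bumps grows, which is the staircase argument made numerical.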
Higher-dimensional cases work by similar but more elaborate constructions, building higher-dimensional bumps from sums of one-dimensional ridges. The key point is that we can build arbitrarily fine local features out of suitably tuned sigmoids, and any continuous function can be assembled, piece by piece, from enough local features. The full proof handles the technicalities, for instance, that the construction extends to all continuous activations satisfying Cybenko's conditions, not just literal sigmoids, but the bumps-and-staircase picture captures the essential reason the theorem is true.
What the theorem does NOT say
The universal approximation theorem is often quoted in popular accounts of deep learning as if it explained the success of neural networks. It does not. There are at least five important caveats, each of which corresponds to an active area of theoretical research.
(a) No algorithm. The theorem asserts the existence of weights that make $\hat f$ close to $f$, but it does not say how to find them. In particular, it does not say that stochastic gradient descent, given a finite training set drawn from $f$, will converge to those weights. The MLP loss landscape is non-convex with many local minima and saddle points; in general SGD might converge to a poor solution, fail to converge at all, or converge slowly.
(b) Astronomical width. The number $N$ of hidden units required to achieve $\epsilon$ accuracy can grow extremely fast as $\epsilon$ shrinks. For badly behaved $f$, for instance, functions with very rapid oscillations or with most of their variation concentrated in a small region, the required $N$ can be exponential in $1/\epsilon$ or in the input dimension $n$. The theorem allows the network to be a million or a trillion neurons wide; it places no upper bound on the cost of expressiveness.
(c) No generalisation. The theorem is about approximating a known target function $f$. In real machine learning we never see $f$ directly; we see only a finite training set of $(\mathbf{x}_i, y_i)$ pairs sampled from a distribution. A network whose parameters fit the training set perfectly may approximate the underlying $f$ to within $\epsilon$ on the training points and yet fail completely on a held-out test set. Generalisation depends on the data quantity, the network's effective capacity, and the regularisation we apply, none of which the theorem touches.
(d) Width alone is not optimal. Even though one hidden layer suffices in principle, the theorem does not say one hidden layer is the best architecture, and indeed we shall see in the next subsection that depth is exponentially more efficient for many natural function classes. Modern deep networks are deep not because depth enlarges the expressible family beyond what UAT covers (it does not), but because depth makes useful subclasses cheap to represent.
(e) Compactness of the domain. The theorem gives uniform approximation only on a compact set $K$, and it says nothing about behaviour outside $K$. A network trained to approximate $f$ on $[-10, 10]$ may behave wildly on inputs of magnitude $10^6$. This is one face of the well-known problem of out-of-distribution generalisation: networks have no contractual obligation to behave sensibly outside the input region they were trained on.
Why width alone is not enough: the depth–width trade-off
The universal approximation theorem treats one hidden layer as sufficient, but it does not treat one hidden layer as efficient. A different line of research, the depth separation results of the 2010s, shows that, for many natural function classes, depth gives exponential gains over width.
Telgarsky (2016) constructed a family of one-dimensional functions, computable by a ReLU network of depth $\mathcal{O}(L)$ with only $\mathcal{O}(L)$ neurons in total, that any depth-2 network requires $\Omega(2^L)$ neurons to approximate to a fixed accuracy. The construction uses repeated composition of a triangular sawtooth function: each additional layer roughly doubles the number of oscillations, so the depth-$L$ network produces a function with about $2^L$ oscillations using only linear-in-$L$ hidden units. To match this with a single hidden layer, every oscillation must be built individually, requiring an exponential number of neurons. The functions in Telgarsky's family are not contrived: they are the natural outputs of a hierarchical computation, the kind of pattern that arises whenever a problem has compositional structure.
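The doubling mechanism is easy to demonstrate. The "tent" map $\Delta(x) = 2x$ for $x \le 1/2$ and $2 - 2x$ for $x \ge 1/2$ can be written with two ReLU units on $[0, 1]$, and composing it with itself $L$ times produces a sawtooth with $2^L$ monotone pieces. The sketch below is a simplified illustration of this mechanism rather than Telgarsky's exact statement; it counts level crossings to show that the oscillation count doubles with each added layer while the neuron count grows only linearly.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def tent(x):
    """One 'tent' layer on [0, 1], built from two ReLU units:
    tent(x) = 2*relu(x) - 4*relu(x - 0.5), i.e. 2x for x <= 1/2 and 2 - 2x for x >= 1/2."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def deep_sawtooth(x, depth):
    """Compose the tent layer `depth` times: a ReLU network with 2*depth hidden units."""
    for _ in range(depth):
        x = tent(x)
    return x

# Each extra layer doubles the number of oscillations (counted here as crossings
# of the level y = 0.5), while the number of hidden units grows only linearly.
for depth in range(1, 9):
    xs = np.linspace(0.0, 1.0, 200 * 2 ** depth + 1)
    ys = deep_sawtooth(xs, depth)
    crossings = np.count_nonzero(np.diff(np.signbit(ys - 0.5)))
    print(f"depth {depth:2d}: {2 * depth:2d} hidden ReLUs, {crossings} crossings of y = 0.5")
```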
Eldan and Shamir (2016) gave a separation between three-layer and two-layer networks in higher dimensions. They exhibited a function on $\mathbb{R}^n$ that a three-layer network can compute with polynomially-many neurons in $n$, but that any two-layer network requires exponentially many neurons in $n$ to approximate. Their construction uses a radial function whose dependence on the input norm is hard to capture with a single hidden layer's worth of ridge functions but easy to capture once a second hidden layer is added.
The implication for practice is significant. Universal approximation says expressiveness is not the bottleneck: even one hidden layer is enough in principle. Depth separation says that hierarchical composition is exponentially more efficient than a single wide layer for functions with hierarchical structure. Empirically, the functions we want to learn, natural images, language, audio, scientific data, appear to have exactly such hierarchical structure, and so depth wins. The modern preference for very deep, moderately wide networks is justified not by an enlargement of the function class, which depth does not provide beyond what UAT already gives, but by a dramatic reduction in the number of parameters needed to represent useful functions.
Approximation, optimisation, and generalisation as three separate problems
Universal approximation is one of three sub-problems that successful machine learning must solve. Disentangling them helps clarify what UAT does and does not contribute.
(i) Approximation. Does there exist a network in our chosen function class that closely models the target $f$? UAT answers yes for any continuous $f$ on a compact domain, given enough hidden units. Depth separation refines the answer by showing that depth radically reduces the number of units needed for many natural $f$. This is the part of the problem where the theorem applies.
(ii) Optimisation. Given the architecture, can we find the parameters of a good network using SGD or a related algorithm? This is a question about the geometry of the loss landscape, the choice of learning rate and initialisation, and the dynamics of gradient flow on a non-convex surface. It depends on phenomena UAT does not address: implicit regularisation, the role of overparameterisation, the linear-mode connectivity of solutions, and many others. A network that exists in principle is useless if SGD cannot find it.
(iii) Generalisation. The network we find on a training set must continue to work well on new data drawn from the same distribution. This depends on the data quantity, the network's effective capacity, and explicit and implicit regularisation. Surprisingly, very large modern networks generalise well even though the classical theory predicts they should overfit; this phenomenon, often attributed to implicit regularisation, is a current research frontier.
UAT is purely a statement about (i). The rest of the chapter develops the tools needed for (ii): backpropagation, initialisation strategies, normalisation, and optimisers. Generalisation, problem (iii), is treated more thoroughly in Chapter 10. A common confusion is to read UAT as a complete explanation of why neural networks work; it is not. It is the theoretical guarantee that the search is not futile, no more.
What you should take away
- Universal approximation says one-hidden-layer MLPs are universal function approximators on compact sets: for any continuous $f$ and any tolerance $\epsilon$, a sufficiently wide network exists that approximates $f$ to within $\epsilon$ in supremum norm.
- The price of this universality is potentially astronomical width: $N$ may have to grow exponentially in $1/\epsilon$ or in input dimension for badly behaved $f$.
- The theorem is purely about existence; it does not give an algorithm for finding the right weights, nor any guarantee that gradient descent will find them.
- Depth gives exponential efficiency improvements over width-only construction, even though it does not enlarge the expressible function class beyond what UAT already covers.
- Approximation, optimisation, and generalisation are three different problems; UAT addresses only the first.
- UAT is the theoretical justification for using neural networks at all, but it does not, by itself, explain why deep learning works in practice.
- The empirical performance of modern networks is far better than UAT alone would predict, suggesting that modern training methods, architectures, and data exploit structure that the theorem does not characterise, a structure whose nature is one of the central open questions in deep learning theory.