Activation Function, Glossary, Textbook of AI

An activation function is the elementwise nonlinear transformation applied to the weighted sum at each neuron in an artificial neural network. Without nonlinearity, a multilayer network would collapse into a single linear (strictly, affine) transformation: the composition of linear maps is itself linear, so depth would add no expressive power. The choice of activation affects training dynamics, representational capacity and generalisation.
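The collapse argument can be checked numerically. The sketch below stacks two layers with no activation and compares them with a single layer whose weights are $W_2 W_1$ and whose bias is $W_2 \mathbf{b}_1 + \mathbf{b}_2$. The helper names (`matvec`, `matmul`, `affine`) and the numbers are illustrative assumptions, not part of the text.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def affine(W, b, x):
    """One layer with no activation: W x + b."""
    return [y + b_i for y, b_i in zip(matvec(W, x), b)]

# Two stacked layers without a nonlinearity (arbitrary example values).
W1, b1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]], [0.1, 0.2, 0.3]  # 2 -> 3 units
W2, b2 = [[1.0, -1.0, 2.0], [0.0, 4.0, 1.0]], [-0.5, 0.5]        # 3 -> 2 units

# The equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
W = matmul(W2, W1)
b = [y + b_i for y, b_i in zip(matvec(W2, b1), b2)]

x = [0.7, -1.3]
two_layers = affine(W2, b2, affine(W1, b1, x))
one_layer = affine(W, b, x)
```

Both computations give the same output for every input, so the second layer adds nothing that a single layer could not already represent.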

For a neuron with input vector $\mathbf{x}$, weights $\mathbf{w}$ and bias $b$, the output is $a = \phi(\mathbf{w}^\top \mathbf{x} + b)$, where $\phi$ is the activation. The universal approximation theorem (Cybenko 1989; Hornik 1991), later extended from sigmoidal activations to any continuous non-polynomial activation (Leshno et al. 1993), shows that a feedforward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, provided it has enough hidden units.
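As a minimal sketch of the formula $a = \phi(\mathbf{w}^\top \mathbf{x} + b)$, the function below computes the output of one neuron for an arbitrary activation. The name `neuron` and the example values are assumptions made for illustration.

```python
import math

def neuron(x, w, b, phi):
    """Return phi(w . x + b) for a single neuron."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b  # pre-activation
    return phi(z)

x = [1.0, -2.0, 0.5]   # inputs
w = [0.4, 0.1, -0.6]   # weights
b = 0.6                # bias; the pre-activation is about 0.5

a_tanh = neuron(x, w, b, math.tanh)
a_relu = neuron(x, w, b, lambda z: max(0.0, z))
a_identity = neuron(x, w, b, lambda z: z)
```

Only `phi` changes between the three calls; the weighted sum is shared.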

Classical activations

The logistic sigmoid $\sigma(x) = 1 / (1 + e^{-x})$ squashes inputs into $(0, 1)$ and was the dominant activation in the 1980s and 1990s. Its derivative $\sigma'(x) = \sigma(x)(1-\sigma(x))$ peaks at $0.25$ and decays exponentially for large $|x|$, producing the vanishing gradient problem: backpropagation multiplies one such factor per layer, so stacked sigmoids drive gradients toward zero and stall learning in deep networks. The hyperbolic tangent $\tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})$ has the same shape but is zero-centred with range $(-1, 1)$, which gives slightly better optimisation; it still suffers the same saturation pathology.
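The numbers behind the vanishing gradient can be reproduced in a few lines. This sketch evaluates $\sigma'(x)$ at its peak and in the saturated regime, then computes the best-case product of ten such factors. It ignores the weight matrices, which is a simplifying assumption; the function names are illustrative.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

peak = sigmoid_grad(0.0)        # the maximum of the derivative, 0.25
saturated = sigmoid_grad(10.0)  # roughly e^{-10} in the saturated regime

# Best case for the product of activation derivatives across 10 sigmoid
# layers (weights ignored): 0.25 ** 10, which is below 1e-6.
bound_after_10 = peak ** 10

tanh_peak = tanh_grad(0.0)        # tanh has a larger peak derivative, 1.0
tanh_saturated = tanh_grad(10.0)  # but it saturates in the same way
```

Even when every unit sits at the most favourable point of the sigmoid, ten layers shrink the gradient by about six orders of magnitude.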

ReLU and its descendants

The Rectified Linear Unit $\mathrm{ReLU}(x) = \max(0, x)$ was popularised by Nair and Hinton (2010) and Glorot, Bordes and Bengio (2011). It is computationally trivial, does not saturate for positive inputs, induces sparsity (about half of units are inactive at random initialisation), and dramatically improved training of deep networks. It is largely responsible for the practicality of the deep-learning revolution.
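The sparsity claim is easy to check empirically. The sketch below draws zero-mean Gaussian weights with zero bias (an assumed initialisation, chosen because it makes each pre-activation symmetric about zero) and counts how many ReLU units output exactly zero for one random input.

```python
import random

def relu(x):
    return max(0.0, x)

rng = random.Random(0)        # fixed seed so the run is repeatable
n_inputs, n_units = 64, 2000

x = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]

inactive = 0
for _ in range(n_units):
    # Zero-mean weights and zero bias: the pre-activation is equally
    # likely to be positive or negative.
    w = [rng.gauss(0.0, 1.0) for _ in range(n_inputs)]
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))
    if relu(z) == 0.0:
        inactive += 1

fraction_inactive = inactive / n_units  # expected to be close to 0.5
```

A non-zero bias or a non-symmetric weight distribution would shift this fraction away from one half.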

Its principal drawback is the dying ReLU problem: if a unit's weights are updated such that its pre-activation is always negative, its gradient is zero and it never recovers. Variants address this:

$$\mathrm{LeakyReLU}(x) = \max(\alpha x, x), \quad \alpha \approx 0.01$$

$$\mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$

$$\mathrm{GELU}(x) = x \, \Phi(x)$$

where $\Phi$ is the Gaussian cumulative distribution function and $\alpha$ in ELU is a positive hyperparameter, commonly set to $1$. GELU (Hendrycks and Gimpel, 2016) and the closely related Swish $x \sigma(\beta x)$ (Ramachandran et al., 2017) are smooth and non-monotonic, and have become the default in transformer architectures: BERT and GPT-3 use GELU, while Llama uses SwiGLU, a gated variant built on Swish. SELU (Klambauer et al., 2017) is self-normalising: with carefully chosen scaling it drives activations toward zero mean and unit variance through the layers.
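The variants above translate directly into code. This is a minimal scalar sketch using only the standard library; the default `alpha=1.0` for ELU and `beta=1.0` for Swish are assumed common choices rather than values given in the text, and GELU uses the exact form $\Phi(x) = \tfrac{1}{2}\left(1 + \mathrm{erf}(x/\sqrt{2})\right)$.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def leaky_relu(x, alpha=0.01):
    """max(alpha * x, x): a small slope alpha for negative inputs."""
    return max(alpha * x, x)

def elu(x, alpha=1.0):
    """x for x > 0, alpha * (e^x - 1) otherwise (alpha=1.0 is an assumed default)."""
    return x if x > 0 else alpha * math.expm1(x)

def gelu(x):
    """x * Phi(x), with Phi the standard normal CDF written via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x, beta=1.0):
    """x * sigmoid(beta * x) (beta=1.0 is an assumed default)."""
    return x * sigmoid(beta * x)

# Non-monotonicity: both functions dip below zero and then rise back
# toward zero as x becomes more negative.
gelu_dip = gelu(-1.0) < gelu(-3.0) < 0.0
swish_dip = swish(-1.0) < swish(-5.0) < 0.0
```

Unlike ReLU, all four functions have a non-zero gradient for negative inputs, which is what lets a unit recover from a negative pre-activation.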

Output activations

For output layers, the activation is determined by the task: sigmoid for binary classification (giving a probability), softmax for multi-class classification (giving a categorical distribution), and identity (no activation) for regression. The softmax $\mathrm{softmax}(\mathbf{z})_i = e^{z_i} / \sum_j e^{z_j}$ is the multivariate generalisation of the sigmoid and pairs naturally with cross-entropy loss.
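Two points in this paragraph can be sketched in code: softmax over the two logits $(z, 0)$ reduces to $\sigma(z)$, and cross-entropy can be computed from the logits without forming the probabilities. Subtracting the maximum logit before exponentiating is a standard stability trick that leaves the result unchanged; the function names are illustrative assumptions.

```python
import math

def softmax(z):
    """Softmax over a list of logits, shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(z_i - m) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def cross_entropy(z, target):
    """-log softmax(z)[target], computed via log-sum-exp on the logits."""
    m = max(z)
    log_sum = m + math.log(sum(math.exp(z_i - m) for z_i in z))
    return log_sum - z[target]

# Softmax over the two logits (z, 0) is the sigmoid of z.
z = 1.7
p_two_class = softmax([z, 0.0])[0]

# Large logits do not overflow because of the max shift.
probs = softmax([1000.0, 1001.0, 1002.0])
loss = cross_entropy([2.0, 1.0, 0.1], target=0)
```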

Modern practice combines layer normalisation, residual connections, careful initialisation (He, Xavier) and smooth activations such as GELU to train networks hundreds of layers deep without vanishing or exploding gradients.
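To illustrate why the initialisation scale matters, the sketch below pushes one random input through a stack of ReLU layers and reports the mean squared activation at the end. It compares He scaling (weight variance $2/n$) with Xavier scaling, which for equal fan-in and fan-out $n$ gives variance $1/n$. The function name, width and depth are assumptions, and the network has no biases, normalisation or training.

```python
import math
import random

def forward_mean_square(depth, width, weight_std, seed=0):
    """Mean squared activation after `depth` ReLU layers whose weights
    are drawn from N(0, weight_std^2). Biases are omitted."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        a = [max(0.0, sum(rng.gauss(0.0, weight_std) * a_j for a_j in a))
             for _ in range(width)]
    return sum(v * v for v in a) / width

width, depth = 64, 20
he_std = math.sqrt(2.0 / width)      # He: variance 2 / fan_in
xavier_std = math.sqrt(1.0 / width)  # Xavier with fan_in == fan_out

he_ms = forward_mean_square(depth, width, he_std)
xavier_ms = forward_mean_square(depth, width, xavier_std)
```

Because ReLU zeroes about half of the units, the Xavier scale loses roughly a factor of two in mean square per layer, so after 20 layers `xavier_ms` is about $2^{20}$ times smaller than `he_ms`. The same seed is used for both runs so that only the scale differs.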

Interactive figure (not reproduced here): sigmoid, tanh, ReLU and GELU plotted side by side. Each maps a real number through a nonlinear curve; the choice shapes how gradients flow.

Discussed in:

Chapter 6: ML Fundamentals, Neural Networks
