An activation function is the elementwise nonlinear transformation applied to the weighted sum at each neuron in an artificial neural network. Without nonlinearity, a multilayer network would collapse into a single linear transformation: the composition of linear maps is itself linear, so depth would add no expressive power. The choice of activation affects training dynamics, representational capacity and, ultimately, generalisation.
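The collapse argument can be checked numerically. The following is a minimal NumPy sketch; the layer shapes, the random seed and the function names are illustrative assumptions, not part of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation between them (shapes chosen arbitrarily).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)


def two_linear_layers(x):
    """Stack two affine maps with nothing nonlinear in between."""
    return W2 @ (W1 @ x + b1) + b2


# The same function written as ONE affine map: W = W2 W1, b = W2 b1 + b2.
W, b = W2 @ W1, W2 @ b1 + b2


def one_linear_layer(x):
    return W @ x + b


def two_layers_with_relu(x):
    """Inserting ReLU between the layers breaks the algebraic collapse."""
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2


x = rng.normal(size=3)
collapses = np.allclose(two_linear_layers(x), one_linear_layer(x))
print("two linear layers equal one linear layer:", collapses)
```

With the nonlinearity in place (`two_layers_with_relu`), no single pair `(W, b)` reproduces the network on all inputs, which is where depth starts to add expressive power.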
For a neuron with input vector $\mathbf{x}$, weights $\mathbf{w}$ and bias $b$, the output is $a = \phi(\mathbf{w}^\top \mathbf{x} + b)$, where $\phi$ is the activation. The universal approximation theorem (Cybenko 1989; Hornik 1991) shows that a feedforward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, provided it has enough hidden units. The result was first proved for sigmoidal activations and later extended to any non-polynomial activation (Leshno et al. 1993).
Classical activations
The logistic sigmoid $\sigma(x) = 1 / (1 + e^{-x})$ squashes inputs into $(0, 1)$ and was the dominant activation in the 1980s and 1990s. Its derivative $\sigma'(x) = \sigma(x)(1-\sigma(x))$ peaks at $0.25$ and decays exponentially for large $|x|$. Because backpropagation multiplies one such factor per layer, stacked sigmoids drive gradients toward zero and stall learning in deep networks: the vanishing gradient problem. The hyperbolic tangent $\tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x})$ is a rescaled and shifted sigmoid, $\tanh(x) = 2\sigma(2x) - 1$, with range $(-1, 1)$. Being zero-centred, it optimises slightly better, but it suffers the same saturation pathology.
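The saturation behaviour is easy to see numerically. The sketch below assumes NumPy and an illustrative depth of 20 layers. It looks only at the activation-derivative factors and ignores the weight matrices, which also enter the backpropagated product, so the bound is a simplification.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


peak = sigmoid_grad(0.0)    # the derivative is largest at x = 0
tail = sigmoid_grad(10.0)   # deep in the saturated regime

# Best case for the product of activation derivatives across `depth`
# sigmoid layers: every unit sits exactly at the peak of the derivative.
depth = 20                  # illustrative assumption
best_case_factor = peak ** depth

# tanh is a shifted, rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
xs = np.linspace(-5.0, 5.0, 11)
tanh_matches = np.allclose(np.tanh(xs), 2.0 * sigmoid(2.0 * xs) - 1.0)

print(peak, tail, best_case_factor, tanh_matches)
```

Even in this best case, the factor contributed by the activations alone shrinks geometrically with depth.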
ReLU and its descendants
The Rectified Linear Unit $\mathrm{ReLU}(x) = \max(0, x)$ was popularised by Nair and Hinton (2010) and Glorot, Bordes and Bengio (2011). It is computationally trivial, does not saturate for positive inputs, and induces sparsity (about half of the units are inactive at random initialisation). It dramatically improved the training of deep networks and is one of the main reasons the deep-learning revolution became practical.
Its principal drawback is the dying ReLU problem: if a unit's weights are updated such that its pre-activation is always negative, its gradient is zero and it never recovers. Variants address this:
$$\mathrm{LeakyReLU}(x) = \max(\alpha x, x), \quad \alpha \approx 0.01$$
$$\mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \leq 0 \end{cases}$$
$$\mathrm{GELU}(x) = x \, \Phi(x)$$
where $\Phi$ is the standard Gaussian cumulative distribution function. GELU (Hendrycks and Gimpel, 2016) and the closely related Swish $x \sigma(\beta x)$ (Ramachandran et al., 2017) are smooth and non-monotonic, and have become the default in transformer architectures: BERT and GPT-3 use GELU, while Llama uses SwiGLU, a gated variant built on Swish. SELU (Klambauer et al., 2017), a scaled ELU with fixed constants, is self-normalising: it pushes activations toward zero mean and unit variance as they propagate through the layers.
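The formulas above translate directly into a few lines of NumPy. This is a minimal sketch, not any library's implementation: the defaults `alpha=1.0` for ELU and `beta=1.0` for Swish are assumptions (the text does not fix them), and GELU is computed exactly through the error function, using $\Phi(x) = \tfrac{1}{2}\left(1 + \mathrm{erf}(x/\sqrt{2})\right)$.

```python
import math

import numpy as np

# Elementwise error function built from the standard library.
_erf = np.vectorize(math.erf)


def relu(x):
    return np.maximum(0.0, np.asarray(x, dtype=float))


def leaky_relu(x, alpha=0.01):
    x = np.asarray(x, dtype=float)
    return np.maximum(alpha * x, x)      # valid for 0 < alpha < 1


def elu(x, alpha=1.0):                   # alpha=1.0 is an assumed default
    x = np.asarray(x, dtype=float)
    # Clamp before expm1 so the unused branch cannot overflow for large x.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


def gelu(x):
    x = np.asarray(x, dtype=float)
    return x * 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))   # x * Phi(x)


def swish(x, beta=1.0):                  # beta=1.0 is an assumed default
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-beta * x))                # x * sigmoid(beta x)


xs = np.array([-3.0, -1.0, 0.0, 2.0])
for fn in (relu, leaky_relu, elu, gelu, swish):
    print(f"{fn.__name__:>10}: {fn(xs)}")
```

Comparing the outputs at $x = -3$ and $x = -1$ shows the non-monotonic dip of GELU and Swish: both are more negative at $-1$ than at $-3$, whereas ReLU is flat at zero over the whole negative half-line.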
Output activations
For output layers, the activation is determined by the task: sigmoid for binary classification (giving a probability), softmax for multi-class classification (giving a categorical distribution), and identity (no activation) for regression. The softmax $\mathrm{softmax}(\mathbf{z})_i = e^{z_i} / \sum_j e^{z_j}$ is the multivariate generalisation of the sigmoid and pairs naturally with cross-entropy loss, because the gradient of the loss with respect to the logits reduces to the predicted probabilities minus the one-hot target.
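A short NumPy sketch illustrates both properties of the softmax. The max-subtraction trick for numerical stability, the example logits and the helper names are assumptions introduced here for illustration. The code checks that a two-class softmax with one logit fixed at zero reduces to the sigmoid, and that the cross-entropy gradient with respect to the logits equals the predicted probabilities minus the one-hot target.

```python
import numpy as np


def softmax(z):
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # softmax is unchanged by adding a constant
    e = np.exp(shifted)
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cross_entropy(z, target):
    """Negative log-probability of class index `target`, from logits z."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)
    return -(shifted[target] - np.log(np.exp(shifted).sum()))


logits = np.array([2.0, 1.0, 0.1])   # example values, chosen arbitrarily
target = 0
probs = softmax(logits)

# Analytic gradient of the loss with respect to the logits: p - onehot(target).
analytic_grad = probs.copy()
analytic_grad[target] -= 1.0

# Central finite differences as an independent check.
eps = 1e-6
numeric_grad = np.array([
    (cross_entropy(logits + eps * np.eye(3)[i], target)
     - cross_entropy(logits - eps * np.eye(3)[i], target)) / (2 * eps)
    for i in range(3)
])

print(probs, analytic_grad, numeric_grad)
```

Subtracting the maximum logit before exponentiating leaves the result unchanged but avoids overflow for large logits.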
Modern practice combines layer normalisation, careful initialisation (He, Xavier) and smooth activations such as GELU and Swish to train networks hundreds of layers deep without vanishing or exploding gradients.
Related terms: ReLU, Sigmoid Function, Softmax, Vanishing Gradient, Backpropagation, Universal Approximation Theorem
Discussed in:
- Chapter 6: ML Fundamentals, Neural Networks