An Activation Function is a nonlinear function applied to the weighted sum of a neuron's inputs. Without nonlinearity, a multilayer network would collapse into a single linear transformation—the composition of linear maps is itself linear—and depth would add no expressive power. The choice of activation function profoundly affects training dynamics, representational capacity, and ultimate performance.
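The collapse of stacked linear layers can be checked directly. The sketch below (weights and shapes are illustrative, not from any particular network) shows that two linear layers applied in sequence equal one linear layer whose weight is the product of the two:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative weights for a two-layer network with NO activation function.
W1 = rng.standard_normal((4, 3))   # first layer: 3 inputs -> 4 hidden units
W2 = rng.standard_normal((2, 4))   # second layer: 4 hidden -> 2 outputs
x = rng.standard_normal(3)

# Composing the two linear layers...
deep = W2 @ (W1 @ x)

# ...is identical to a single linear map with weight matrix W2 @ W1.
shallow = (W2 @ W1) @ x

assert np.allclose(deep, shallow)  # depth added no expressive power
```

Inserting any nonlinearity between the two matrix multiplications breaks this identity, which is exactly why activation functions are needed.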
The sigmoid $\sigma(x) = 1 / (1 + e^{-x})$ and tanh functions dominated early neural networks but suffer from the vanishing gradient problem: their derivatives are near zero for large-magnitude inputs, so gradients shrink multiplicatively through depth and saturated units stop learning. The ReLU (Rectified Linear Unit), $\text{ReLU}(x) = \max(0, x)$, transformed deep learning when popularised in 2010. It is computationally trivial, does not saturate for positive inputs, induces sparsity, and dramatically improved training of deep networks. Its principal drawback is the dying ReLU problem: a unit whose pre-activation becomes permanently negative outputs zero, receives zero gradient, and never recovers.
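The contrast in gradient behaviour is easy to see numerically. A minimal NumPy sketch (function names are my own) using the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); peaks at 0.25 when x == 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0.0, x)

# The sigmoid gradient vanishes for large |x|: already tiny at x = 5,
# and negligible at x = 10. ReLU's gradient stays exactly 1 for any
# positive input, however large, so deep positive paths keep learning.
grads = sigmoid_grad(np.array([0.0, 5.0, 10.0]))
```

Evaluating `grads` shows the decay from 0.25 at the origin to well under 0.01 at $x = 5$; multiplied across many layers, such factors drive the overall gradient toward zero.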
Variants include Leaky ReLU (small non-zero slope for negative inputs), Parametric ReLU (learnable slope), ELU (smooth negative tail), SELU (self-normalising), GELU (smooth, stochastic gating), and Swish ($x \sigma(x)$). GELU has become the default in transformer architectures. For output layers, the activation is determined by the task: sigmoid for binary classification, softmax for multi-class, identity for regression.
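The variants and output-layer activations above are all one-liners. A sketch of a few of them in NumPy (using the common tanh approximation for GELU rather than the exact Gaussian-CDF form; parameter defaults are conventional choices, not prescribed by the text):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small non-zero slope alpha for negative inputs avoids dead units.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth exponential tail for negative inputs, saturating at -alpha.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # Widely used tanh approximation to x * Phi(x).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def swish(x):
    # x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def softmax(z):
    # Output activation for multi-class classification; subtracting the
    # max is the standard trick for numerical stability.
    e = np.exp(z - z.max())
    return e / e.sum()
```

For example, `softmax(np.array([1.0, 2.0, 3.0]))` returns a probability vector summing to 1, and `leaky_relu(-1.0)` returns `-0.01` instead of ReLU's hard zero.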
Related terms: ReLU, Softmax, Neural Network
Discussed in:
- Chapter 9: Neural Networks — Activation Functions
Also defined in: Textbook of AI, Textbook of Medical AI