Glossary

ReLU

The Rectified Linear Unit (ReLU), defined as $\text{ReLU}(x) = \max(0, x)$, is the dominant activation function in modern deep learning. It outputs the input unchanged if positive, and zero otherwise. Despite its simplicity—or rather because of it—ReLU transformed deep learning when it was popularised by Nair and Hinton in 2010, enabling the training of much deeper networks than had been feasible with sigmoid or tanh activations.
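The definition is a one-liner in code. A minimal sketch using NumPy (the function name is illustrative):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied elementwise
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
y = relu(x)
# negative inputs map to 0; positive inputs pass through unchanged
```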

ReLU has several advantages. It is computationally trivial: a single threshold comparison, with gradient exactly 1 for positive inputs and exactly 0 for negative inputs (the derivative at zero is undefined; implementations conventionally pick 0 or 1 there). It does not saturate for positive inputs, avoiding the vanishing gradient problem that cripples sigmoid-based networks. It induces sparsity: units with negative pre-activations output exactly zero, so at any given input a typical trained network has many inactive units. And it works well in practice across a wide range of architectures and tasks.
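The gradient and sparsity properties can be illustrated directly. In this sketch (NumPy, with illustrative names), the pre-activations are drawn from a standard normal, so roughly half the units land in the zero region:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # gradient is 1 where x > 0 and 0 where x < 0;
    # the undefined point at x == 0 is assigned 0 here by convention
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
pre = rng.standard_normal(1000)   # hypothetical pre-activations for 1000 units
act = relu(pre)
sparsity = np.mean(act == 0.0)    # fraction of inactive units, ~0.5 for N(0, 1) inputs
```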

The principal drawback is the dying ReLU problem: if a unit's bias shifts sufficiently negative, it outputs zero for every input in the training set, and since the gradient is also zero, it never recovers. Variants address this: Leaky ReLU allows a small negative slope, Parametric ReLU learns the slope, ELU uses a smooth exponential curve, and GELU applies a smooth stochastic gating. But vanilla ReLU remains the default for many convolutional networks, and its descendants (especially GELU) dominate transformer architectures.
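The variants above differ only in how they treat negative inputs. A sketch of the three (NumPy; GELU is shown via its common tanh approximation rather than the exact normal-CDF form):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small fixed negative slope instead of a hard zero
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # smooth exponential curve for negative inputs, saturating at -alpha
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def gelu(x):
    # tanh approximation of GELU(x) = x * Phi(x),
    # where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

neg = np.array([-2.0, -0.5])
# unlike vanilla ReLU, all three give negative inputs a nonzero output,
# and hence a nonzero gradient, so units cannot get stuck at zero forever
outputs = leaky_relu(neg), elu(neg), gelu(neg)
```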

Related terms: Activation Function, Vanishing Gradient

Also defined in: Textbook of AI