Also known as: rectified linear unit, rectifier
The rectified linear unit is the activation function
$$\mathrm{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$
with derivative
$$\mathrm{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$
(the derivative at $x=0$ is undefined; in practice frameworks use 0).
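As a minimal sketch of the two definitions above, the function and its derivative can be written in plain Python. The names `relu` and `relu_grad` are illustrative rather than taken from any library, and the value at zero follows the convention noted above.

```python
def relu(x: float) -> float:
    """Return max(0, x)."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU, using the convention ReLU'(0) = 0."""
    return 1.0 if x > 0 else 0.0


# Illustrative inputs on both sides of zero, plus zero itself.
inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]
```

Negative inputs map to zero with zero gradient, while positive inputs pass through unchanged with gradient 1.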
Introduced for neural networks by Hahnloser et al. (2000) and popularised in deep learning by Nair and Hinton (2010), ReLU has been the default hidden-layer activation since the 2012 AlexNet result. It has three main advantages over sigmoid and tanh: a non-saturating positive regime (the gradient is exactly 1 for every positive input, which mitigates vanishing gradients in deep networks); computational simplicity (a single comparison with zero, with no exponentials); and sparse activation (when pre-activations are roughly zero-centred, about half of the units output zero on any given input, which encourages efficient representations).
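The saturation and sparsity claims can be checked numerically. The sketch below is illustrative only: it assumes pre-activations drawn from a standard normal distribution (a stand-in for a freshly initialised layer, not a property of any particular network), and the input value 10 is an arbitrary choice of "large positive".

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


# Saturation: at a large positive input the sigmoid gradient is tiny,
# while the ReLU gradient stays at 1.
x_large = 10.0
grad_sigmoid = sigmoid_grad(x_large)
grad_relu = relu_grad(x_large)

# Sparsity: with zero-mean pre-activations (an assumption), about half
# of the ReLU outputs are exactly zero.
rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
zero_fraction = sum(1 for z in pre_activations if max(0.0, z) == 0.0) / len(pre_activations)
```

The half-zero figure depends on the symmetry assumption; a layer with a strong positive or negative bias would show a different fraction.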
The main pathology is the dying ReLU problem: a unit whose pre-activation becomes negative for every input produces zero output and zero gradient, so its weights stop updating and it may never recover. Variants address this:
$$\mathrm{LeakyReLU}(x) = \max(\alpha x, x), \quad \alpha \approx 0.01$$
$$\mathrm{PReLU}(x) = \max(\alpha x, x), \quad \alpha \text{ learned}$$
$$\mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha (e^x - 1) & x \leq 0 \end{cases}$$
$$\mathrm{GELU}(x) = x \cdot \Phi(x) \approx \tfrac{1}{2} x \bigl(1 + \tanh\bigl(\sqrt{2/\pi}(x + 0.044715 x^3)\bigr)\bigr)$$
where $\Phi$ is the standard normal CDF. GELU, introduced by Hendrycks and Gimpel (2016), is the standard activation in Transformers such as BERT and the GPT family; the original T5 used plain ReLU. More recent models, including the LLaMA family and Mistral, use SwiGLU, a gated variant built on SiLU/Swish, $x \cdot \sigma(x)$.
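The variant formulas can be sketched as scalar Python functions. The function names are illustrative, `alpha=1.0` for ELU is an assumed default that the formulas above do not fix, and the grid over $[-4, 4]$ used to compare the exact and tanh forms of GELU is an arbitrary choice.

```python
import math


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # Equivalent to max(alpha * x, x) when 0 < alpha < 1.
    return x if x > 0 else alpha * x


def leaky_relu_grad(x: float, alpha: float = 0.01) -> float:
    # Non-zero slope for negative inputs, so the unit can still learn.
    return 1.0 if x > 0 else alpha


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gelu_exact(x: float) -> float:
    # x * Phi(x), with Phi expressed through the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def silu(x: float) -> float:
    # x * sigmoid(x); written for moderate inputs, not hardened against overflow.
    return x / (1.0 + math.exp(-x))


# Largest gap between the exact and tanh forms of GELU on a coarse grid.
grid = [i / 10.0 for i in range(-40, 41)]
max_gelu_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
```

PReLU has the same forward form as `leaky_relu`, with `alpha` treated as a trainable parameter rather than a fixed constant.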
Related terms: Sigmoid Function, Tanh, Activation Function, Vanishing Gradient Problem
Discussed in:
- Chapter 9: Neural Networks, Activation Functions