Glossary

ReLU

Also known as: rectified linear unit, rectifier

The rectified linear unit is the activation function

$$\mathrm{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

with derivative

$$\mathrm{ReLU}'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \end{cases}$$

(the derivative at $x=0$ is undefined; in practice frameworks use 0).
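The two definitions above can be sketched in plain Python. This is a minimal illustration rather than any framework's implementation; the names `relu` and `relu_grad` are hypothetical, and returning 0 at $x=0$ follows the convention mentioned above.

```python
def relu(x: float) -> float:
    """ReLU(x) = max(0, x)."""
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    """Derivative of ReLU. At x == 0 it is undefined mathematically;
    this sketch returns 0 there, matching the common convention."""
    return 1.0 if x > 0 else 0.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]
grads = [relu_grad(x) for x in inputs]
```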

Introduced for neural networks by Hahnloser et al. (2000) and popularised in deep learning by Nair and Hinton (2010), ReLU has been the default hidden-layer activation since the 2012 AlexNet result. It has three main advantages over sigmoid and tanh: (1) a non-saturating positive regime: the derivative is exactly 1 for every positive input, so gradients are not shrunk as they pass back through active units, which mitigates the vanishing gradient problem in deep networks; (2) computational simplicity: a single comparison with zero, with no exponentials to evaluate; (3) sparse activation: with typical zero-centred initialisation, roughly half of the units output zero on a given input, which encourages sparse representations.
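The saturation and sparsity claims can be checked numerically. The sketch below is illustrative only: it compares local derivatives at a large positive input, ignores weight matrices when multiplying derivatives across layers, and assumes standard-normal pre-activations for the sparsity estimate. All names, the depth of 20 and the sample size are assumptions made for the example.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x: float) -> float:
    return 1.0 - math.tanh(x) ** 2


def relu(x: float) -> float:
    return x if x > 0 else 0.0


def relu_grad(x: float) -> float:
    return 1.0 if x > 0 else 0.0


# Local derivatives at a large positive input: sigmoid and tanh saturate,
# ReLU does not.
x = 10.0
local_grads = {
    "sigmoid": sigmoid_grad(x),
    "tanh": tanh_grad(x),
    "relu": relu_grad(x),
}

# Product of activation derivatives over `depth` layers (weights ignored).
# 0.25 is the largest value sigmoid's derivative can take (at x = 0),
# so this is the best case for sigmoid.
depth = 20
sigmoid_chain_bound = 0.25 ** depth
relu_chain = relu_grad(x) ** depth

# Sparsity: fraction of zero outputs for zero-centred pre-activations.
random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]
zero_fraction = sum(1 for z in pre_activations if relu(z) == 0.0) / len(pre_activations)
```

In a real network the weights also scale the gradient, so this isolates only the contribution of the activation function.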

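The main pathology is the dying ReLU problem: a unit whose pre-activation becomes negative for every input produces zero output and zero gradient, so its weights stop updating and it may never recover. Several variants address this by keeping a nonzero gradient for negative inputs: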
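A toy illustration of a dead unit, under stated assumptions: a hypothetical one-unit model $\hat{y} = f(wx + b)$ trained by plain gradient descent on a squared-error loss, with data and initial parameters chosen so that every pre-activation starts negative. With ReLU the gradient is zero at every step and the parameters never move; with a leaky slope of 0.01 they do.

```python
def relu(z):
    return z if z > 0 else 0.0


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def leaky_relu(z, alpha=0.01):
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha


def train(act, act_grad, w=-1.0, b=-0.5, lr=0.05, steps=100):
    """Gradient descent on sum((act(w*x + b) - y)^2) over a tiny dataset."""
    data = [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]  # made-up (x, y) pairs, all x > 0
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in data:
            z = w * x + b  # negative for every x at the initial w, b
            dz = 2.0 * (act(z) - y) * act_grad(z)  # dL/dz via the chain rule
            grad_w += dz * x
            grad_b += dz
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


dead_w, dead_b = train(relu, relu_grad)
leaky_w, leaky_b = train(leaky_relu, leaky_relu_grad)
```

The ReLU unit ends exactly where it started, while the leaky unit's parameters drift until its pre-activations turn positive and ordinary learning resumes.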

$$\mathrm{LeakyReLU}(x) = \max(\alpha x, x), \quad \alpha \approx 0.01$$

$$\mathrm{PReLU}(x) = \max(\alpha x, x), \quad \alpha \text{ learned}$$

$$\mathrm{ELU}(x) = \begin{cases} x & x > 0 \\ \alpha (e^x - 1) & x \leq 0 \end{cases}$$

$$\mathrm{GELU}(x) = x \cdot \Phi(x) \approx \tfrac{1}{2} x \bigl(1 + \tanh\bigl(\sqrt{2/\pi}(x + 0.044715 x^3)\bigr)\bigr)$$

where $\Phi$ is the standard normal CDF. GELU, introduced by Hendrycks and Gimpel (2016), is the activation used in BERT and the GPT-2/GPT-3 family; the original T5 used ReLU. The LLaMA family (from its first release) and Mistral instead use SwiGLU, a gated variant built on SiLU/Swish, $x \cdot \sigma(x)$, where $\sigma$ is the logistic sigmoid.
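The variants above translate directly into scalar Python. This is a sketch for checking the formulas, not a framework API: the function names are hypothetical, the default $\alpha$ values are assumptions, and $\Phi$ is computed from the error function as $\Phi(x) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$. PReLU has the same form as `leaky_relu` with $\alpha$ treated as a trainable parameter, and the gated SwiGLU layer is omitted because it needs weight matrices.

```python
import math


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # max(alpha * x, x) matches the piecewise form only when 0 <= alpha <= 1.
    return max(alpha * x, x)


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gelu_exact(x: float) -> float:
    # x * Phi(x), with Phi written in terms of erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    # The tanh approximation from the formula above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def silu(x: float) -> float:
    # x * sigmoid(x), also called Swish.
    return x / (1.0 + math.exp(-x))


# Largest gap between exact GELU and its tanh approximation on [-5, 5].
xs = [i / 10.0 for i in range(-50, 51)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
```

On this grid the two GELU forms stay close, which is why the tanh form is commonly used as a drop-in approximation.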

Related terms: Sigmoid Function, Tanh, Activation Function, Vanishing Gradient Problem
