9.4 Activation functions

In §9.3 we built the multilayer perceptron by stacking linear maps and inserting a small non-linear function, written generically as $\sigma$, between each pair of layers. We did not say much about what that function actually is, beyond noting that it had to be non-linear. The job of this section is to fill in that gap. We will look at the half-dozen activation functions you are likely to meet in modern code, work small numerical examples for each so you can see what they do to a number, and end with practical defaults you can use without much further thought.

Before we look at any specific activation, we need to be clear about why a non-linear function is mandatory. The argument is short, requires nothing beyond schoolroom algebra, and explains everything that follows.

Suppose we build a two-layer network without any activation between the layers. The first layer takes input $\mathbf{x}$ and produces $\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$. The second layer takes that output and applies its own weights and bias, giving

$$\mathbf{W}^{(2)}(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) + \mathbf{b}^{(2)} = (\mathbf{W}^{(2)}\mathbf{W}^{(1)})\mathbf{x} + (\mathbf{W}^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)}).$$

The right-hand side is a single matrix multiplied by $\mathbf{x}$, plus a single bias vector. That is exactly the form of one linear layer. We could replace our two-layer stack with the single matrix $\mathbf{W}' = \mathbf{W}^{(2)}\mathbf{W}^{(1)}$ and bias $\mathbf{b}' = \mathbf{W}^{(2)}\mathbf{b}^{(1)} + \mathbf{b}^{(2)}$ and obtain identical outputs for every input. The same argument extends to ten layers, a hundred layers, or a thousand. Without a non-linear activation, an arbitrarily deep network mathematically collapses into a single linear layer. The activation is what stops that collapse. It is the entire reason depth means anything.
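The collapse is easy to verify numerically. The sketch below is a minimal NumPy example (the layer sizes and random seed are arbitrary choices for illustration): it builds two linear layers, composes them without an activation, and checks that a single merged layer produces exactly the same output.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # layer 1: 3 -> 4
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # layer 2: 4 -> 2
x = rng.normal(size=3)

# Two linear layers applied in sequence, with no activation in between.
two_layer = W2 @ (W1 @ x + b1) + b2

# One merged linear layer with W' = W2 W1 and b' = W2 b1 + b2.
merged = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(two_layer, merged))   # True: the stack collapses to one layer
```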

This sets up two later sections. §9.10 and §9.11 look at how the choice of activation affects the size of gradients as they flow back through many layers, which turns out to govern whether deep networks can be trained at all.

Symbols Used Here

- $z$: the pre-activation, a real number formed by a weighted sum plus a bias before any non-linearity is applied
- $\sigma(z)$: sigmoid, defined as $\sigma(z) = 1/(1+e^{-z})$
- $\tanh(z)$: hyperbolic tangent
- $\mathrm{ReLU}(z) = \max(0, z)$: rectified linear unit
- $\mathrm{LeakyReLU}(z)$: leaky rectifier, identical to ReLU on the positive side and $\alpha z$ on the negative side
- $\alpha$: leak parameter, a small positive number, typically $0.01$
- $\mathrm{ELU}(z)$, $\mathrm{GELU}(z)$, $\mathrm{Swish}(z)$: smoother variants used in modern architectures
- $\Phi(z)$: Gaussian cumulative distribution function, the probability that a standard normal random variable is at most $z$
- $\mathrm{softmax}(\mathbf{z})_i = e^{z_i}/\sum_j e^{z_j}$: softmax, used to turn a vector of real numbers into a probability distribution
- $\sigma'(z)$: derivative of an activation function, used during backpropagation

Why non-linearity is essential

We just showed the algebra. It is worth pausing on the consequence. If you build a model called "deep" but forget to insert non-linearities, you have built a fancy way of writing a single linear regression. Every claim about depth, hierarchy, and learned features rests on a non-linear function being applied between layers. Pick any activation you like from the catalogue below; the catalogue exists because the choice of which non-linearity to use matters in practice, but the existence of some non-linearity is non-negotiable.

A useful mental picture: a linear layer can rotate, scale, and shear the input space, but it cannot bend it. The activation puts a kink in the surface. Stack many bent surfaces and you get something genuinely curved, with regions where the network behaves one way and other regions where it behaves quite differently. That ability to behave differently in different parts of input space is the machinery that lets a network classify cats and dogs, or fit a complicated function, or anything else interesting.

A second observation worth holding on to is that not all non-linearities are equally useful. The activation has to be non-polynomial to give the network its full approximating power, a result we will meet formally in §9.5, but beyond that requirement, the choice is largely an engineering decision. The functions that survive in modern practice are the ones that play well with gradient-based training, that do not saturate in ways that kill backpropagation, and that are cheap to compute on the hardware we actually own. Several mathematically clean activations have been proposed and quietly abandoned for failing one or more of these tests, so the catalogue below is a curated short-list rather than an exhaustive zoo.

Sigmoid

The sigmoid function is

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

Here $z$ is any real number. The output always lies strictly between $0$ and $1$, and the curve has a smooth S-shape passing through the point $(0, 0.5)$. Worked values give the feel of it: $\sigma(0) = 0.5$, $\sigma(1) \approx 0.7311$, $\sigma(-1) \approx 0.2689$, $\sigma(10) \approx 0.99995$, and $\sigma(-10) \approx 0.000045$. The function squashes large positive inputs almost to $1$ and large negative inputs almost to $0$.

The derivative has a particularly clean form,

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z)),$$

which is largest at $z = 0$ where it equals $0.25$, and tiny in the tails. This is the source of sigmoid's reputation for vanishing gradients: in a deep stack, the gradient that backpropagation passes layer-to-layer gets multiplied by $\sigma'$ at each step, and because that multiplier never exceeds $0.25$ and is usually much smaller, the signal shrinks rapidly as it travels back through the network.
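Both the worked values and the derivative take only a few lines to check. The sketch below is a minimal NumPy version (the function names are ours, chosen for illustration).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # sigma'(z) = sigma(z) (1 - sigma(z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))        # ~[0.000045, 0.2689, 0.5, 0.7311, 0.99995]
print(sigmoid_grad(z))   # peaks at 0.25 for z = 0, tiny in the tails
```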

Sigmoid does have one genuine virtue. Its output looks like a probability, so it is the natural choice at the output of a binary classifier, where we want a number between $0$ and $1$ that can be read as "probability of class one". A second, more subtle drawback for hidden layers is that sigmoid is not zero-centred: every output is positive, which biases the gradient updates of the next layer's weights so that they all share a sign, leading optimisation to zigzag rather than head straight for the answer. For decades, sigmoid was the standard hidden-layer activation; today it has been almost completely displaced from hidden layers and survives chiefly at output layers and inside the gating mechanisms of LSTMs and GRUs, where its bounded $(0, 1)$ output is genuinely useful as a soft on/off switch.

Tanh

The hyperbolic tangent is closely related to sigmoid and has the formula

$$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}.$$

It is in fact a rescaled sigmoid: $\tanh(z) = 2\sigma(2z) - 1$. The output range is $(-1, 1)$, and the curve passes through the origin. Worked values: $\tanh(0) = 0$, $\tanh(1) \approx 0.7616$, $\tanh(-1) \approx -0.7616$, $\tanh(3) \approx 0.9951$. The derivative is $1 - \tanh^2(z)$, which reaches a maximum of $1$ at $z = 0$, four times larger than the sigmoid's peak derivative.
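The rescaling identity and the worked values are equally quick to verify; a minimal NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, 0.0, 1.0, 3.0])
print(np.tanh(z))              # ~[-0.7616, 0.0, 0.7616, 0.9951]
print(2 * sigmoid(2 * z) - 1)  # same values: tanh is a rescaled sigmoid
print(1 - np.tanh(z) ** 2)     # derivative, maximum 1 at z = 0
```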

Being zero-centred, tanh fixes the bias-direction problem of sigmoid and so was the standard hidden-layer activation through the 1990s. It still saturates for inputs far from zero, however, so deep stacks of tanh units suffer from the same vanishing-gradient problem as sigmoid, just less severely. In modern code you will mostly meet tanh inside the gates of older recurrent architectures (LSTMs and GRUs) and very occasionally as a normalising trick at the output of a generative model that expects values in $[-1, 1]$.

ReLU

The rectified linear unit is

$$\mathrm{ReLU}(z) = \max(0, z).$$

If $z$ is positive, pass it through unchanged; if $z$ is negative, output zero. Worked values: $\mathrm{ReLU}(2) = 2$, $\mathrm{ReLU}(-3) = 0$, $\mathrm{ReLU}(0) = 0$, $\mathrm{ReLU}(0.5) = 0.5$. The derivative is $1$ for $z > 0$, $0$ for $z < 0$, and undefined at $z = 0$. Frameworks simply pick a conventional value at the kink (most use $0$), and the choice is irrelevant in practice because floating-point arithmetic almost never produces an exact zero.
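In code, ReLU and its derivative are one line each. A minimal NumPy sketch, using the usual convention of gradient $0$ at the kink:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Subgradient convention: 1 where z > 0, else 0 (the value at exactly 0 is arbitrary).
    return (z > 0).astype(float)

z = np.array([-3.0, 0.0, 0.5, 2.0])
print(relu(z))        # [0.0, 0.0, 0.5, 2.0]
print(relu_grad(z))   # [0.0, 0.0, 1.0, 1.0]
```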

ReLU is the workhorse of modern deep learning. It became dominant after Krizhevsky's 2012 ImageNet result and remains the default for hidden layers in standard MLPs and convolutional networks. Three properties explain its success. First, there is no saturation on the active side: for any positive input the gradient is exactly $1$, so signals of any magnitude pass cleanly through and gradients do not shrink as they travel back through layers. Second, the function is sparse: roughly half of all units output zero on a typical input, which produces fast computation and representations that mirror the sparsity of biological neural firing. Third, it is computationally trivial, a single comparison and a multiplexer, with no exponentials to evaluate.

The drawback is the so-called dead ReLU problem. If a unit's pre-activation $z$ becomes negative for every input it ever sees, then its output is zero and its gradient is also zero, so weight updates stop and the unit stays dead forever. The unit cannot dig itself out of the hole because there is no gradient to dig with. In a network that is poorly initialised or trained with a too-large learning rate, a substantial fraction of the units can die during the first few thousand steps and never come back. Careful initialisation (§9.10) keeps the rate of dying units low; the variants below address the problem directly by giving the negative side a non-zero slope.

Leaky ReLU and PReLU

Leaky ReLU is the smallest possible fix to the dying-unit problem. The formula is

$$\mathrm{LeakyReLU}(z) = \max(\alpha z, z),$$

where $\alpha$ is a small positive number, typically $0.01$. For positive $z$ the function is identical to ReLU; for negative $z$ it returns $\alpha z$ instead of zero. Worked values with $\alpha = 0.01$: $\mathrm{LeakyReLU}(2) = 2$, $\mathrm{LeakyReLU}(-3) = -0.03$, $\mathrm{LeakyReLU}(0) = 0$. The negative side has a small but non-zero gradient of $\alpha$, so a unit that drifts into the negative region still receives weight updates and can in principle climb back out.
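A minimal NumPy sketch of the same formula, with the default $\alpha = 0.01$ mirroring the typical value quoted above:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # max(alpha * z, z): identical to ReLU for z >= 0, slope alpha for z < 0
    return np.maximum(alpha * z, z)

z = np.array([-3.0, 0.0, 2.0])
print(leaky_relu(z))   # [-0.03, 0.0, 2.0]
```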

Parametric ReLU, or PReLU, is the same idea with $\alpha$ promoted to a learnable parameter that the network optimises along with its weights. Each neuron, or each channel in a convolutional network, has its own leak. PReLU often gives a small but real gain on image classification, at the cost of one extra parameter per neuron.

ELU and SELU

The exponential linear unit is a smoother negative-side variant:

$$\mathrm{ELU}(z) = \begin{cases} z & z \ge 0,\\ \alpha(e^z - 1) & z < 0.\end{cases}$$

Here $\alpha$ is a positive constant, usually $1$. The negative tail saturates softly toward $-\alpha$ rather than running off to $-\infty$ as Leaky ReLU does. Because the negative outputs are themselves negative, the mean activation of an ELU layer sits closer to zero than that of a ReLU layer, which has a regularising effect similar in spirit to batch normalisation.

SELU, the scaled exponential linear unit, picks specific values of $\alpha$ and an outer scale $\lambda$, approximately $\alpha \approx 1.6733$ and $\lambda \approx 1.0507$, that make a fully-connected stack of SELU units self-normalising, meaning the activations stay close to zero mean and unit variance from layer to layer without needing an explicit normalisation step. The construction works only under fairly restrictive architectural assumptions and has not displaced batch normalisation in mainstream practice, but it is mathematically pretty.
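A minimal NumPy sketch of both functions, with the SELU constants rounded as quoted above:

```python
import numpy as np

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1.0))

def selu(z, alpha=1.6733, lam=1.0507):
    # SELU is lambda * ELU(z, alpha) with these specific constants.
    return lam * elu(z, alpha)

z = np.array([-5.0, -1.0, 0.0, 2.0])
print(elu(z))    # negative tail saturates softly toward -alpha = -1
print(selu(z))
```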

GELU and Swish

The Gaussian error linear unit was the hidden-layer activation in BERT, GPT-2, and most other Transformers of that era; many recent large language models (Llama 2 and 3, Mistral, DeepSeek, Qwen, OLMo) have since moved to SwiGLU (Shazeer 2020), a gated variant built from Swish. GELU itself is defined by

$$\mathrm{GELU}(z) = z\,\Phi(z),$$

where $\Phi(z)$ is the standard normal cumulative distribution function, the probability that a Gaussian random variable with mean $0$ and variance $1$ takes a value less than or equal to $z$. The intuition is that GELU passes the input through with a probability that grows smoothly from $0$ for very negative $z$ to $1$ for very positive $z$, blending ReLU's sparsity with sigmoid's smoothness. Because $\Phi$ involves an integral, a closed-form approximation is often used in code:

$$\mathrm{GELU}(z) \approx 0.5\,z\!\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\,(z + 0.044715\,z^3)\right)\right).$$

Swish, also known as SiLU, has the related form

$$\mathrm{Swish}(z) = z\,\sigma(z),$$

where $\sigma$ is the ordinary sigmoid. Swish was discovered by an automated architecture search and behaves very similarly to GELU. Both functions are described as self-gated: the function uses a smoothed version of its own input as a gate that decides how much of the input to pass through. Swish appears in EfficientNet, in some Llama variants, and across diffusion models.

Both functions also have a small but mathematically interesting feature: they are non-monotonic. Look at GELU near $z = -0.5$ and you will see the curve dip slightly below zero before climbing back up toward the origin. ReLU has no such dip; its negative side is identically zero. The dip means GELU and Swish allow a small negative output for moderately negative inputs, which empirically seems to help training in very deep networks. Whether the non-monotonicity is doing the work, or whether the smoothness of the function is, is still debated, but the effect is reproducible enough that nearly every Transformer published since 2018 uses one of these two activations rather than plain ReLU.

Softmax (output layer only)

Softmax is different from the activations above in two important ways. It acts on a whole vector at once rather than on a single number, and it is used at the output of a network rather than between hidden layers. The formula is

$$\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}},$$

where $\mathbf{z} = (z_1, z_2, \ldots, z_K)$ is the vector of pre-activations and $i$ indexes one component of that vector. The exponential makes every output positive; dividing by the total sum makes the outputs add to $1$. The result is a valid probability distribution over $K$ classes.

A worked example. Suppose the output layer of a three-class classifier produces $\mathbf{z} = (2.0, 1.0, 0.1)$. We compute $e^{2.0} \approx 7.389$, $e^{1.0} \approx 2.718$, $e^{0.1} \approx 1.105$. The denominator is $7.389 + 2.718 + 1.105 \approx 11.213$. Dividing each numerator by the denominator gives $\mathrm{softmax}(\mathbf{z}) \approx (0.659, 0.242, 0.099)$. The three numbers are positive and sum to $1.000$, as required. The first class is most likely, the second is the runner-up, and the third is least likely, and the relative ordering matches the ordering of the original $z$ values, as it must.
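The same computation in code, as a minimal NumPy sketch (the max-subtraction inside the function is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    # Subtracting the max avoids overflow in exp; it cancels in the ratio,
    # so the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)
print(p)          # ~[0.659, 0.242, 0.099]
print(p.sum())    # 1.0
```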

Softmax is not used as a hidden-layer activation. It ties units together: changing one $z_i$ changes every output, because every output depends on the full sum in the denominator. That coupling is exactly what we want at an output where the outputs must compete to add to $1$, but it would prevent hidden units from learning independent features.

Practical guidance in 2026

For hidden layers in MLPs and CNNs, ReLU is the default. It trains fast, costs almost nothing, and works well. For Transformers, GELU is the standard, with Swish/SiLU a close cousin found in some Llama variants and in diffusion models. For a binary output (yes/no, spam/not-spam, malignant/benign), use a single sigmoid neuron. For a multi-class output, use a softmax over the classes. Tanh is largely obsolete in new code, surviving mostly inside the gates of older recurrent architectures and in occasional normalising tricks at the output of generative models.

In well-tuned modern networks the choice between ReLU, Leaky ReLU, ELU, GELU, and Swish rarely makes a dramatic difference. Other choices, such as the initialisation scheme (§9.10), normalisation layers (§9.13), and the learning-rate schedule (§9.14), will move your accuracy more than the activation function will. Pick a reasonable default and spend your tuning effort elsewhere.

A reasonable rule of thumb if you are starting a new project from scratch: use ReLU until you have a working baseline, then if you are training a Transformer or a generative model, switch to GELU or Swish and check whether it helps. If you are seeing many dead units during the early phase of training, swap in Leaky ReLU. If you are operating in a setting where batch normalisation is impractical, ELU or SELU are worth a look. None of these substitutions should be expected to deliver more than a percentage point or two on a well-tuned baseline, and each carries a small risk of slowing down training while you re-tune the learning rate to suit it. The early years of deep learning produced a great deal of activation-function tinkering; the modern consensus is that the field has converged to a small handful of safe choices and the productive work lies elsewhere, in data, architecture, and scale, rather than in further activation-function search.

What you should take away

  1. Without a non-linear activation between layers, any deep network collapses algebraically into a single linear layer. The activation is the only thing that makes depth meaningful.
  2. Sigmoid and tanh saturate in their tails, producing tiny gradients that vanish across deep stacks. Use sigmoid only at binary outputs and tanh only inside older recurrent cells.
  3. ReLU is the modern default for hidden layers: cheap, sparse, free of saturation on the active side, but vulnerable to dead units that can be mitigated with Leaky ReLU, PReLU, or ELU.
  4. GELU and Swish are smooth, self-gated activations that dominate Transformers and diffusion models; their differences from one another are usually within training noise.
  5. Softmax converts a vector of real numbers into a probability distribution and belongs only at the output of a multi-class classifier, never as a hidden-layer activation, because it couples units together.
