Each squashes a real number through a nonlinear curve. The choice shapes how gradients flow.
Transcript
Four activation functions, plotted on the same axes.
The sigmoid squashes any real number into the interval zero to one. Smooth. Saturates at the extremes. Its derivative is at most one quarter, and near zero away from the centre.
Tanh, the hyperbolic tangent. Same shape but centred at zero, ranging from minus one to plus one. Steeper than sigmoid at the centre: its derivative peaks at one rather than one quarter.
ReLU, the rectified linear unit. Zero for negative inputs, the input itself for positive inputs. A simple bend at the origin. Cheap to compute, gradient of one or zero.
GELU, the smooth cousin of ReLU. Matches ReLU for large positive inputs, dips slightly below zero for small negative inputs, and flattens toward zero for large negative ones.
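As a minimal numpy sketch of the four curves just described (function names are mine; the GELU here uses the common tanh approximation of x times the standard normal CDF):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # range (0, 1); derivative at most 0.25

def tanh(x):
    return np.tanh(x)                       # range (-1, 1); derivative at most 1

def relu(x):
    return np.maximum(0.0, x)               # 0 for x < 0, x for x > 0; gradient 0 or 1

def gelu(x):
    # common tanh approximation of GELU(x) = x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

xs = np.linspace(-4.0, 4.0, 9)
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu), ("gelu", gelu)]:
    print(f"{name:8s}", np.round(f(xs), 3))
```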
The shape of an activation matters because gradients flow through its derivative.
Sigmoid and tanh saturate. In their flat tails, the derivative is near zero. Backpropagation multiplies those local derivatives layer by layer, so stack a hundred of them and the gradient at the bottom is essentially nothing. The vanishing gradient problem.
ReLU keeps a derivative of one in the active region. Gradients pass through unchanged. This is why ReLU was the breakthrough that made very deep networks trainable.
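A back-of-the-envelope sketch of that contrast (the depth of one hundred comes from the discussion above; the rest is illustrative): even in the sigmoid's best case, every layer sitting at its steepest point, the product of derivatives collapses, while an all-active ReLU path passes the gradient through untouched.

```python
import numpy as np

def d_sigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                 # never exceeds 0.25

def d_relu(x):
    return (x > 0).astype(float)         # exactly 1 where the unit is active

depth = 100
# Best case for sigmoid: every layer sits at its steepest point, x = 0.
best_sigmoid = d_sigmoid(np.zeros(depth)).prod()   # 0.25 ** 100, about 6e-61
# An all-active ReLU path: every local derivative is exactly 1.
active_relu = d_relu(np.ones(depth)).prod()        # 1.0

print(f"100 sigmoid layers, best case: {best_sigmoid:.2e}")
print(f"100 ReLU layers, active path:  {active_relu:.2f}")
```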
GELU and SiLU, also known as Swish, smooth out the kink for transformers. Small detail, real difference.
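To see what "smoothing out the kink" means, here is an illustrative sketch (function names and step size are mine) comparing numerical derivatives near zero: ReLU's gradient jumps from zero to one at the origin, while GELU and SiLU pass through one half smoothly.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU(x) = x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def numerical_grad(f, x, h=1e-4):
    # central finite difference, accurate enough to show the shape of the derivative
    return (f(x + h) - f(x - h)) / (2 * h)

xs = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
print(np.round(numerical_grad(relu, xs), 3))   # jumps from 0 to 1 across the origin
print(np.round(numerical_grad(gelu, xs), 3))   # changes smoothly through 0.5
print(np.round(numerical_grad(silu, xs), 3))   # likewise smooth
```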