The hyperbolic tangent is a smooth, S-shaped activation function defined as
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{\sinh(x)}{\cosh(x)} = 2\sigma(2x) - 1,$$
where $\sigma$ is the logistic sigmoid. It maps the real line to the open interval $(-1, 1)$, with $\tanh(0) = 0$, $\tanh(x) \to 1$ as $x \to \infty$, and $\tanh(x) \to -1$ as $x \to -\infty$. Its derivative,
$$\tanh'(x) = 1 - \tanh^2(x),$$
attains its maximum value of $1$ at the origin and decays rapidly toward zero as $|x|$ grows.
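The derivative identity can be checked numerically; a minimal plain-Python sketch (the helper name `tanh_prime` is introduced here for illustration):

```python
import math

def tanh_prime(x):
    # tanh'(x) = 1 - tanh(x)^2: peaks at 1 at the origin, decays fast.
    t = math.tanh(x)
    return 1.0 - t * t

print(tanh_prime(0.0))   # 1.0
print(tanh_prime(3.0))   # already below 0.01
```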
Relation to the sigmoid
Tanh is a rescaled and shifted sigmoid. Where the logistic sigmoid produces values in $(0, 1)$ centred at $0.5$, tanh produces values in $(-1, 1)$ centred at $0$. This zero-centring matters in practice: when activations are systematically positive, the gradient with respect to weights in the next layer also has a consistent sign, which biases stochastic gradient descent toward zigzag updates. Tanh's zero-mean output mitigates this and was, throughout the 1990s and 2000s, a primary reason it was preferred over the sigmoid in hidden layers (LeCun et al., Efficient BackProp, 1998).
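The rescaling identity $\tanh(x) = 2\sigma(2x) - 1$ is easy to verify numerically; a minimal plain-Python sketch:

```python
import math

def sigmoid(x):
    # Logistic sigmoid: sigma(x) = 1 / (1 + e^{-x}), values in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# tanh is the sigmoid stretched to (-1, 1) and centred at 0:
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(2.0 * sigmoid(2.0 * x) - 1.0 - math.tanh(x)) < 1e-12
```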
Vanishing gradients
Like the sigmoid, tanh saturates: outside roughly $|x| > 3$, $\tanh'(x)$ is essentially zero. In a deep network this causes vanishing gradients: the chain rule multiplies many small derivatives together, exponentially attenuating the error signal that reaches the early layers. This was a central obstacle to training networks with more than a few layers prior to the introduction of ReLU (Glorot, Bordes & Bengio, 2011), which has a constant gradient of $1$ on its active half-line.
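The attenuation can be demonstrated with a toy scalar chain of tanh layers; the fixed weight of $2.0$ below is an illustrative assumption chosen to keep each layer mildly saturated:

```python
import math

def chain_gradient(depth, x=1.5, w=2.0):
    # Backpropagated derivative through `depth` stacked scalar tanh layers,
    # each computing h -> tanh(w * h) with a fixed weight w.
    grad = 1.0
    h = x
    for _ in range(depth):
        h = math.tanh(w * h)          # forward pass through one layer
        grad *= w * (1.0 - h * h)     # chain rule: d tanh(w h)/dh = w (1 - tanh^2)
    return grad

# The gradient reaching the input shrinks roughly geometrically with depth:
for d in (1, 5, 10, 20):
    print(f"depth {d:2d}: gradient {chain_gradient(d):.2e}")
```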
Modern roles
In feed-forward and convolutional networks, tanh has been almost entirely displaced by ReLU, GELU, and SiLU. It nonetheless survives in several settings:
- LSTM cell-state updates: the candidate cell-state contribution $\tilde c_t = \tanh(W_c x_t + U_c h_{t-1})$ uses tanh to keep the additive update to the cell state bounded in $(-1, 1)$, which is essential for the additive recurrence to remain stable over long sequences.
- GRU candidate states: the same role, ensuring the candidate hidden state cannot grow without bound.
- Output activations: when a regression target is naturally bounded in $(-1, 1)$, for example a normalised image pixel or the output of a generator in an image-to-image GAN.
- Continuous control: policy networks for continuous-action reinforcement learning frequently squash the action mean through tanh so that actions lie in a bounded box.
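The bounding roles above can be sketched in scalar form; function names and the scalar stand-ins for the weight matrices are hypothetical, for illustration only:

```python
import math

def lstm_candidate(x_t, h_prev, w_c=0.5, u_c=0.5, b_c=0.0):
    # Scalar stand-in for the candidate \tilde c_t = tanh(W_c x_t + U_c h_{t-1} + b_c):
    # however large the pre-activation, the contribution stays in (-1, 1).
    return math.tanh(w_c * x_t + u_c * h_prev + b_c)

def squash_action(raw_mean, low=-2.0, high=2.0):
    # Map an unbounded policy-network output into the action box [low, high]
    # by rescaling tanh's (-1, 1) range.
    return low + (high - low) * (math.tanh(raw_mean) + 1.0) / 2.0

print(lstm_candidate(5.0, 5.0))   # close to, but strictly below, 1
print(squash_action(100.0))       # clipped smoothly to the top of the box
```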
Tanh's enduring presence in recurrent architectures reflects the importance of bounded, zero-centred updates for stability; its retreat from feed-forward layers reflects the priority deep learning now places on avoiding vanishing gradients in very deep stacks.
Related terms: Sigmoid Function, ReLU, Activation Function, LSTM, GRU
Discussed in:
- Chapter 6: ML Fundamentals, Activation functions