The hyperbolic tangent is a smooth, S-shaped activation function defined as
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{\sinh(x)}{\cosh(x)} = 2\sigma(2x) - 1,$$
where $\sigma$ is the logistic sigmoid. It maps the real line to the open interval $(-1, 1)$, with $\tanh(0) = 0$, $\tanh(x) \to 1$ as $x \to \infty$, and $\tanh(x) \to -1$ as $x \to -\infty$. Its derivative,
$$\tanh'(x) = 1 - \tanh^2(x),$$
attains its maximum value of $1$ at the origin and decays rapidly toward zero as $|x|$ grows.
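The derivative identity can be checked numerically; a minimal plain-Python sketch (the helper name `tanh_prime` is introduced here for illustration):

```python
import math

def tanh_prime(x):
    # tanh'(x) = 1 - tanh(x)^2: peaks at 1 at the origin, decays fast.
    t = math.tanh(x)
    return 1.0 - t * t

print(tanh_prime(0.0))   # 1.0
print(tanh_prime(3.0))   # already below 0.01
```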
Relation to the sigmoid
Tanh is a rescaled and shifted sigmoid. Where the logistic sigmoid produces values in $(0, 1)$ centred at $0.5$, tanh produces values in $(-1, 1)$ centred at $0$. This zero-centring matters in practice: when activations are systematically positive, the gradient with respect to weights in the next layer also has a consistent sign, which biases stochastic gradient descent toward zigzag updates. Tanh's zero-mean output mitigates this and was, throughout the 1990s and 2000s, a primary reason it was preferred over the sigmoid in hidden layers (LeCun et al., Efficient BackProp, 1998).
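The rescaling identity $\tanh(x) = 2\sigma(2x) - 1$ is easy to verify numerically; a minimal plain-Python sketch:

```python
import math

def sigmoid(x):
    # Logistic sigmoid: sigma(x) = 1 / (1 + e^{-x}), values in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

# tanh is the sigmoid stretched to (-1, 1) and centred at 0:
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(2.0 * sigmoid(2.0 * x) - 1.0 - math.tanh(x)) < 1e-12
```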
Vanishing gradients
Like the sigmoid, tanh saturates: outside roughly $|x| > 3$, $\tanh'(x)$ is essentially zero. In a deep network this causes vanishing gradients: the chain rule multiplies many small derivatives together, exponentially attenuating the error signal that reaches the early layers. This was a central obstacle to training networks with more than a few layers prior to the introduction of ReLU (Glorot, Bordes & Bengio, 2011), which has a constant gradient of $1$ on its active half-line.
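The attenuation can be demonstrated with a toy scalar chain of tanh layers; the fixed weight of $2.0$ below is an illustrative assumption chosen to keep each layer mildly saturated:

```python
import math

def chain_gradient(depth, x=1.5, w=2.0):
    # Backpropagated derivative through `depth` stacked scalar tanh layers,
    # each computing h -> tanh(w * h) with a fixed weight w.
    grad = 1.0
    h = x
    for _ in range(depth):
        h = math.tanh(w * h)          # forward pass through one layer
        grad *= w * (1.0 - h * h)     # chain rule: d tanh(w h)/dh = w (1 - tanh^2)
    return grad

# The gradient reaching the input shrinks roughly geometrically with depth:
for d in (1, 5, 10, 20):
    print(f"depth {d:2d}: gradient {chain_gradient(d):.2e}")
```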
Modern roles
In feed-forward and convolutional networks, tanh has been almost entirely displaced by ReLU, GELU, and SiLU. It nonetheless survives in several settings:
- LSTM cell-state updates: the candidate cell-state contribution $\tilde c_t = \tanh(W_c x_t + U_c h_{t-1})$ uses tanh to keep the additive update to the cell state bounded in $(-1, 1)$, which is essential for the additive recurrence to remain stable over long sequences.
- GRU candidate states: the same role, ensuring the candidate hidden state cannot grow without bound.
- Output activations: when a regression target is naturally bounded in $(-1, 1)$, for example a normalised image pixel or the output of a generator in an image-to-image GAN.
- Continuous control: policy networks for continuous-action reinforcement learning frequently squash the action mean through tanh so that actions lie in a bounded box.
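The bounding roles above can be sketched in scalar form; function names and the scalar stand-ins for the weight matrices are hypothetical, for illustration only:

```python
import math

def lstm_candidate(x_t, h_prev, w_c=0.5, u_c=0.5, b_c=0.0):
    # Scalar stand-in for the candidate \tilde c_t = tanh(W_c x_t + U_c h_{t-1} + b_c):
    # however large the pre-activation, the contribution stays in (-1, 1).
    return math.tanh(w_c * x_t + u_c * h_prev + b_c)

def squash_action(raw_mean, low=-2.0, high=2.0):
    # Map an unbounded policy-network output into the action box [low, high]
    # by rescaling tanh's (-1, 1) range.
    return low + (high - low) * (math.tanh(raw_mean) + 1.0) / 2.0

print(lstm_candidate(5.0, 5.0))   # close to, but strictly below, 1
print(squash_action(100.0))       # clipped smoothly to the top of the box
```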
Tanh's enduring presence in recurrent architectures reflects the importance of bounded, zero-centred updates for stability; its retreat from feed-forward layers reflects the priority deep learning now places on avoiding vanishing gradients in very deep stacks.
Related terms: Sigmoid Function, ReLU, Activation Function, LSTM, GRU
Discussed in:
- Chapter 6: ML Fundamentals, Activation functions