Glossary

Tanh

The hyperbolic tangent is a smooth, S-shaped activation function defined as

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{\sinh(x)}{\cosh(x)} = 2\sigma(2x) - 1,$$

where $\sigma$ is the logistic sigmoid. It maps the real line to the open interval $(-1, 1)$, with $\tanh(0) = 0$, $\tanh(x) \to 1$ as $x \to \infty$, and $\tanh(x) \to -1$ as $x \to -\infty$. Its derivative,

$$\tanh'(x) = 1 - \tanh^2(x),$$

attains its maximum value of $1$ at the origin and decays rapidly toward zero as $|x|$ grows.
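The identities above are easy to check numerically. A minimal sketch using only the standard library (the helper names `tanh` and `sigmoid` are illustrative):

```python
import math

def tanh(x: float) -> float:
    """Hyperbolic tangent from its exponential definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # tanh(x) = 2*sigma(2x) - 1
    assert abs(tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
    # tanh'(x) = 1 - tanh(x)^2, checked against a central finite difference
    h = 1e-6
    numeric = (tanh(x + h) - tanh(x - h)) / (2.0 * h)
    assert abs(numeric - (1.0 - tanh(x) ** 2)) < 1e-6
```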

Relation to the sigmoid

Tanh is a rescaled and shifted sigmoid. Where the logistic sigmoid produces values in $(0, 1)$ centred at $0.5$, tanh produces values in $(-1, 1)$ centred at $0$. This zero-centring matters in practice: when activations are systematically positive, the gradient with respect to weights in the next layer also has a consistent sign, which biases stochastic gradient descent toward zigzag updates. Tanh's zero-mean output mitigates this and was, throughout the 1990s and 2000s, a primary reason it was preferred over the sigmoid in hidden layers (LeCun et al., Efficient BackProp, 1998).

Vanishing gradients

Like the sigmoid, tanh saturates: outside roughly $|x| > 3$, $\tanh'(x)$ is essentially zero. In a deep network this causes vanishing gradients: the chain rule multiplies many small derivatives together, exponentially attenuating the error signal that reaches the early layers. This was a central obstacle to training networks with more than a few layers prior to the introduction of ReLU (Glorot, Bordes & Bengio, 2011), which has a constant gradient of $1$ on its active half-line.
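The geometric attenuation is stark even at mild saturation. A toy sketch (the assumption that every layer's pre-activation sits at $|x| = 3$ is illustrative, not from any particular network):

```python
import math

def tanh_grad(x: float) -> float:
    """Derivative of tanh via the identity tanh'(x) = 1 - tanh(x)^2."""
    t = math.tanh(x)
    return 1.0 - t * t

# Suppose every layer's pre-activation sits at |x| = 3, only mildly
# saturated. Backprop multiplies the local derivatives, so the gradient
# reaching layer 1 of a depth-d stack shrinks geometrically in d.
local = tanh_grad(3.0)          # roughly 0.01 per layer
for depth in (1, 5, 10):
    print(depth, local ** depth)
```

Ten such layers attenuate the gradient by roughly twenty orders of magnitude, which is why early layers of deep tanh networks learned so slowly.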

Modern roles

In feed-forward and convolutional networks, tanh has been almost entirely displaced by ReLU, GELU, and SiLU. It nonetheless survives in several settings:

  • LSTM cell-state updates: the candidate cell-state contribution $\tilde c_t = \tanh(W_c x_t + U_c h_{t-1})$ uses tanh to keep the additive update to the cell state bounded in $(-1, 1)$, which is essential for the additive recurrence to remain stable over long sequences.
  • GRU candidate states: the same role, ensuring the candidate hidden state cannot grow without bound.
  • Output activations: when a regression target is naturally bounded in $(-1, 1)$, for example a normalised image pixel or the output of a generator in an image-to-image GAN.
  • Continuous control: policy networks for continuous-action reinforcement learning frequently squash the action mean through tanh so that actions lie in a bounded box.
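The continuous-control pattern is simple to sketch. A hedged example, with the function name `squash` and the box bounds purely illustrative rather than taken from any specific RL library:

```python
import math

def squash(raw: float, low: float, high: float) -> float:
    """Map an unbounded network output into the box [low, high] via tanh.

    tanh sends the real line into (-1, 1); an affine rescale then moves
    that interval onto [low, high].
    """
    return low + 0.5 * (high - low) * (math.tanh(raw) + 1.0)

# Any real-valued action mean lands inside the box, however extreme.
for raw in (-100.0, -1.0, 0.0, 1.0, 100.0):
    action = squash(raw, -2.0, 2.0)
    assert -2.0 <= action <= 2.0
```

A raw output of $0$ maps to the centre of the box, mirroring tanh's zero-centred behaviour in the unscaled case.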

Tanh's enduring presence in recurrent architectures reflects the importance of bounded, zero-centred updates for stability; its retreat from feed-forward layers reflects the priority deep learning now places on avoiding vanishing gradients in very deep stacks.

Related terms: Sigmoid Function, ReLU, Activation Function, LSTM, GRU
