Also known as: RNN, vanilla RNN
A recurrent neural network (RNN) processes a sequence $(x_1, x_2, \ldots, x_T)$ by maintaining a hidden state that is updated at each timestep. The vanilla RNN recurrence is
$$h_t = \phi(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = W_{hy} h_t + b_y$$
with $\phi$ a non-linearity (typically $\tanh$). The hidden state $h_t$ acts as a compressed summary of the input prefix $x_{1:t}$.
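The recurrence above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: the helper `matvec`, the function name `rnn_forward`, the zero initial state $h_0 = 0$, and the toy weight values are all assumptions made for the example.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_hh, W_xh, W_hy, b_h, b_y):
    """Run the vanilla RNN recurrence over xs; return (hidden states, outputs)."""
    h = [0.0] * len(b_h)  # assumed initial state h_0 = 0
    hs, ys = [], []
    for x in xs:
        # z_t = W_hh h_{t-1} + W_xh x_t + b_h
        z = [a + b + c for a, b, c in zip(matvec(W_hh, h), matvec(W_xh, x), b_h)]
        h = [math.tanh(v) for v in z]                       # h_t = tanh(z_t)
        y = [a + b for a, b in zip(matvec(W_hy, h), b_y)]   # y_t = W_hy h_t + b_y
        hs.append(h)
        ys.append(y)
    return hs, ys

# Toy sizes (hypothetical values): 2 hidden units, 1 input feature, 1 output.
W_hh = [[0.5, -0.3], [0.2, 0.4]]
W_xh = [[1.0], [-1.0]]
W_hy = [[1.0, 1.0]]
b_h = [0.0, 0.0]
b_y = [0.0]
hs, ys = rnn_forward([[1.0], [0.0], [-1.0]], W_hh, W_xh, W_hy, b_h, b_y)
```

Each entry of `hs` depends on every earlier input through the carried state `h`, which is what makes $h_t$ a summary of $x_{1:t}$.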
Training uses backpropagation through time (BPTT): the network is unrolled into a deep feed-forward chain of $T$ copies sharing weights, and standard backpropagation gives gradients
$$\frac{\partial \mathcal{L}}{\partial W_{hh}} = \sum_t \frac{\partial \mathcal{L}_t}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}}$$
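where each $\partial h_t / \partial W_{hh}$ sums contributions from all earlier timesteps. Writing $z_i = W_{hh} h_{i-1} + W_{xh} x_i + b_h$ for the pre-activation, the chain rule weights the contribution from $k$ steps back by a product of Jacobians: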
$$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{i=t-k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} = \prod_{i=t-k+1}^{t} \mathrm{diag}(\phi'(z_i)) W_{hh}$$
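The vanishing gradient problem follows directly: the norm of this product of $k$ Jacobians shrinks exponentially with $k$ when the factors $\mathrm{diag}(\phi'(z_i)) W_{hh}$ are contractive (for $\tanh$, where $|\phi'| \le 1$, it is sufficient that the largest singular value of $W_{hh}$ is below 1), and it can grow exponentially when the spectral radius of $W_{hh}$ exceeds 1. In practice, vanilla RNNs struggle to learn dependencies beyond roughly 10–20 timesteps.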
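The exponential behaviour is easiest to see in a scalar RNN, where each Jacobian reduces to the number $\phi'(z_i)\, w$. The sketch below is an illustration under stated assumptions: the function name `grad_through_time`, the constant input sequence, and the weights 0.5 and 1.5 are hypothetical, and the growing case uses an identity activation because $\tanh$ saturation would otherwise shrink $\phi'$ and mask the effect.

```python
import math

def grad_through_time(w, xs, h0=0.0, linear=False):
    """Return [d h_t / d h_0 for t = 1..T] for the scalar RNN h_t = phi(w*h_{t-1} + x_t)."""
    h, grad, grads = h0, 1.0, []
    for x in xs:
        z = w * h + x
        h = z if linear else math.tanh(z)
        dphi = 1.0 if linear else 1.0 - h * h  # tanh'(z) = 1 - tanh(z)^2
        grad *= dphi * w                       # multiply in one Jacobian factor
        grads.append(grad)
    return grads

xs = [0.1] * 30                                        # hypothetical inputs
shrinking = grad_through_time(0.5, xs)                 # |factor| <= 0.5 at every step
growing = grad_through_time(1.5, xs, linear=True)      # factor = 1.5 at every step
print(shrinking[-1], growing[-1])
```

After 30 steps the first gradient is bounded by $0.5^{30}$, while the second equals $1.5^{30}$, so a loss signal 30 steps away is either effectively invisible or overwhelmingly large.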
LSTM and GRU address this with explicit gating mechanisms that allow gradients to flow through additive paths.
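As an illustration of the additive path, consider the standard LSTM cell-state update (without peephole connections). The symbols $c_t$ (cell state), $f_t$ and $i_t$ (forget and input gates), and $\tilde{c}_t$ (candidate update) are not defined elsewhere in this entry and are introduced here only for the example:
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
Because the gates depend on $h_{t-1}$ and $x_t$ rather than on $c_{t-1}$, the direct path through the cell state has Jacobian
$$\frac{\partial c_t}{\partial c_{t-1}} = \mathrm{diag}(f_t)$$
so along this path the gradient over $k$ steps is scaled by $\prod_i \mathrm{diag}(f_i)$ with no repeated multiplication by $W_{hh}$. When the forget gates stay close to 1, the gradient is passed back nearly unchanged.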
Bidirectional RNNs run two RNNs in opposite directions and concatenate their hidden states, useful when the entire sequence is available at inference time (text classification, NER). Deep RNNs stack multiple recurrent layers.
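The bidirectional construction can be sketched with a scalar $\tanh$ RNN. This is a simplified illustration: the names `run_rnn` and `bidirectional`, the scalar hidden state, and the parameter values are assumptions for the example, not part of any particular library.

```python
import math

def run_rnn(xs, w_h, w_x):
    """Scalar tanh RNN; returns the hidden state at every timestep."""
    h, hs = 0.0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        hs.append(h)
    return hs

def bidirectional(xs, fwd_params, bwd_params):
    """Pair forward and backward hidden states at each timestep."""
    fwd = run_rnn(xs, *fwd_params)
    # Run the second RNN over the reversed sequence, then flip its
    # states back so that index t refers to the same input position.
    bwd = run_rnn(xs[::-1], *bwd_params)[::-1]
    return [(f, b) for f, b in zip(fwd, bwd)]

# Hypothetical parameters: (w_h, w_x) for each direction.
states = bidirectional([1.0, 0.0, -1.0], (0.5, 1.0), (0.3, 1.0))
```

At position $t$ the forward component summarises $x_{1:t}$ and the backward component summarises $x_{t:T}$, which is why the whole sequence must be available before any output can be produced.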
RNNs, studied since the 1980s, were the dominant neural sequence model until roughly 2018. They were displaced by Transformers for large-scale language modelling because Transformers parallelise across positions during training, whereas RNNs are inherently sequential; this makes Transformers dramatically more efficient to train on modern hardware. RNNs remain useful in resource-constrained settings, in streaming/online applications where the full sequence is not available, and as the conceptual ancestor of state-space models like Mamba.
Truncated BPTT is the standard practical technique: backpropagate gradients for only a limited window (typically 32–512 timesteps) rather than the full sequence, trading exact gradients for tractable memory.
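A minimal sketch of truncated BPTT for a scalar RNN $h_t = \tanh(w h_{t-1} + x_t)$ is given below, with the gradient written out by hand. The function name `tbptt_grad`, the squared-error loss, and the sample data are assumptions chosen for illustration; practical implementations typically rely on an automatic differentiation framework and detach the carried state at window boundaries.

```python
import math

def tbptt_grad(w, xs, targets, window):
    """Gradient of sum_t 0.5*(h_t - target_t)^2 w.r.t. w, truncated to windows."""
    total_grad, h_carry = 0.0, 0.0
    for start in range(0, len(xs), window):
        chunk_x = xs[start:start + window]
        chunk_y = targets[start:start + window]
        # Forward through the window, storing states for the backward pass.
        hs = [h_carry]
        for x in chunk_x:
            hs.append(math.tanh(w * hs[-1] + x))
        # Backward through this window only; h_carry is treated as a constant.
        dh_next = 0.0  # gradient arriving from later steps of the same window
        for t in range(len(chunk_x), 0, -1):
            dh = (hs[t] - chunk_y[t - 1]) + dh_next
            dz = dh * (1.0 - hs[t] ** 2)   # back through tanh
            total_grad += dz * hs[t - 1]   # contribution to dL/dw at step t
            dh_next = dz * w               # pass gradient to the previous step
        h_carry = hs[-1]  # the state is carried forward, the gradient is not
    return total_grad

# Hypothetical data for illustration.
xs = [0.5, -0.2, 0.8, 0.1, -0.6, 0.3]
targets = [0.1, 0.0, 0.4, 0.2, -0.3, 0.1]
g_full = tbptt_grad(0.7, xs, targets, window=len(xs))  # exact BPTT gradient
g_trunc = tbptt_grad(0.7, xs, targets, window=2)       # truncated approximation
```

With `window` equal to the sequence length the result is the exact BPTT gradient; with a shorter window only `window + 1` states are stored at a time, and the terms that cross window boundaries are dropped.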
Related terms: LSTM, GRU, Vanishing Gradient Problem, Backpropagation, Mamba
Discussed in:
- Chapter 12: Sequence Models