The gated recurrent unit (GRU), introduced by Kyunghyun Cho et al. in 2014, is a simplification of the LSTM that uses two gates instead of three. It computes:
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) \quad \text{(reset gate)}$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) \quad \text{(update gate)}$$
$$\tilde h_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t$$
The reset gate $r_t$ controls how much of the previous hidden state $h_{t-1}$ contributes to the candidate state $\tilde h_t$. The update gate $z_t$ acts as a single combined input/forget gate, interpolating elementwise between the previous state and the candidate.
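The four equations above can be sketched as a single time step in NumPy. This is a minimal illustration, not a reference implementation: the helper names `gru_step` and `init_params`, the dictionary layout of the weights, and the small random initialisation are all assumptions made for the example.

```python
import numpy as np


def sigmoid(a):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))


def init_params(d_x, d_h, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in ("r", "z", "h"):  # reset gate, update gate, candidate
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(d_h, d_x))
        p[f"U_{g}"] = rng.normal(scale=0.1, size=(d_h, d_h))
        p[f"b_{g}"] = np.zeros(d_h)
    return p


def gru_step(x_t, h_prev, p):
    """One GRU step, following the equations in the text."""
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    # The reset gate scales h_prev before it enters the candidate.
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    # Elementwise interpolation between the old state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_cand


# Usage: run a 5-step sequence of 4-dimensional inputs through the cell.
params = init_params(d_x=4, d_h=3)
h = np.zeros(3)
for x_t in np.random.default_rng(1).normal(size=(5, 4)):
    h = gru_step(x_t, h, params)
print(h.shape)  # (3,)
```

Because $h_t$ is a convex combination of $h_{t-1}$ and a $\tanh$ output, each component stays in $(-1, 1)$ as long as the initial state does.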
A GRU layer has $3(d_h^2 + d_x d_h + d_h)$ parameters, where $d_x$ is the input dimension and $d_h$ the hidden dimension. It uses three gate-style transformations rather than the LSTM's four, which makes it about 25% smaller than an LSTM with the same dimensions. Empirically, GRU and LSTM perform comparably on most tasks, with GRU sometimes preferred for smaller models and LSTM for very long sequences.
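The parameter formula is easy to check numerically. The sketch below assumes one bias vector per transformation, as in the equations of this entry, and a standard LSTM with four such transformations. Some libraries keep separate input and recurrent bias vectors, so their reported counts can be slightly higher. The function names are illustrative.

```python
def gru_params(d_x, d_h):
    """Parameters in one GRU layer: 3 transformations of (U, W, b)."""
    return 3 * (d_h * d_h + d_x * d_h + d_h)


def lstm_params(d_x, d_h):
    """Parameters in one standard LSTM layer: 4 transformations of (U, W, b)."""
    return 4 * (d_h * d_h + d_x * d_h + d_h)


print(gru_params(128, 256))   # 295680
print(lstm_params(128, 256))  # 394240
print(gru_params(128, 256) / lstm_params(128, 256))  # 0.75
```

The ratio is exactly $3/4$ for any choice of $d_x$ and $d_h$, which is where the 25% figure comes from.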
The merged input/forget gate has a notable consequence: the hidden state (the GRU has no separate cell state) interpolates between past and present rather than retaining old information and adding new information independently. The LSTM's separate forget and input gates are coupled in the GRU, so a unit cannot fully keep its old value and fully write a new one at the same time.
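The coupling can be made explicit by writing the two state updates side by side. The LSTM symbols $f_t$, $i_t$, $c_t$ and $\tilde c_t$ (forget gate, input gate, cell state and candidate) follow the usual LSTM notation and are not defined elsewhere in this entry:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t \quad \text{(LSTM, independent } f_t \text{ and } i_t\text{)}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t \quad \text{(GRU, i.e. } f_t = 1 - z_t,\ i_t = z_t\text{)}$$

In the LSTM, a unit can have $f_t \approx 1$ and $i_t \approx 1$ together, keeping the old value while adding the candidate. In the GRU the two weights always sum to one: if a component of $z_t$ is $0.9$, the candidate gets weight $0.9$ and the old state is necessarily scaled by $0.1$.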
Both GRU and LSTM have been substantially displaced by Transformers for large-scale modelling but remain widely used in production (especially for streaming, on-device, and resource-constrained applications) and as the conceptual basis for modern state-space models.
Related terms: LSTM, Recurrent Neural Network, Vanishing Gradient Problem
Discussed in:
- Chapter 12: Sequence Models, Sequence Models