12.9 The gated recurrent unit

The GRU (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state into a single state vector and reducing the number of gates to two. Concatenate $u_t = [h_{t-1}; x_t]$ as before. The GRU equations are

$$ \begin{aligned} z_t &= \sigma\!\left(W_z u_t + b_z\right) && \text{update gate} \\ r_t &= \sigma\!\left(W_r u_t + b_r\right) && \text{reset gate} \\ \tilde h_t &= \tanh\!\left(W_h \begin{bmatrix} r_t \odot h_{t-1} \\ x_t \end{bmatrix} + b_h\right) && \text{candidate hidden state} \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t && \text{hidden state update} \end{aligned} $$

The update gate $z_t$ plays the role of both the LSTM's forget gate and input gate: when $z_t \approx 0$ the previous state is kept; when $z_t \approx 1$ the candidate replaces it. The reset gate $r_t$ controls how much of the previous hidden state contributes to the candidate: when $r_t \approx 0$ the candidate is computed largely from $x_t$ alone, effectively forgetting the past for the purpose of constructing the new candidate.
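To make the equations concrete, here is a minimal single-step GRU in NumPy. The function name `gru_step`, the weight names, and the toy dimensions are illustrative choices for this sketch, not a reference implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step, following the equations above."""
    u_t = np.concatenate([h_prev, x_t])            # u_t = [h_{t-1}; x_t]
    z_t = sigmoid(Wz @ u_t + bz)                   # update gate
    r_t = sigmoid(Wr @ u_t + br)                   # reset gate
    u_tilde = np.concatenate([r_t * h_prev, x_t])  # [r_t ⊙ h_{t-1}; x_t]
    h_tilde = np.tanh(Wh @ u_tilde + bh)           # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde    # hidden state update

# Example usage with small random parameters (d_x = 3 inputs, d_h = 4 hidden units).
rng = np.random.default_rng(0)
d_x, d_h = 3, 4
Wz, Wr, Wh = (0.1 * rng.normal(size=(d_h, d_h + d_x)) for _ in range(3))
bz, br, bh = (np.zeros(d_h) for _ in range(3))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):                # run five time steps
    h = gru_step(x, h, Wz, Wr, Wh, bz, br, bh)
print(h)
```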

The GRU has three weight matrices instead of four, hence about 75% of the parameters and FLOPs of an LSTM with the same hidden size. Empirically, GRUs and LSTMs perform similarly across most tasks; published comparisons (Chung et al., 2014; Greff et al., 2017) find no consistent winner. The choice between them is usually made by cross-validation. Both can be stacked into multi-layer networks (deep RNNs) and run bidirectionally.
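A quick back-of-the-envelope check of the 75% figure, counting weights only; the hidden and input sizes below are arbitrary:

```python
d_h, d_x = 512, 256                    # hypothetical hidden and input sizes
per_matrix = d_h * (d_h + d_x)         # each matrix maps [h_{t-1}; x_t] to d_h units
lstm_weights = 4 * per_matrix          # forget, input, output gates + candidate
gru_weights = 3 * per_matrix           # update, reset gates + candidate
print(gru_weights / lstm_weights)      # 0.75; biases (3 vs 4 vectors) barely change this
```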

The gradient analysis for the GRU mirrors that of the LSTM. Treating $z_t$ as a constant,

$$\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(1 - z_t) + \mathrm{diag}(z_t) \cdot \frac{\partial \tilde h_t}{\partial h_{t-1}}.$$

The first term is the "skip path": when $z_t \approx 0$ the gradient flows through almost unchanged. As with the LSTM, the network can learn to keep $z_t$ small for long-term-memory dimensions, granting them a near-identity gradient path.
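As a sanity check on this analysis, the sketch below reuses the hypothetical `gru_step` from above (assumed to be in scope), estimates $\partial h_t / \partial h_{t-1}$ by finite differences, and drives the update-gate bias strongly negative so that $z_t \approx 0$; the resulting Jacobian is close to the identity.

```python
import numpy as np

# Finite-difference Jacobian of h_t with respect to h_{t-1}.
def jacobian_wrt_hprev(x_t, h_prev, params, eps=1e-5):
    d_h = h_prev.shape[0]
    J = np.zeros((d_h, d_h))
    for j in range(d_h):
        e = np.zeros(d_h); e[j] = eps
        J[:, j] = (gru_step(x_t, h_prev + e, *params)
                   - gru_step(x_t, h_prev - e, *params)) / (2 * eps)
    return J

rng = np.random.default_rng(1)
d_x, d_h = 3, 4
Wz, Wr, Wh = (0.1 * rng.normal(size=(d_h, d_h + d_x)) for _ in range(3))
br, bh = np.zeros(d_h), np.zeros(d_h)
bz = -10.0 * np.ones(d_h)                  # strongly negative bias -> z_t ≈ 0
params = (Wz, Wr, Wh, bz, br, bh)
x_t, h_prev = rng.normal(size=d_x), rng.normal(size=d_h)

J = jacobian_wrt_hprev(x_t, h_prev, params)
print(np.max(np.abs(J - np.eye(d_h))))     # small: a near-identity gradient path
```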

A subtle but consequential difference between the GRU and the LSTM is that the GRU exposes its full state $h_t$ as both the recurrent feedback and the layer output, whereas the LSTM keeps $c_t$ internal and exposes only $h_t = o_t \odot \tanh(c_t)$. The LSTM therefore has an extra degree of freedom: it can hold information internally without exposing it, which can be useful when the model needs to remember something without acting on it yet. On translation and language modelling this rarely matters; on more structured reasoning tasks the LSTM's separation can be advantageous.
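The difference in what is exposed shows up in the step interfaces. The sketch below restates the standard LSTM equations from the previous section with names and shapes chosen to match the earlier GRU sketch; it is an illustration, not a library API.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step with the standard gate equations."""
    u_t = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ u_t + bf)                  # forget gate
    i_t = sigmoid(Wi @ u_t + bi)                  # input gate
    o_t = sigmoid(Wo @ u_t + bo)                  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(Wc @ u_t + bc)
    h_t = o_t * np.tanh(c_t)                      # o_t ≈ 0 hides c_t from the output
    return h_t, c_t                               # c_t is carried forward, never emitted

# GRU:  the recurrent state and the layer output are the same vector.
#   h_t = gru_step(x_t, h_prev, ...)
# LSTM: c_t is recurrent-only; downstream layers see only h_t.
#   h_t, c_t = lstm_step(x_t, h_prev, c_prev, ...)
```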
