12.9 The gated recurrent unit

The GRU (Cho et al., 2014) simplifies the LSTM by merging the cell state and hidden state into a single state vector and reducing the number of gates to two. Concatenate $u_t = [h_{t-1}; x_t]$ as before. The GRU equations are

$$ \begin{aligned} z_t &= \sigma\!\left(W_z u_t + b_z\right) && \text{update gate} \\ r_t &= \sigma\!\left(W_r u_t + b_r\right) && \text{reset gate} \\ \tilde h_t &= \tanh\!\left(W_h \begin{bmatrix} r_t \odot h_{t-1} \\ x_t \end{bmatrix} + b_h\right) && \text{candidate hidden state} \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t && \text{hidden state update} \end{aligned} $$

The update gate $z_t$ plays the role of both the LSTM's forget gate and input gate: when $z_t \approx 0$ the previous state is kept; when $z_t \approx 1$ the candidate replaces it. The reset gate $r_t$ controls how much of the previous hidden state contributes to the candidate: when $r_t \approx 0$ the candidate is computed largely from $x_t$ alone, effectively forgetting the past for the purpose of constructing the new candidate.
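To make the equations concrete, here is a minimal single-step GRU in NumPy. The function name `gru_step`, the weight names, and the toy dimensions are illustrative choices for this sketch, not a reference implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step, following the equations above."""
    u_t = np.concatenate([h_prev, x_t])            # u_t = [h_{t-1}; x_t]
    z_t = sigmoid(Wz @ u_t + bz)                   # update gate
    r_t = sigmoid(Wr @ u_t + br)                   # reset gate
    u_tilde = np.concatenate([r_t * h_prev, x_t])  # [r_t ⊙ h_{t-1}; x_t]
    h_tilde = np.tanh(Wh @ u_tilde + bh)           # candidate hidden state
    return (1.0 - z_t) * h_prev + z_t * h_tilde    # hidden state update

# Example usage with small random parameters (d_x = 3 inputs, d_h = 4 hidden units).
rng = np.random.default_rng(0)
d_x, d_h = 3, 4
Wz, Wr, Wh = (0.1 * rng.normal(size=(d_h, d_h + d_x)) for _ in range(3))
bz, br, bh = (np.zeros(d_h) for _ in range(3))
h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):                # run five time steps
    h = gru_step(x, h, Wz, Wr, Wh, bz, br, bh)
print(h)
```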

The GRU has three weight matrices instead of four, hence about 75% of the parameters and FLOPs of an LSTM with the same hidden size. Empirically, GRUs and LSTMs perform similarly across most tasks; published comparisons (Chung et al., 2014; Greff et al., 2017) find no consistent winner. The choice between them is usually made by cross-validation. Both can be stacked into multi-layer networks (deep RNNs) and run bidirectionally.
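A quick back-of-the-envelope check of the 75% figure, counting weights only; the hidden and input sizes below are arbitrary:

```python
d_h, d_x = 512, 256                    # hypothetical hidden and input sizes
per_matrix = d_h * (d_h + d_x)         # each matrix maps [h_{t-1}; x_t] to d_h units
lstm_weights = 4 * per_matrix          # forget, input, output gates + candidate
gru_weights = 3 * per_matrix           # update, reset gates + candidate
print(gru_weights / lstm_weights)      # 0.75; biases (3 vs 4 vectors) barely change this
```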

The gradient analysis for the GRU mirrors that of the LSTM. Treating $z_t$ as a constant,

$$\frac{\partial h_t}{\partial h_{t-1}} = \mathrm{diag}(1 - z_t) + \mathrm{diag}(z_t) \cdot \frac{\partial \tilde h_t}{\partial h_{t-1}}.$$

The first term is the "skip path": when $z_t \approx 0$ the gradient flows through almost unchanged. As with the LSTM, the network can learn to keep $z_t$ small for long-term-memory dimensions, granting them a near-identity gradient path.
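As a sanity check on this analysis, the sketch below reuses the hypothetical `gru_step` from above (assumed to be in scope), estimates $\partial h_t / \partial h_{t-1}$ by finite differences, and drives the update-gate bias strongly negative so that $z_t \approx 0$; the resulting Jacobian is close to the identity.

```python
import numpy as np

# Finite-difference Jacobian of h_t with respect to h_{t-1}.
def jacobian_wrt_hprev(x_t, h_prev, params, eps=1e-5):
    d_h = h_prev.shape[0]
    J = np.zeros((d_h, d_h))
    for j in range(d_h):
        e = np.zeros(d_h); e[j] = eps
        J[:, j] = (gru_step(x_t, h_prev + e, *params)
                   - gru_step(x_t, h_prev - e, *params)) / (2 * eps)
    return J

rng = np.random.default_rng(1)
d_x, d_h = 3, 4
Wz, Wr, Wh = (0.1 * rng.normal(size=(d_h, d_h + d_x)) for _ in range(3))
br, bh = np.zeros(d_h), np.zeros(d_h)
bz = -10.0 * np.ones(d_h)                  # strongly negative bias -> z_t ≈ 0
params = (Wz, Wr, Wh, bz, br, bh)
x_t, h_prev = rng.normal(size=d_x), rng.normal(size=d_h)

J = jacobian_wrt_hprev(x_t, h_prev, params)
print(np.max(np.abs(J - np.eye(d_h))))     # small: a near-identity gradient path
```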

A subtle but consequential difference between the GRU and the LSTM is that the GRU exposes its full state $h_t$ as both the recurrent feedback and the layer output, whereas the LSTM keeps $c_t$ internal and exposes only $h_t = o_t \odot \tanh(c_t)$. The LSTM therefore has an extra degree of freedom: it can hold information internally without exposing it, which can be useful when the model needs to remember something without acting on it yet. On translation and language modelling this rarely matters; on more structured reasoning tasks the LSTM's separation can be advantageous.
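The difference in what is exposed shows up in the step interfaces. The sketch below restates the standard LSTM equations from the previous section with names and shapes chosen to match the earlier GRU sketch; it is an illustration, not a library API.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step with the standard gate equations."""
    u_t = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ u_t + bf)                  # forget gate
    i_t = sigmoid(Wi @ u_t + bi)                  # input gate
    o_t = sigmoid(Wo @ u_t + bo)                  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(Wc @ u_t + bc)
    h_t = o_t * np.tanh(c_t)                      # o_t ≈ 0 hides c_t from the output
    return h_t, c_t                               # c_t is carried forward, never emitted

# GRU:  the recurrent state and the layer output are the same vector.
#   h_t = gru_step(x_t, h_prev, ...)
# LSTM: c_t is recurrent-only; downstream layers see only h_t.
#   h_t, c_t = lstm_step(x_t, h_prev, c_prev, ...)
```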
