12.10 Bidirectional and stacked RNNs
So far we have processed sequences strictly left-to-right. For tasks that are not generative (part-of-speech tagging, named-entity recognition, sentiment classification) there is no causal constraint, and we can let the representation at each position depend on both past and future. A bidirectional RNN runs two independent RNNs, one left-to-right and one right-to-left, and concatenates their hidden states:
$$\overrightarrow h_t = \mathrm{RNN}_{\rightarrow}(x_t, \overrightarrow h_{t-1}), \qquad \overleftarrow h_t = \mathrm{RNN}_{\leftarrow}(x_t, \overleftarrow h_{t+1}), \qquad h_t = [\overrightarrow h_t; \overleftarrow h_t].$$
The bidirectional hidden state $h_t$ at position $t$ now encodes the entire sequence, viewed from both directions. ELMo (Peters et al. 2018) used a stack of bidirectional LSTMs to produce contextualised word embeddings: the embedding of a word in context was a learned weighted sum of the hidden states at each layer.
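As a concrete illustration of the two-pass computation above, here is a minimal NumPy sketch of a bidirectional tanh-RNN; the function and weight names are ours, chosen for readability rather than taken from any library.

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, b):
    """Run a simple tanh RNN over a list of input vectors; return all hidden states."""
    h = np.zeros(Wh.shape[0])
    states = []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h + b)
        states.append(h)
    return states

def bidirectional_rnn(xs, fwd, bwd):
    """h_t = [forward state at t; backward state at t], as in the equation above."""
    f_states = rnn_pass(xs, *fwd)
    b_states = rnn_pass(xs[::-1], *bwd)[::-1]  # reverse back so index t lines up
    return [np.concatenate([f, b]) for f, b in zip(f_states, b_states)]

# Tiny demo: 5 positions, 8-dim inputs, 16-dim hidden state per direction.
rng = np.random.default_rng(0)
D, H, T = 8, 16, 5

def init():
    return 0.1 * rng.normal(size=(H, D)), 0.1 * rng.normal(size=(H, H)), np.zeros(H)

xs = [rng.normal(size=D) for _ in range(T)]
hs = bidirectional_rnn(xs, init(), init())
print(hs[0].shape)  # (32,): forward and backward halves concatenated
```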
A deep RNN stacks multiple recurrent layers, feeding the output of layer $\ell$ as the input of layer $\ell + 1$; each layer has its own parameters. Empirically, two to four layers tend to be the sweet spot: deeper stacks are prone to overfitting and to optimisation pathologies of their own. Residual connections and layer normalisation (which we meet in Chapter 13 in the Transformer context) help stabilise deeper RNNs.
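In practice these loops are rarely written by hand; deep-learning libraries expose depth and bidirectionality as constructor flags. A brief PyTorch sketch of a two-layer bidirectional LSTM (the sizes are illustrative):

```python
import torch
import torch.nn as nn

# Two stacked layers, each bidirectional: layer 2 consumes the concatenated
# forward/backward states produced by layer 1.
lstm = nn.LSTM(input_size=8, hidden_size=16, num_layers=2,
               bidirectional=True, batch_first=True)

x = torch.randn(4, 5, 8)            # (batch, time, features)
out, (h_n, c_n) = lstm(x)
print(out.shape)   # torch.Size([4, 5, 32]): 2 directions x 16 hidden units
print(h_n.shape)   # torch.Size([4, 4, 16]): (layers x directions, batch, hidden)
```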
12.10.1 Layer normalisation in RNNs
Batch normalisation, which works well for CNNs, is awkward for RNNs because it would couple statistics across time steps and make the recurrent computation depend on batch composition. Layer normalisation (Ba, Kiros, and Hinton 2016) computes its statistics across features within a single example, not across the batch:
$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu(x)}{\sigma(x)} + \beta, \qquad \mu(x) = \frac{1}{H} \sum_i x_i, \qquad \sigma(x) = \sqrt{\frac{1}{H} \sum_i (x_i - \mu(x))^2 + \epsilon}.$$
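A minimal NumPy implementation makes the contrast with batch normalisation explicit: every statistic below is computed within one example's feature vector, so the result is independent of whatever else is in the batch. The function name and the $\epsilon$ default are ours.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise one example's feature vector; no batch statistics are involved."""
    mu = x.mean()
    sigma = np.sqrt(((x - mu) ** 2).mean() + eps)
    return gamma * (x - mu) / sigma + beta

h = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(h, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(), y.std())  # approximately 0 and 1, regardless of the batch
```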
Inside an LSTM, the standard place to insert layer norm is on the pre-activation of each gate: writing $u_t = [x_t; h_{t-1}]$ for the cell's input, replace each gate pre-activation $W u_t$ with $\mathrm{LN}(W u_t)$ (implementations differ in whether the input and recurrent terms are normalised jointly or separately). This stabilises the training of deep stacked LSTMs and is widely used in modern recurrent architectures. Layer normalisation is also a cornerstone of the Transformer (Chapter 13).
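A sketch of one layer-normalised LSTM step following this recipe, with layer norm applied jointly to the stacked pre-activations: here $W$ maps the concatenated input $u_t = [x_t; h_{t-1}]$ to the four gates, and the gate ordering is one common convention, not the only one.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ln(z, gamma, beta, eps=1e-5):
    mu = z.mean()
    sigma = np.sqrt(((z - mu) ** 2).mean() + eps)
    return gamma * (z - mu) / sigma + beta

def ln_lstm_step(x, h, c, W, gamma, beta):
    """One LSTM step with layer norm on the gate pre-activations, i.e. LN(W u_t).
    W has shape (4H, D + H) and produces the four gates in the order i, f, o, g."""
    u = np.concatenate([x, h])                     # u_t = [x_t; h_{t-1}]
    z = ln(W @ u, gamma, beta)                     # normalised pre-activations
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # gated cell update
    h = sigmoid(o) * np.tanh(c)                    # new hidden state
    return h, c

# Tiny demo with 8-dim inputs and a 16-dim hidden state.
D, H = 8, 16
rng = np.random.default_rng(0)
h, c = ln_lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                    0.1 * rng.normal(size=(4 * H, D + H)),
                    np.ones(4 * H), np.zeros(4 * H))
```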
12.10.2 ELMo and the rise of contextualised embeddings
The combination of bidirectional LSTMs with the language-modelling objective produced ELMo (Embeddings from Language Models; Peters et al. 2018). Two independent LSTMs are trained on raw text, one left-to-right and one right-to-left, each predicting the next token in its direction. After training, for any input sentence the hidden states from each layer of each direction form a stack of representations. ELMo defines the contextual embedding of position $t$ as a learned weighted sum of these representations, with task-specific weights tuned on each downstream task.
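The weighted sum is simple enough to state in a few lines. In the sketch below, `layer_states` stacks the per-layer states at a single position, `w` holds the raw task-specific weights (softmax-normalised, as in the ELMo paper), and `gamma` is the overall task-specific scale; the function name is ours.

```python
import numpy as np

def softmax(w):
    e = np.exp(w - w.max())
    return e / e.sum()

def elmo_embedding(layer_states, w, gamma):
    """Weighted sum over per-layer states at one position: gamma * sum_j s_j h_j."""
    s = softmax(w)                                  # normalised layer weights s_j
    return gamma * (s[:, None] * layer_states).sum(axis=0)

L, H = 3, 16                                        # e.g. 3 biLM layers, 16-dim states
states = np.random.default_rng(0).normal(size=(L, H))
emb = elmo_embedding(states, w=np.zeros(L), gamma=1.0)  # zero weights = uniform mix
print(emb.shape)  # (16,)
```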
ELMo demonstrated decisively that pretraining on a language-modelling objective produces representations that transfer to many downstream tasks (named-entity recognition, coreference resolution, sentiment, question answering), often by simply concatenating the ELMo vectors with task-specific embeddings. This methodology (pretrain large, fine-tune small) became the dominant pattern for the next several years and motivated the move to Transformer-based pretraining (BERT, GPT) in Chapter 13.