Chapter Twelve

Sequence Models

Learning Objectives
  1. Describe the recurrent neural network (RNN) model and the vanishing/exploding gradient problem
  2. Explain the gating mechanism of LSTMs and GRUs and how it preserves long-range dependencies
  3. Build encoder–decoder sequence-to-sequence models for tasks such as machine translation
  4. Compare word embedding techniques (Word2Vec, GloVe, FastText) and what they capture semantically
  5. Define n-gram, neural, and large language models and compute cross-entropy and perplexity

A sentence is a sequence of words. A spoken utterance is a sequence of acoustic frames. A video is a sequence of images. An electrocardiogram is a sequence of voltages. The genome is a sequence of bases. A great deal of what we want machines to understand and produce takes the form of sequences whose length is not known in advance and whose elements are not independent: the meaning of any one element is bound up with what came before, and sometimes what will come after.

Feed-forward networks of the kind we built in Chapter 9 are unsuited to this. They take an input of fixed dimension, run it through a fixed pipeline of layers, and produce an output of fixed dimension. If we tried to use one to read a sentence, we would have to pad or truncate every sentence to the same length, and the network would have no way to know that the word "running" two positions back is the same kind of object as the word "running" four positions back, every position would be a separate set of weights with no shared structure. We need a model that processes elements one at a time, that uses the same parameters at every position, and that maintains some internal state to carry information across positions.

This chapter develops that model from first principles. We begin with the formal setting of discrete-sequence modelling and the chain-rule decomposition that makes language a probabilistic problem. We work through the classical n-gram language model and its limitations, then introduce continuous word representations, the bridge between discrete tokens and the world of dot products and gradients. We cover modern subword tokenisation, then the recurrent neural network in full mathematical detail, including a derivation of backpropagation through time and the spectral analysis that reveals why simple recurrent networks fail at long-range dependencies. We then build the gated architectures, LSTM and GRU, that solved this problem and ruled neural NLP from roughly 2014 to 2017. Bidirectional and stacked variants follow, then the encoder–decoder framework, the Bahdanau attention mechanism that broke its information bottleneck, the connectionist temporal classification loss for unaligned sequence outputs, and the family of decoding strategies, greedy, beam, top-$k$, top-$p$, temperature, that are used to sample from any sequence model at test time. Finally we build a character-level LSTM in PyTorch from scratch and use it to motivate the move to Transformers in the next chapter.

In this chapter

This site is currently in Beta. Please get in touch via chrispaton.org with any suggestions, questions or comments.