Summary
We have built up sequence modelling from first principles: the variable-length, order-sensitive, long-range-dependency setting that defeats feed-forward networks; the chain-rule decomposition of language modelling; the n-gram baseline and its sparsity problem; the move to dense word embeddings via word2vec, GloVe and FastText; and subword tokenisation via BPE, WordPiece and Unigram.

The recurrent neural network gave us a model with weight sharing across time and a hidden state acting as memory, but its vanilla form suffered the vanishing-gradient problem, analysed via the spectral radius of the step Jacobian. The LSTM and GRU added gating, allowing the cell state (or, in the GRU's case, the hidden state) to act as a near-linear gradient highway and pushing usable dependency length to hundreds of steps. Bidirectional and stacked variants extended the framework.

The seq2seq encoder–decoder applied recurrent networks to mappings between sequences of different lengths but suffered a fixed-context bottleneck, solved by attention, which produced soft alignments and removed the bottleneck in a single conceptual move. CTC handled unaligned outputs for tasks such as speech recognition. Beam search, top-$k$, top-$p$ and temperature sampling are the standard decoding strategies for any such model. Finally, we built a character-level LSTM in PyTorch and saw it produce locally fluent but globally incoherent text, one motivation among many for the Transformer architecture, the subject of Chapter 13.
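For reference, the chain-rule decomposition at the heart of the chapter, writing $w_t$ for the $t$-th token (notation assumed to match the chapter):

$$
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}).
$$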
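The vanishing-gradient analysis can likewise be stated in one line. Writing $J_k = \partial h_k / \partial h_{k-1}$ for the step Jacobian, the gradient flowing across $T - t$ steps is a product of Jacobians, bounded by the product of their norms:

$$
\frac{\partial h_T}{\partial h_t} \;=\; \prod_{k=t+1}^{T} J_k,
\qquad
\left\lVert \frac{\partial h_T}{\partial h_t} \right\rVert \;\le\; \prod_{k=t+1}^{T} \lVert J_k \rVert.
$$

In the usual analysis, a spectral radius of the $J_k$ uniformly below 1 forces geometric decay of this product in $T - t$ (vanishing), while a value above 1 permits explosion.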
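The gating claim can be made concrete with the LSTM cell-state update. With forget gate $f_t$, input gate $i_t$ and candidate $\tilde{c}_t$ (standard names, assumed to match the chapter's notation):

$$
c_t \;=\; f_t \odot c_{t-1} \;+\; i_t \odot \tilde{c}_t,
\qquad
\frac{\partial c_t}{\partial c_{t-1}} \;\approx\; \operatorname{diag}(f_t),
$$

where the approximation treats the gates as constants. A forget gate near 1 keeps this Jacobian near the identity, which is exactly the near-linear gradient highway described above.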
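The decoding strategies also compose naturally in code. Below is a minimal sketch, not the chapter's implementation: a hypothetical `sample_next_token` that applies temperature scaling, top-$k$ filtering and top-$p$ (nucleus) filtering to a 1-D tensor of vocabulary logits before sampling.

```python
import torch

def sample_next_token(logits: torch.Tensor,
                      temperature: float = 1.0,
                      top_k: int = 0,
                      top_p: float = 1.0) -> int:
    """Sample one token id from 1-D `logits` of shape (vocab_size,)."""
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    logits = logits / max(temperature, 1e-8)

    # Top-k: keep only the k largest logits, mask the rest to -inf.
    if top_k > 0:
        kth_value = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits = torch.where(logits < kth_value,
                             torch.full_like(logits, float("-inf")),
                             logits)

    # Top-p (nucleus): keep the smallest prefix of the sorted
    # distribution whose cumulative probability reaches p.
    if top_p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        sorted_probs = torch.softmax(sorted_logits, dim=-1)
        cum_probs = sorted_probs.cumsum(dim=-1)
        # Drop tokens whose cumulative mass *before* them already
        # exceeds p, so at least one token always survives.
        drop = cum_probs - sorted_probs > top_p
        sorted_logits[drop] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(
            0, sorted_idx, sorted_logits)

    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Example on a toy 5-token vocabulary.
logits = torch.tensor([2.0, 1.5, 0.3, -0.5, -2.0])
print(sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.9))
```

Beam search, being a deterministic search rather than a per-step sampling rule, would live in the surrounding generation loop instead.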