- Describe the recurrent neural network (RNN) model and the vanishing/exploding gradient problem
- Explain the gating mechanism of LSTMs and GRUs and how it preserves long-range dependencies
- Build encoder–decoder sequence-to-sequence models for tasks such as machine translation
- Compare word embedding techniques (Word2Vec, GloVe, FastText) and what they capture semantically
- Define n-gram, neural, and large language models and compute cross-entropy and perplexity
Language is a sequence of words. Speech is a sequence of sound frames. Stock prices are a sequence of numbers. Biology runs on sequences of molecules. Unlike images, which have a fixed size, sequences vary in length and can have dependencies that stretch across hundreds of steps — the meaning of a word may hinge on another word far earlier in the sentence.
Standard feedforward networks cannot handle this. They take fixed-size inputs and produce fixed-size outputs. Sequence models fix this by processing inputs of any length and maintaining a form of memory over time.
This chapter covers RNNs, gated architectures (LSTMs and GRUs), the encoder–decoder framework, word embeddings, and language models — the essential background for the attention and Transformer chapter that follows.
12.1 Recurrent Neural Networks
An RNN processes a sequence one step at a time, carrying a hidden state that summarises what it has seen so far. At each time step t:
ht = f(Whh ht−1 + Wxh xt + b)
where f is typically tanh. The same weights are used at every step (weight sharing across time), so the number of parameters is fixed regardless of sequence length.
The hidden state acts as the network's memory. In principle, ht encodes information from the entire history x1, …, xt. An output yt can be read from the hidden state at any step.
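The recurrence can be sketched in a few lines of NumPy (the dimensions and weight scales here are hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4-dim inputs, 8-dim hidden state.
d_in, d_h = 4, 8
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden-to-hidden weights
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))  # input-to-hidden weights
b = np.zeros(d_h)

def rnn_forward(xs):
    """Run the recurrence h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)."""
    h = np.zeros(d_h)           # h_0: empty memory
    states = []
    for x in xs:                # the same weights are used at every step
        h = np.tanh(W_hh @ h + W_xh @ x + b)
        states.append(h)
    return states

seq = rng.normal(size=(5, d_in))  # a length-5 sequence
states = rnn_forward(seq)
print(len(states), states[-1].shape)  # 5 (8,)
```

Note that the parameter count is fixed by d_in and d_h; the same function handles a sequence of any length.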
Training: Backpropagation Through Time
Training "unrolls" the RNN into a deep feedforward network with shared weights and applies the chain rule backwards. The total gradient is the sum of contributions from each time step.
The Vanishing Gradient Problem
Gradients must flow through many multiplicative steps. They tend to either vanish (shrink to near zero) or explode (grow without bound). Vanishing gradients mean the network cannot learn long-range dependencies — the error signal from a distant target dies out before reaching the relevant input. Gradient clipping (rescaling the gradient if its norm exceeds a threshold) fixes explosion but not vanishing. Fixing vanishing gradients requires changing the architecture.
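Gradient clipping is a one-line fix; a minimal sketch (the threshold of 5.0 is a hypothetical choice, tuned in practice):

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale grad so its L2 norm does not exceed max_norm.

    This tames exploding gradients; it does nothing for vanishing ones.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.full(100, 10.0)                 # an exploding gradient, norm 100
print(np.linalg.norm(clip_gradient(g)))        # rescaled to ~5.0

g_small = np.full(100, 1e-8)           # a vanishing gradient, norm 1e-7
print(np.linalg.norm(clip_gradient(g_small)))  # unchanged: still tiny
```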
When Simple RNNs Work
For short-range dependencies, simple RNNs are fast and effective. They also introduced key patterns that carry forward: bidirectional processing, stacking multiple layers, and teacher forcing.
12.2 LSTMs & GRUs
LSTM
The Long Short-Term Memory network [Hochreiter, 1997] was designed to fix the vanishing gradient problem. Its key idea: add a cell state — a dedicated memory channel that runs through the sequence with only linear interactions. This creates a "gradient highway" where error signals can travel hundreds of steps without fading.
Three gates control what gets remembered, written, and read:
- Forget gate: ft = σ(Wf[ht−1, xt] + bf). Decides what to erase from the cell state.
- Input gate: it = σ(Wi[ht−1, xt] + bi). Decides what new information to write.
- Candidate update: c̃t = tanh(Wc[ht−1, xt] + bc).
- New cell state: ct = ft ⊙ ct−1 + it ⊙ c̃t.
- Output gate: ot = σ(Wo[ht−1, xt] + bo). Decides what to expose as the hidden state.
- Hidden state: ht = ot ⊙ tanh(ct).
The cell state update uses only element-wise operations and addition — no matrix multiplication. The gradient of ct with respect to ct−1 is just the forget gate value, which the network can learn to hold near 1 when long-term memory is needed. In practice, LSTMs learn dependencies spanning hundreds of steps, compared to 10–20 for simple RNNs.
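One LSTM step can be sketched in NumPy, following the gate equations above (stacking the four gates into one weight matrix is an implementation convenience, not part of the definition; all sizes are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x] to the four gate pre-activations:
    rows 0:d forget gate, d:2d input gate, 2d:3d candidate, 3d:4d output gate.
    """
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:d])             # forget gate
    i = sigmoid(z[d:2*d])           # input gate
    c_tilde = np.tanh(z[2*d:3*d])   # candidate update
    o = sigmoid(z[3*d:4*d])         # output gate
    c = f * c_prev + i * c_tilde    # element-wise only: the gradient highway
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 3, 5
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(7, d_in)):  # a length-7 sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)  # (5,) (5,)
```

The cell-state line `c = f * c_prev + i * c_tilde` is the whole trick: the Jacobian of c with respect to c_prev is just the (diagonal) forget gate.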
GRU
The Gated Recurrent Unit [Cho, 2014] simplifies the LSTM by merging the cell state and hidden state into one vector and using two gates instead of three:
- Update gate zt: interpolates between the old state and the new candidate (zt near 0 keeps the old state).
- Reset gate rt: controls how much of the old state feeds into the candidate.

Equations: zt = σ(Wz [ht−1, xt]), rt = σ(Wr [ht−1, xt]), h̃t = tanh(W [rt ⊙ ht−1, xt]), ht = (1 − zt) ⊙ ht−1 + zt ⊙ h̃t. (Sign conventions for zt vary across papers; some write the interpolation with zt and 1 − zt swapped.)
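These equations translate directly into code; a minimal sketch (biases omitted for brevity, weight shapes hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h_prev, Wz, Wr, W):
    """One GRU step: h_t = (1 - z_t) * h_prev + z_t * h_tilde."""
    hx = np.concatenate([h_prev, x])
    z = sigmoid(Wz @ hx)   # update gate
    r = sigmoid(Wr @ hx)   # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x]))  # candidate
    return (1.0 - z) * h_prev + z * h_tilde  # interpolate old vs new

rng = np.random.default_rng(2)
d_in, d_h = 3, 5
Wz, Wr, W = (rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for _ in range(3))
h = np.zeros(d_h)
for x in rng.normal(size=(6, d_in)):  # a length-6 sequence
    h = gru_step(x, h, Wz, Wr, W)
print(h.shape)  # (5,)
```

Compared with the LSTM sketch, there is one state vector instead of two and three weight matrices instead of four.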
LSTM vs GRU
Neither consistently wins. GRUs are faster (fewer parameters). LSTMs may have a slight edge on tasks needing very long memory. The choice is usually made by cross-validation. Both can be stacked (deep RNNs) and run in both directions (bidirectional RNNs).
Historical Context
LSTMs and GRUs dominated from ~2014 to ~2017, setting records in translation, speech, and language modelling. They established that explicit gating is essential for long-range dependencies — a principle carried forward into attention and Transformers. Recent innovations like state-space models (S4, Mamba) draw a clear lineage from the LSTM's ideas about gating and linear state propagation.
12.3 Sequence-to-Sequence Models
Many important tasks map one variable-length sequence to another: translation, summarisation, speech-to-text. The seq2seq framework [Sutskever, 2014; Cho, 2014] handles this with two components:
- Encoder: reads the input and compresses it into a fixed-length context vector c.
- Decoder: generates the output one token at a time, conditioned on c.
Both are typically LSTMs or GRUs. The decoder uses teacher forcing during training: at each step, it receives the ground-truth previous token (not its own prediction).
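Teacher forcing amounts to shifting the target sequence by one position; a toy illustration (the sentence and special tokens here are hypothetical):

```python
# Target sentence for a hypothetical translation example.
target = ["<sos>", "the", "cat", "sat", "<eos>"]

# Teacher forcing: at step t the decoder is fed the ground-truth token
# target[t] and trained to predict target[t+1], regardless of what it
# would have predicted itself.
decoder_inputs = target[:-1]   # ["<sos>", "the", "cat", "sat"]
decoder_targets = target[1:]   # ["the", "cat", "sat", "<eos>"]

for x, y in zip(decoder_inputs, decoder_targets):
    print(f"input: {x:>6}  ->  predict: {y}")
```

At inference time no ground truth exists, so the decoder must feed its own predictions back in; this train/test mismatch is what scheduled sampling (discussed below) tries to reduce.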
The Bottleneck
Compressing an entire input sequence into a single vector works for short sentences. For long inputs, information is inevitably lost — especially about early parts of the sequence. Cho et al. [Cho, 2014] showed that performance degrades sharply with input length. This directly motivated the attention mechanism (Chapter 13), which lets the decoder look back at all encoder states at each step.
Decoding Strategies
- Greedy decoding: pick the highest-probability token each step. Fast but myopic.
- Beam search: maintain the B highest-scoring candidate sequences and expand each of them at every step. With beam widths of 4–10, outputs are substantially better than greedy decoding. Length normalisation and coverage penalties prevent the search from favouring overly short outputs.
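The difference shows up on a two-step toy model (the token probabilities below are hypothetical, chosen so that the greedy first choice leads to a weak continuation):

```python
import math

# Hypothetical two-step model: next-token probabilities depend on the
# previous token. Greedy picks "a" first (0.6 > 0.4) and is then stuck
# with a weak continuation; beam search keeps "b" alive and wins.
first = {"a": 0.6, "b": 0.4}
second = {"a": {"x": 0.55, "y": 0.45}, "b": {"x": 0.9, "y": 0.1}}

def greedy():
    t1 = max(first, key=first.get)
    t2 = max(second[t1], key=second[t1].get)
    return [t1, t2], math.log(first[t1]) + math.log(second[t1][t2])

def beam_search(beam_width=2):
    # each beam: (tokens, accumulated log-probability)
    beams = sorted(
        ([[t], math.log(p)] for t, p in first.items()),
        key=lambda c: c[1], reverse=True,
    )[:beam_width]
    candidates = []
    for tokens, score in beams:                    # expand every beam
        for tok, p in second[tokens[-1]].items():
            candidates.append((tokens + [tok], score + math.log(p)))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[0]

print(greedy()[0])       # ['a', 'x'], probability 0.6 * 0.55 = 0.33
print(beam_search()[0])  # ['b', 'x'], probability 0.4 * 0.90 = 0.36
```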
Impact
Seq2seq achieved the first competitive neural translation results and was adapted for summarisation, dialogue, image captioning (CNN encoder), and speech recognition. Key training techniques became standard: teacher forcing, scheduled sampling (gradually replacing ground-truth inputs with the model's own predictions during training, to close the train–test gap), and attention-based alignment. The Transformer has since replaced recurrent seq2seq, but the encoder–decoder paradigm itself persists in models like T5 and BART.
12.4 Word Embeddings
Before a sequence model can process language, discrete words must become continuous vectors.
One-Hot Encoding
The simplest approach: a binary vector of length V (vocabulary size) with a single 1. This is sparse, high-dimensional, and tells you nothing about word relationships. The distance between "cat" and "dog" equals the distance between "cat" and "democracy."
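A quick check of this claim (the five-word vocabulary is hypothetical):

```python
import numpy as np

# Hypothetical 5-word vocabulary; each word is a length-5 one-hot vector.
vocab = ["cat", "dog", "democracy", "runs", "the"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Every pair of distinct words is at Euclidean distance sqrt(2):
d1 = np.linalg.norm(one_hot("cat") - one_hot("dog"))
d2 = np.linalg.norm(one_hot("cat") - one_hot("democracy"))
print(d1, d2)  # both sqrt(2); the geometry encodes no similarity at all
```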
Word2Vec
Mikolov et al. [Mikolov, 2013] introduced Word2Vec, which learns dense vectors (100–300 dimensions) from large text corpora. Two training objectives:
- CBOW: predict a word from its surrounding context.
- Skip-gram: predict the context from a word.
The resulting vectors capture semantic relationships as geometric directions. The famous example: vec("king") − vec("man") + vec("woman") ≈ vec("queen").
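The analogy is computed by vector arithmetic followed by a nearest-neighbour search. A toy sketch with hand-built 2-d vectors (hypothetical, deliberately constructed so the analogy holds; real learned embeddings are 100–300-dimensional):

```python
import numpy as np

# Hand-built toy vectors: axis 0 is a "royalty" direction, axis 1 a
# "gender" direction. Purely illustrative, not learned.
emb = {
    "king":  np.array([0.9,  0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1,  0.8]),
    "woman": np.array([0.1, -0.8]),
    "apple": np.array([-0.5, 0.1]),
}

def nearest(v, exclude):
    """Vocabulary word with the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], v))

v = emb["king"] - emb["man"] + emb["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))  # queen
```

Excluding the query words is standard practice in analogy evaluation, since the result is otherwise often one of the inputs themselves.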
GloVe
GloVe [Pennington, 2014] takes a global approach: build a word–word co-occurrence matrix from the corpus and learn embeddings by factorising it. The dot product of two vectors approximates the log of their co-occurrence count. Results are comparable to Word2Vec.
FastText
FastText [Bojanowski, 2017] extends Word2Vec by representing each word as a bag of character n-grams (typically length 3–6). The embedding is the sum of the n-gram embeddings. Two key advantages:
- Out-of-vocabulary words: FastText can build embeddings for words never seen in training by summing their n-grams.
- Morphological sharing: "running," "runner," and "runs" share n-grams and get similar vectors, without any explicit grammar rules.
This is especially valuable for morphologically rich languages like Finnish, Turkish, and Arabic.
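The n-gram decomposition itself is simple to sketch (this follows the boundary-marker scheme described in the FastText paper; the hashing and bucketing details are omitted):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, as in FastText's scheme."""
    w = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.add(w[i:i + n])
    grams.add(w)  # FastText also keeps the whole word as one unit
    return grams

# "running" and "runner" share stem n-grams, so their summed embeddings
# end up close, with no explicit morphology rules.
shared = char_ngrams("running") & char_ngrams("runner")
print(sorted(g for g in shared if len(g) == 3))  # ['<ru', 'run', 'unn']
```

An out-of-vocabulary word like "runnable" would still decompose into mostly familiar n-grams, which is how FastText embeds unseen words.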
Contextualised Embeddings
Word2Vec, GloVe, and FastText give each word a single vector regardless of context. "Bank" gets the same vector whether it means a financial institution or a river bank.
ELMo [Peters, 2018] fixed this by running a bidirectional LSTM over the input and producing different embeddings depending on the sentence. BERT [Devlin, 2019] and later Transformer models produce even richer contextualised representations. Modern NLP uses contextualised embeddings from pretrained models as standard.
Bias in Embeddings
Embeddings trained on real text inherit societal biases. Occupation words may be closer to one gender vector than the other. Debiasing techniques exist (projecting out the bias direction, training on balanced data), but the problem is not fully solved. Practitioners must be aware of this, especially in high-stakes applications.
12.5 Language Models
A language model assigns a probability to a sequence of tokens:
P(w1, …, wT) = ∏_{t=1}^{T} P(wt | w1, …, wt−1)
Each conditional is the probability of the next token given everything before it. The model is trained to predict the next word at each position.
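Cross-entropy and perplexity follow directly from this factorisation; a minimal sketch (the per-token probabilities are hypothetical):

```python
import math

def cross_entropy_and_perplexity(token_probs):
    """Given the model's probability for each token of a held-out sequence,
    return average cross-entropy (nats/token) and perplexity = exp(H)."""
    H = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return H, math.exp(H)

# Hypothetical per-token probabilities from some model:
probs = [0.2, 0.5, 0.1, 0.25]
H, ppl = cross_entropy_and_perplexity(probs)
print(round(H, 3), round(ppl, 3))  # 1.498 4.472

# A uniform model over a 10,000-word vocabulary has perplexity 10,000:
H_u, ppl_u = cross_entropy_and_perplexity([1 / 10_000] * 4)
print(round(ppl_u))  # 10000
```

Perplexity can be read as the model's effective branching factor: how many equally likely next words it is "choosing among" on average.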
N-Gram Models
The oldest approach: estimate the probability of the next word from its frequency given the preceding n − 1 words. Bigrams (n = 2) and trigrams (n = 3) were workhorses for decades. But as n grows, the number of possible n-grams explodes and most are never seen in the data. Smoothing (e.g., Kneser-Ney) redistributes probability mass, but the limits of fixed-context, count-based estimation remain.
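A maximum-likelihood bigram model is a few lines of counting (the corpus is a hypothetical toy):

```python
from collections import Counter

# A tiny toy corpus for maximum-likelihood bigram estimates.
corpus = "the cat sat on the mat the cat ran".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])  # count only tokens that have a successor

def p_bigram(prev, word):
    """MLE estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
print(p_bigram("cat", "ran"))  # 1/2
print(p_bigram("the", "dog"))  # 0.0: unseen bigram, which is why smoothing exists
```

Any sentence containing a single unseen bigram gets probability zero under the raw MLE model; smoothing methods such as Kneser-Ney exist precisely to repair this.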
Neural Language Models
Bengio et al. [Bengio, 2003] replaced n-gram tables with a neural network: concatenate the embeddings of the previous n − 1 words, pass through a hidden layer, and softmax over the vocabulary. This generalises to unseen word combinations — if "cat sat on the mat" is probable, "dog sat on the rug" should be too, because the embeddings are close. But the context window is still fixed.
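A sketch of this forward pass (all sizes and weights are hypothetical; training by backpropagation is omitted):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical sizes: vocabulary of 50 words, 8-dim embeddings,
# a 2-word context (trigram-style), 16 hidden units.
V, d, n_ctx, d_h = 50, 8, 2, 16
E = rng.normal(scale=0.1, size=(V, d))             # embedding table
W1 = rng.normal(scale=0.1, size=(d_h, n_ctx * d))  # context -> hidden
W2 = rng.normal(scale=0.1, size=(V, d_h))          # hidden -> vocabulary

def next_word_probs(context_ids):
    """Concatenate context embeddings, tanh hidden layer, softmax over V."""
    x = np.concatenate([E[i] for i in context_ids])  # (n_ctx * d,)
    h = np.tanh(W1 @ x)
    logits = W2 @ h
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    return exp / exp.sum()

probs = next_word_probs([4, 17])
print(probs.shape, round(float(probs.sum()), 6))  # (50,) 1.0
```

Because similar words get similar rows of E, contexts that differ only by a near-synonym produce similar hidden states and hence similar predictions, which is exactly the generalisation n-gram tables lack.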
Recurrent Language Models
RNN-based language models (RNNLMs) removed the fixed context. The hidden state encodes the full history. LSTM-based RNNLMs cut perplexity (the standard metric — the exponential of average negative log-likelihood per token) on Penn Treebank from over 140 (well-tuned trigram) to below 60. Techniques like weight tying (sharing the input embedding and output projection matrices) and variational dropout (applying the same dropout mask at every time step, rather than resampling) further improved results.
Transformer Language Models
The Transformer [Vaswani, 2017] (Chapter 13) has replaced RNNs as the dominant architecture. Models like GPT, GPT-2, and GPT-3 [Brown, 2020] scale to billions of parameters on trillions of tokens. Scaling laws [Kaplan, 2020] show that performance improves as a smooth power law of model size, data, and compute. Despite this shift, the core task is the same as Bengio's 2003 model: predict the next token.
From Tool to Foundation
Language models have evolved from a component in speech and translation systems to the central artefact of modern AI. Sufficiently large models acquire emergent capabilities [Wei, 2022]: arithmetic, common-sense reasoning, in-context learning [Brown, 2020]. Instruction tuning and RLHF [Ouyang, 2022] refine them into capable assistants.
The fundamental principle: by learning to predict the next word, a model acquires a rich representation of the structure, knowledge, and reasoning patterns in its training data. Understanding the path from n-grams to RNNLMs to Transformers is essential for grasping both the power and the limits of today's large language models.