12.11 Sequence-to-sequence models

Many of the most useful sequence tasks map an input sequence to an output sequence of a different length: machine translation (English sentence → French sentence), summarisation (article → summary), speech recognition (audio frames → text), question answering, dialogue. The sequence-to-sequence (seq2seq) framework, introduced almost simultaneously by Sutskever, Vinyals, and Le (2014) and Cho et al. (2014), handles all of these with a single recipe.

12.11.1 The encoder–decoder architecture

The recipe is to use two RNNs.

The encoder reads the input sequence $x_{1:T}$ and produces a sequence of hidden states $h_1^{\mathrm{enc}}, \ldots, h_T^{\mathrm{enc}}$. A single context vector $c$, typically the final encoder hidden state $h_T^{\mathrm{enc}}$, is taken as a summary of the input.

The decoder is a conditional language model that generates the output sequence $y_{1:T'}$ token by token, conditioned on $c$. It maintains its own hidden state $s_t$, initialised from $c$, and at each step consumes the previously generated token $y_{t-1}$ (or the ground-truth token at training time, see teacher forcing below) to produce the next:

$$s_t = \mathrm{RNN}^{\mathrm{dec}}(y_{t-1}, s_{t-1}), \qquad P(y_t \mid y_{\lt t}, x) = \mathrm{softmax}(W_{\mathrm{out}} s_t + b_{\mathrm{out}}).$$

Generation begins with a $\langle\text{BOS}\rangle$ token and continues until the model emits $\langle\text{EOS}\rangle$ or a maximum length is reached.
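
To make the recipe concrete, here is a minimal sketch in PyTorch of a GRU-based encoder–decoder with greedy decoding. The module layout, dimensions, and the assumption that token ids 0 and 1 are $\langle\text{BOS}\rangle$ and $\langle\text{EOS}\rangle$ are illustrative choices for this example, not taken from any particular system.

```python
# A minimal GRU-based encoder-decoder (seq2seq) sketch.
# Token ids 0 and 1 are assumed to be <BOS> and <EOS>; all names are illustrative.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_in, vocab_out, emb_dim=64, hid_dim=128):
        super().__init__()
        self.embed_src = nn.Embedding(vocab_in, emb_dim)
        self.embed_tgt = nn.Embedding(vocab_out, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_out)            # W_out, b_out

    def forward(self, src, tgt_in):
        # src: (B, T) source token ids; tgt_in: (B, T') decoder inputs (teacher forcing)
        _, c = self.encoder(self.embed_src(src))             # c: (1, B, H) final encoder state
        dec_states, _ = self.decoder(self.embed_tgt(tgt_in), c)  # decoder initialised from c
        return self.out(dec_states)                          # (B, T', vocab_out) logits

    @torch.no_grad()
    def greedy_decode(self, src, bos_id=0, eos_id=1, max_len=20):
        _, s = self.encoder(self.embed_src(src))
        y = torch.full((src.size(0), 1), bos_id, dtype=torch.long)  # start with <BOS>
        outputs = []
        for _ in range(max_len):
            dec, s = self.decoder(self.embed_tgt(y), s)
            y = self.out(dec[:, -1]).argmax(-1, keepdim=True)  # most likely next token
            outputs.append(y)
            if (y == eos_id).all():                            # stop once every sequence emits <EOS>
                break
        return torch.cat(outputs, dim=1)

model = Seq2Seq(vocab_in=1000, vocab_out=1200)
src = torch.randint(2, 1000, (4, 7))        # batch of 4 source sequences, length 7
print(model.greedy_decode(src).shape)        # (4, at most 20)
```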

12.11.2 Teacher forcing and scheduled sampling

At training time, the decoder is given the ground-truth previous token $y_{t-1}^*$ rather than its own prediction. This is teacher forcing: it turns sequence generation into a series of per-step classification problems, each conditioned on a correct prefix. The drawback is exposure bias, a mismatch between training (the decoder always sees perfect inputs) and inference (errors compound because the decoder consumes its own previous outputs).

Scheduled sampling (Bengio et al. 2015) addresses this by mixing the two regimes during training: at each step, the decoder is fed the ground-truth token with probability $\epsilon_t$ and its own previous prediction otherwise, where $\epsilon_t$ decays over training so the model is gradually exposed to its own outputs. Empirically, this can improve generalisation, though theoretical concerns remain about the bias it introduces into the gradient.
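
A sketch of one training step with scheduled sampling, reusing the Seq2Seq module from the sketch above. The linear decay schedule and token ids here are illustrative assumptions, not the schedule used by Bengio et al.

```python
# One training step with scheduled sampling, reusing the Seq2Seq sketch above.
# eps is the probability of feeding the ground-truth previous token
# (pure teacher forcing when eps = 1.0); the decay schedule is illustrative.
import torch
import torch.nn.functional as F

def scheduled_sampling_step(model, src, tgt, eps, bos_id=0):
    # src: (B, T) source ids; tgt: (B, T') gold target ids (without <BOS>)
    _, s = model.encoder(model.embed_src(src))           # context from the encoder
    prev = torch.full((tgt.size(0), 1), bos_id, dtype=torch.long)
    loss = 0.0
    for t in range(tgt.size(1)):
        dec, s = model.decoder(model.embed_tgt(prev), s)
        logits = model.out(dec[:, -1])                    # (B, vocab_out)
        loss = loss + F.cross_entropy(logits, tgt[:, t])
        # With probability eps feed the gold token, otherwise the model's own prediction.
        use_gold = (torch.rand(tgt.size(0)) < eps).long()
        pred = logits.argmax(-1)
        prev = (use_gold * tgt[:, t] + (1 - use_gold) * pred).unsqueeze(1)
    return loss / tgt.size(1)

# Illustrative decay: start at eps = 1.0 (teacher forcing) and anneal towards 0.5.
for step in range(3):
    eps = max(0.5, 1.0 - 0.01 * step)
    loss = scheduled_sampling_step(model, src, torch.randint(2, 1200, (4, 6)), eps)
    loss.backward()   # in practice followed by optimiser.step() and zero_grad()
```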

12.11.3 The information bottleneck

Compressing an entire input sequence into a single fixed-dimensional vector $c$ works for short inputs. For long inputs (in machine translation, sentences of 30 words and beyond), Cho et al. (2014) showed that BLEU scores degrade sharply as sentence length grows. Information about early input tokens has to survive a long passage through the encoder RNN and then be retained in $c$ while the decoder generates a possibly longer output. Vanishing-gradient effects bite the encoder; the cell state has finite capacity, and the bottleneck is real.

This empirical observation directly motivated the attention mechanism, which removes the bottleneck by allowing the decoder to look back at the entire sequence of encoder hidden states at every step.
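
The core of that idea fits in a few lines: at every decoder step, score each encoder hidden state against the current decoder state and take a weighted sum as a fresh context vector. The dot-product scoring below is a simplification for illustration; Bahdanau-style attention learns a small scoring network instead.

```python
# Minimal dot-product attention over encoder states: instead of relying on a
# single fixed context vector c, the decoder recomputes a context at every step.
import torch
import torch.nn.functional as F

def attention_context(dec_state, enc_states):
    # dec_state: (B, H) current decoder state; enc_states: (B, T, H) all encoder states
    scores = torch.bmm(enc_states, dec_state.unsqueeze(2)).squeeze(2)   # (B, T) alignment scores
    weights = F.softmax(scores, dim=1)                                   # attention distribution over input positions
    context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)     # (B, H) weighted sum of encoder states
    return context, weights

enc_states = torch.randn(4, 7, 128)    # 7 encoder states per example
dec_state = torch.randn(4, 128)
ctx, w = attention_context(dec_state, enc_states)
print(ctx.shape, w.shape)              # torch.Size([4, 128]) torch.Size([4, 7])
```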

12.11.4 Sutskever's reverse-input trick

Sutskever, Vinyals, and Le (2014) reported a curious empirical finding: reversing the input sequence (so that the encoder reads "C B A" when the source is "A B C") substantially improved translation quality. Their hypothesis was that this reduces the minimum time lag between corresponding input and output words: with reversed inputs, the first source word "A" is read last by the encoder, so its information is freshest in the final hidden state when the decoder begins generating the first target word. The trick is now of historical interest only (bidirectional encoders and attention make it unnecessary), but it is a clean illustration of how the temporal asymmetries of recurrent architectures can dominate empirical performance, and of the kinds of seemingly ad hoc fixes the field relied on before attention.

12.11.5 Beyond translation

The seq2seq framework rapidly spread to other tasks.

Summarisation. The same encoder–decoder maps a long article to a short summary. The challenges differ from translation: the output is much shorter than the input (compression), there are many valid summaries, and the original ROUGE evaluation metrics reflect this multiplicity. Modern abstractive summarisation uses Transformer-based seq2seq models (BART, T5, PEGASUS).

Image captioning. Replace the recurrent encoder with a CNN; feed the image's final feature vector to the decoder as the initial state. Vinyals et al. (2015) demonstrated this with promising results; later work added attention over CNN feature maps (Xu et al. 2015), which became the template for visual question answering and a precursor to today's multimodal models.
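
A sketch of this wiring, with a torchvision backbone standing in for the CNN; the projection layer, dimensions, and module names are assumptions for the example, not the exact configuration of Vinyals et al.

```python
# Sketch of the captioning wiring described above: a CNN produces an image
# feature vector, which is projected to initialise the decoder RNN's hidden state.
# The torchvision backbone and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, emb_dim=64, hid_dim=128):
        super().__init__()
        self.cnn = resnet18(weights=None)
        self.cnn.fc = nn.Identity()                       # keep the 512-d pooled feature
        self.init_proj = nn.Linear(feat_dim, hid_dim)     # image feature -> initial decoder state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, images, captions_in):
        feats = self.cnn(images)                                  # (B, 512) image features
        h0 = torch.tanh(self.init_proj(feats)).unsqueeze(0)       # (1, B, H) initial decoder state
        dec, _ = self.rnn(self.embed(captions_in), h0)
        return self.out(dec)                                      # per-step vocabulary logits

captioner = CaptionDecoder(vocab_size=5000)
logits = captioner(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(logits.shape)                                               # torch.Size([2, 12, 5000])
```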

Speech recognition. A bidirectional LSTM encoder over audio-frame features and an LSTM decoder generating characters or word-pieces. Listen, Attend and Spell (Chan et al. 2016) was an influential attention-based encoder–decoder ASR system; the Conformer (Gulati et al. 2020) combined self-attention with convolution in the encoder. Whisper large-v3-turbo (October 2024, 4-decoder-layer distillation, 216× real-time) is the current production default; the open ASR leaderboard's top entries combine a Conformer encoder with an LLM decoder.

Dialogue. Vinyals and Le (2015) trained a seq2seq on movie subtitles to produce conversational responses. The results were charming but limited; modern dialogue systems use Transformer-based models with reinforcement learning from human feedback, but the architectural blueprint is recognisably the same.
