Glossary

Sequence-to-Sequence

Also known as: encoder-decoder

The Sequence-to-Sequence (seq2seq) framework, introduced by Sutskever et al. and Cho et al. in 2014, provides an elegant architecture for mapping one variable-length sequence to another. At its core are two components: an encoder that reads the input sequence and produces a compressed representation, and a decoder that generates the output sequence one token at a time, conditioned on that representation. The framework addresses tasks like machine translation, summarisation, image captioning (CNN encoder, RNN decoder), and speech recognition.

In the original formulation, both encoder and decoder are RNNs (typically LSTMs). The encoder processes the input and produces a final hidden state that serves as a context vector—a compressed summary of the entire input. The decoder is initialised with this context and generates output autoregressively, taking its own previous output (or the ground-truth token during training, via teacher forcing) as input for the next step. Generation continues until a special end-of-sequence token is produced.
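The mechanics above can be sketched in a few lines of plain Python. This is an illustrative toy, not a trained model: the "RNN" step is a scalar update h' = tanh(W·x + U·h) with made-up weights, chosen only to make the encoder/decoder loop and the teacher-forcing toggle visible.

```python
import math

W_ENC, U_ENC = 0.5, 0.9   # hypothetical encoder weights
W_DEC, U_DEC = 0.7, 0.8   # hypothetical decoder weights
SOS = 0.0                 # start-of-sequence "token"

def encode(inputs):
    """Fold the entire input sequence into one fixed-size context."""
    h = 0.0
    for x in inputs:
        h = math.tanh(W_ENC * x + U_ENC * h)  # recurrent update
    return h                                   # the context vector

def decode(context, targets=None, max_len=5):
    """Autoregressive decoding; pass `targets` to use teacher forcing."""
    h, prev, outputs = context, SOS, []
    for t in range(max_len):
        h = math.tanh(W_DEC * prev + U_DEC * h)
        y = h                       # toy readout: output equals hidden state
        outputs.append(y)
        # teacher forcing: next step is conditioned on the ground truth,
        # otherwise on the model's own previous output
        prev = targets[t] if targets is not None else y
    return outputs

ctx = encode([0.1, 0.4, -0.2])
free_run = decode(ctx)                                       # inference mode
forced = decode(ctx, targets=[0.3, 0.1, -0.5, 0.2, 0.0])     # training mode
```

Note how the two decoding modes share the first step (both start from the same context and start token) but diverge afterwards, which is exactly the train/inference mismatch that teacher forcing introduces.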

Squeezing the entire input into a single fixed-length vector creates an information bottleneck that limits basic seq2seq on long sequences. This limitation motivated the attention mechanism, which allows the decoder to look back at all encoder hidden states at each step, computing a weighted average focused on the most relevant parts. Attention transformed seq2seq, enabling neural machine translation to surpass statistical systems. The transformer architecture dispensed with recurrence entirely, but the encoder-decoder paradigm itself remains the organising principle of modern models like T5 and BART.
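The weighted average the decoder computes can be sketched as dot-product attention in plain Python (names and the toy 2-dimensional states are illustrative assumptions): score each encoder state against the current decoder state, normalise the scores with a softmax, and average the encoder states under those weights.

```python
import math

def attend(decoder_state, encoder_states):
    """Dot-product attention over all encoder hidden states."""
    # alignment scores: dot product of the decoder state with each encoder state
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # softmax (shifted by the max for numerical stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context: weighted average of the encoder states
    dim = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights

# three toy encoder hidden states; the decoder state is most similar to the first
enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ctx, w = attend([1.0, 0.0], enc)
```

Because the context is recomputed at every decoding step from all encoder states, no single fixed-length vector has to carry the whole input.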

Related terms: Attention Mechanism, Recurrent Neural Network, Transformer

Also defined in: Textbook of AI