Visualisation

Attention as alignment in seq2seq

Last reviewed 5 May 2026

An encoder produces hidden states; the decoder weights them dynamically per output token.

From Chapter 12: Sequence Models

Glossary: attention, seq2seq

Transcript

A French sentence on the left. An English sentence on the right. We want to translate.

The classic seq2seq model. An encoder RNN reads the French, producing one hidden state per word, and a final summary state.

A decoder RNN starts from the summary and generates English one word at a time.
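
To make the encoder-decoder loop concrete, here is a minimal NumPy sketch of the plain seq2seq computation described above. The toy word vectors, random weights, and names like rnn_step are illustrative assumptions; real systems use trained GRU or LSTM cells, but the shape of the computation is the same.

```python
import numpy as np

def rnn_step(W, U, x, h):
    # One recurrent step: new hidden state from input x and previous state h.
    return np.tanh(W @ x + U @ h)

hidden, d_in = 8, 4
rng = np.random.default_rng(0)
W_enc, U_enc = rng.normal(size=(hidden, d_in)), rng.normal(size=(hidden, hidden))
W_dec, U_dec = rng.normal(size=(hidden, d_in)), rng.normal(size=(hidden, hidden))

# Encoder: one hidden state per French word, plus a final summary state.
french = [rng.normal(size=d_in) for _ in range(5)]   # toy word vectors
h = np.zeros(hidden)
encoder_states = []
for x in french:
    h = rnn_step(W_enc, U_enc, x, h)
    encoder_states.append(h)
summary = h  # the single vector the decoder starts from

# Decoder: generate English one step at a time, starting from the summary alone.
s = summary
for t in range(5):
    y_prev = rng.normal(size=d_in)   # stand-in for the previous output embedding
    s = rnn_step(W_dec, U_dec, y_prev, s)
    # ... project s to the vocabulary and pick the next English word
```

Everything the decoder knows about the French sentence is squeezed into that single summary vector, which is exactly the bottleneck described next.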

Problem. The summary is a single vector. Long French sentences lose information.

Bahdanau, 2014. Add attention. At each decoder step, instead of using only the summary, look at every encoder state.

Compute an alignment score between the current decoder state and each encoder state. Softmax the scores into weights that sum to one. Take the weighted average of the encoder states. This context vector feeds into the decoder.
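
Here is a sketch of one attention step in the same toy NumPy setup. The additive score follows Bahdanau's form; the weight matrices W_a, U_a, v_a and the stand-in encoder and decoder states are assumptions for illustration.

```python
import numpy as np

def attention_step(s, encoder_states, W_a, U_a, v_a):
    # Alignment score between the decoder state s and each encoder state h_j.
    scores = np.array([v_a @ np.tanh(W_a @ s + U_a @ h_j) for h_j in encoder_states])
    # Softmax: weights are positive and sum to one.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: weighted average of encoder states, fed into the decoder.
    context = sum(w * h_j for w, h_j in zip(weights, encoder_states))
    return context, weights

hidden = 8
rng = np.random.default_rng(1)
W_a, U_a = rng.normal(size=(hidden, hidden)), rng.normal(size=(hidden, hidden))
v_a = rng.normal(size=hidden)

encoder_states = [rng.normal(size=hidden) for _ in range(5)]  # toy stand-ins
s = rng.normal(size=hidden)                                   # current decoder state
context, weights = attention_step(s, encoder_states, W_a, U_a, v_a)
print(weights, weights.sum())  # weights sum to 1; context has shape (hidden,)
```

The weights vector computed at each decoder step is one row of the attention pattern shown in the visualisation.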

Watch the attention pattern as English is generated. Each English word lights up the corresponding French words it attends to.

For the English word "cat", attention concentrates on the French "chat". For "the", attention spreads across the French articles. For verb tenses, attention covers multiple words.
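
One way to reproduce this kind of heatmap is sketched below, with made-up weights standing in for the model's output. The word lists, the numbers, and the use of matplotlib are my assumptions, not part of the visualisation itself.

```python
import numpy as np
import matplotlib.pyplot as plt

english = ["the", "cat", "sat"]
french = ["le", "chat", "s'est", "assis"]

# Toy attention weights: one row per English word, one column per French word,
# each row summing to one (illustrative numbers, not model output).
weights = np.array([
    [0.70, 0.10, 0.10, 0.10],   # "the" -> mostly "le"
    [0.05, 0.85, 0.05, 0.05],   # "cat" -> mostly "chat"
    [0.05, 0.05, 0.45, 0.45],   # "sat" -> spread over "s'est assis"
])

fig, ax = plt.subplots()
ax.imshow(weights, cmap="Greys")
ax.set_xticks(range(len(french)))
ax.set_xticklabels(french)
ax.set_yticks(range(len(english)))
ax.set_yticklabels(english)
ax.set_xlabel("French (encoder)")
ax.set_ylabel("English (decoder)")
plt.show()
```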

The attention pattern is a soft alignment, learned end to end without explicit supervision.

This was the seed. Three years later, Vaswani's team would replace the RNN entirely with attention, calling the resulting model the Transformer. But the alignment intuition was already complete.
