Visualisation

Causal masking forces a transformer to look only at the past

Last reviewed 5 May 2026

An upper-triangular mask sets future attention scores to minus infinity.

From the chapter: Chapter 13: Attention & Transformers

Glossary: causal attention, decoder

Transcript

Self-attention by default looks everywhere. Each token attends to every other token, before and after. Bidirectional. Good for understanding tasks like classification.

For language modelling, we want to predict the next token from only the past. A token cannot peek at the future without leaking the answer.

Causal masking solves this. Compute the attention scores: queries times transposed keys, scaled by the square root of the key dimension. We get a square matrix. Rows are queries, columns are keys.
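
A minimal NumPy sketch of that score computation. Everything here is illustrative: the sequence length n, the head dimension d, and the random Q and K matrices are made up for the example, not taken from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8                      # sequence length, head dimension (illustrative)
Q = rng.standard_normal((n, d))  # one query row per token
K = rng.standard_normal((n, d))  # one key row per token

scores = Q @ K.T / np.sqrt(d)    # (n, n): rows are queries, columns are keys
print(scores.shape)              # (4, 4)
```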

Apply a triangular mask. The lower triangle, including the diagonal, stays as is. The upper triangle gets minus infinity.
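
A sketch of that mask, again in NumPy. np.triu with k=1 selects the strictly upper triangle, which is exactly the future region described above; the zero scores are a stand-in so the block runs on its own.

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                           # stand-in for real scores
future = np.triu(np.ones((n, n), dtype=bool), k=1)  # strictly above the diagonal
masked = np.where(future, -np.inf, scores)          # lower triangle and diagonal unchanged
print(masked)
```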

Pass through softmax. Minus infinity becomes zero probability. Each token now only attends to itself and earlier tokens.
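
Continuing the sketch: exp(-inf) is exactly zero, so after the softmax every future position has zero weight and each row still sums to one over the allowed positions. With the stand-in scores all equal, each row is uniform over the past and current tokens.

```python
import numpy as np

n = 4
masked = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, 0.0)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))  # stable softmax
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
# Row 0: [1, 0, 0, 0]; row 1: [0.5, 0.5, 0, 0];
# row 2: [0.33, 0.33, 0.33, 0]; row 3: [0.25, 0.25, 0.25, 0.25]
```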

Visualise the attention weights as a heatmap. A staircase. Token one attends only to itself. Token two attends to tokens one and two. Token n attends to itself and every earlier token.
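
A minimal matplotlib sketch of that staircase, plotting the uniform causal weights from the previous example; the figure styling is illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 8
masked = np.where(np.triu(np.ones((n, n), dtype=bool), k=1), -np.inf, 0.0)
weights = np.exp(masked)
weights /= weights.sum(axis=-1, keepdims=True)

plt.imshow(weights, cmap="viridis")  # staircase: zeros above the diagonal
plt.xlabel("key position")
plt.ylabel("query position")
plt.title("Causal attention weights")
plt.colorbar(label="weight")
plt.show()
```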

This single mask turns a transformer into a left-to-right autoregressive model. GPT, Llama, Claude: every modern decoder uses this mask.

During training, we compute all positions in parallel. The mask ensures each position is trained on its causal context only. At inference time, we generate one token at a time, so the mask is satisfied automatically: the future tokens do not exist yet.
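
A sketch checking that parallelism claim: masked attention computed for all positions at once matches attending over each growing prefix one token at a time. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n, d = 6, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

# Parallel (training-time): mask the futures, one big matrix multiply.
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf
parallel = softmax(scores) @ V

# Incremental (inference-time): position t sees only keys/values 0..t.
incremental = np.stack([
    softmax(Q[t] @ K[: t + 1].T / np.sqrt(d)) @ V[: t + 1]
    for t in range(n)
])

print(np.allclose(parallel, incremental))  # True
```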

The mask is the difference between an encoder, which sees everything, and a decoder, which sees only the past.
