Autoregressive image models generate images by factorising the joint pixel distribution into a product of conditionals and predicting pixels (or pixel groups) one at a time in a fixed raster order. They were the dominant family of likelihood-based image generators between 2015 and 2020, before being eclipsed for raw sample quality by diffusion and large-scale GANs, but remain influential as the conceptual ancestors of modern multimodal models.
Factorisation. For an image with $N$ pixels in a fixed ordering, the chain rule gives
$$p(\mathbf{x}) = \prod_{i=1}^{N} p(x_i \mid x_1, x_2, \ldots, x_{i-1}).$$
Each conditional is a categorical distribution over pixel intensities (typically 256 levels per channel). Training minimises the negative log-likelihood (equivalently, maximises the log-likelihood):
$$\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{\lt i}).$$
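The objective above can be sketched in a few lines of numpy. This is an illustrative toy, not from the text: the "model" is any function mapping a prefix $x_{<i}$ to a categorical distribution over the 256 intensity levels, and here a dummy uniform conditional stands in for a trained network.

```python
import numpy as np

def uniform_model(prefix, num_levels=256):
    """Dummy conditional p(x_i | x_{<i}): ignores the prefix, predicts uniformly."""
    return np.full(num_levels, 1.0 / num_levels)

def negative_log_likelihood(pixels, model):
    """Sum of -log p(x_i | x_{<i}) over a flattened pixel sequence."""
    nll = 0.0
    for i, x in enumerate(pixels):
        probs = model(pixels[:i])   # conditional given everything generated so far
        nll -= np.log(probs[x])
    return nll

pixels = [12, 255, 0, 128]          # a flattened 4-pixel "image"
nll = negative_log_likelihood(pixels, uniform_model)
# Under the uniform model, every pixel costs exactly log(256) nats.
```

A trained model earns a lower loss precisely by making each conditional sharper than uniform given the visible prefix.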
PixelRNN (van den Oord et al., 2016a). Uses two-dimensional LSTMs (Row-LSTM, Diagonal BiLSTM) to model dependencies between pixels. Captures long-range structure but is prohibitively slow due to recurrent unrolling.
PixelCNN (van den Oord et al., 2016b). Replaces the recurrence with masked convolutions: each output pixel depends only on already-generated pixels above and to the left. Masking is enforced by zeroing out half the convolutional kernel:
Mask A (first layer), for a 5×5 kernel:

    1 1 1 1 1
    1 1 1 1 1
    1 1 0 0 0
    0 0 0 0 0
    0 0 0 0 0
Subsequent layers use Mask B, in which the centre position is also visible. PixelCNN trains far faster than PixelRNN at a slight cost in sample quality; Gated PixelCNN and PixelCNN++ narrow the gap.
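The mask construction can be sketched in numpy. This is a minimal single-channel sketch (the function name and 5×5 size are illustrative; real PixelCNNs also mask across colour channels):

```python
import numpy as np

def make_mask(size, mask_type):
    """Build a PixelCNN kernel mask: Mask A hides the centre, Mask B reveals it."""
    mask = np.zeros((size, size))
    c = size // 2
    mask[:c, :] = 1        # all rows above the centre: already generated
    mask[c, :c] = 1        # pixels to the left of the centre in the same row
    if mask_type == "B":
        mask[c, c] = 1     # from the second layer on, the centre feature is visible
    return mask

mask_a = make_mask(5, "A")   # matches the 5x5 Mask A pattern shown above
mask_b = make_mask(5, "B")   # identical except for a 1 at the centre (2, 2)
```

Multiplying a convolutional kernel elementwise by the mask before each forward pass enforces the raster-order dependency structure.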
ImageGPT (Chen et al., 2020). Applies a vanilla transformer decoder to a flattened sequence of pixel tokens (after k-means clustering pixel values into 512 codes). Trained on ImageNet at up to 64×64 resolution, it learns rich self-supervised representations: linear probes on its features achieve competitive classification accuracy. It demonstrates that the autoregressive recipe scales with model size in images just as GPT had shown on text.
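The quantisation step can be sketched as a nearest-centroid lookup. The 4-centroid codebook below is purely illustrative (the paper fits 512 centroids with k-means over RGB values); the function name is an assumption:

```python
import numpy as np

# Tiny illustrative "codebook": black, red, green, white.
codebook = np.array(
    [[0, 0, 0], [255, 0, 0], [0, 255, 0], [255, 255, 255]], dtype=float
)

def tokenize(pixels, codebook):
    """Map each RGB pixel to the index of its nearest codebook centroid."""
    # pixels: (N, 3), codebook: (K, 3) -> pairwise distances: (N, K)
    d = np.linalg.norm(pixels[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

pixels = np.array([[250, 5, 5], [10, 10, 10]], dtype=float)
tokens = tokenize(pixels, codebook)   # near-red -> 1, near-black -> 0
```

After this step the image is a sequence of integers, and the transformer treats it exactly like a text sequence.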
Sampling. Generation is inherently sequential:
- Initialise $\mathbf{x}$ as empty.
- For $i = 1, \ldots, N$: sample $x_i \sim p_\theta(\cdot \mid x_{\lt i})$.
This costs $N$ network evaluations per image: a $32 \times 32$ RGB image needs $3072$ forward passes, making real-time generation infeasible at high resolutions. Caching computations (the KV cache for transformers) reduces the cost of each step but not the number of steps.
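The sampling loop above can be sketched directly. A dummy uniform conditional stands in for the trained network (names are illustrative); the point is the structure: one network call per pixel, each conditioned on everything sampled so far.

```python
import numpy as np

rng = np.random.default_rng(0)
N, LEVELS = 16, 256                 # a 4x4 single-channel "image"

def conditional(prefix):
    """Stand-in for p_theta(. | x_{<i}); a real model would use the prefix."""
    return np.full(LEVELS, 1.0 / LEVELS)

image = []
for i in range(N):                  # N sequential forward passes, no parallelism
    probs = conditional(image)
    image.append(int(rng.choice(LEVELS, p=probs)))
# After the loop, `image` holds N sampled intensities in raster order.
```

Because step $i$ cannot begin until step $i-1$ finishes, the $O(N)$ wall-clock cost is irreducible for a pure pixel-level model.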
Strengths and weaknesses.
| Strength | Weakness |
|---|---|
| Exact log-likelihoods, principled training | $O(N)$ slow sampling |
| No mode collapse (unlike GANs) | Raster ordering breaks 2-D translation invariance |
| Strong density estimation benchmarks | Sample quality below diffusion at equal compute |
Legacy. The autoregressive paradigm dominates language and survives in images via two routes. First, VQ-VAE + transformer approaches (DALL-E 1, Parti) compress images into discrete latent tokens and then model those tokens autoregressively, sidestepping the resolution bottleneck. Second, modern multimodal models such as Chameleon and GPT-4o operate on interleaved text and image tokens with the same autoregressive objective, making the technique central to next-generation systems even as pure pixel-level models have faded.
Related terms: Language Model, Transformer, Cross-Entropy Loss, GPT
Discussed in:
- Chapter 11: CNNs, Generative Models