Autoregressive image models generate images by factorising the joint pixel distribution into a product of conditionals and predicting pixels (or pixel groups) one at a time in a fixed raster order. They were the dominant family of likelihood-based image generators between 2015 and 2020, before being eclipsed for raw sample quality by diffusion and large-scale GANs, but remain influential as the conceptual ancestors of modern multimodal models.
Factorisation. For an image with $N$ pixels in a fixed ordering, the chain rule gives
$$p(\mathbf{x}) = \prod_{i=1}^{N} p(x_i \mid x_1, x_2, \ldots, x_{i-1}).$$
Each conditional is a categorical distribution over pixel intensities (typically 256 levels per channel). Training minimises the negative log-likelihood (equivalently, maximises the log-likelihood):
$$\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\, \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{\lt i}).$$
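The objective above can be sketched in a few lines of numpy. This is an illustrative toy, not from the text: the "model" is any function mapping a prefix $x_{<i}$ to a categorical distribution over the 256 intensity levels, and here a dummy uniform conditional stands in for a trained network.

```python
import numpy as np

def uniform_model(prefix, num_levels=256):
    """Dummy conditional p(x_i | x_{<i}): ignores the prefix, predicts uniformly."""
    return np.full(num_levels, 1.0 / num_levels)

def negative_log_likelihood(pixels, model):
    """Sum of -log p(x_i | x_{<i}) over a flattened pixel sequence."""
    nll = 0.0
    for i, x in enumerate(pixels):
        probs = model(pixels[:i])   # conditional given everything generated so far
        nll -= np.log(probs[x])
    return nll

pixels = [12, 255, 0, 128]          # a flattened 4-pixel "image"
nll = negative_log_likelihood(pixels, uniform_model)
# Under the uniform model, every pixel costs exactly log(256) nats.
```

A trained model earns a lower loss precisely by making each conditional sharper than uniform given the visible prefix.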
PixelRNN (van den Oord et al., 2016a). Uses two-dimensional LSTMs (Row-LSTM, Diagonal BiLSTM) to model dependencies between pixels. Captures long-range structure but is prohibitively slow due to recurrent unrolling.
PixelCNN (van den Oord et al., 2016b). Replaces the recurrence with masked convolutions: each output pixel depends only on already-generated pixels above and to the left. Masking is enforced by zeroing out half the convolutional kernel:
Mask A (first layer), for a 5×5 kernel:

    1 1 1 1 1
    1 1 1 1 1
    1 1 0 0 0
    0 0 0 0 0
    0 0 0 0 0
Subsequent layers use Mask B, in which the centre position is also visible. PixelCNN trains far faster than PixelRNN at a slight cost in sample quality; Gated PixelCNN and PixelCNN++ narrow the gap.
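The mask construction can be sketched in numpy. This is a minimal single-channel sketch (the function name and 5×5 size are illustrative; real PixelCNNs also mask across colour channels):

```python
import numpy as np

def make_mask(size, mask_type):
    """Build a PixelCNN kernel mask: Mask A hides the centre, Mask B reveals it."""
    mask = np.zeros((size, size))
    c = size // 2
    mask[:c, :] = 1        # all rows above the centre: already generated
    mask[c, :c] = 1        # pixels to the left of the centre in the same row
    if mask_type == "B":
        mask[c, c] = 1     # from the second layer on, the centre feature is visible
    return mask

mask_a = make_mask(5, "A")   # matches the 5x5 Mask A pattern shown above
mask_b = make_mask(5, "B")   # identical except for a 1 at the centre (2, 2)
```

Multiplying a convolutional kernel elementwise by the mask before each forward pass enforces the raster-order dependency structure.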
ImageGPT (Chen et al., 2020). Applies a vanilla transformer decoder to a flattened sequence of pixel tokens (after k-means clustering pixel values into 512 codes). Trained on ImageNet at up to 64×64 resolution, it learns rich self-supervised representations: linear probes on its features achieve competitive classification accuracy. It demonstrates that the autoregressive recipe scales with model size in images just as GPT had shown on text.
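The quantisation step can be sketched as a nearest-centroid lookup. The 4-centroid codebook below is purely illustrative (the paper fits 512 centroids with k-means over RGB values); the function name is an assumption:

```python
import numpy as np

# Tiny illustrative "codebook": black, red, green, white.
codebook = np.array(
    [[0, 0, 0], [255, 0, 0], [0, 255, 0], [255, 255, 255]], dtype=float
)

def tokenize(pixels, codebook):
    """Map each RGB pixel to the index of its nearest codebook centroid."""
    # pixels: (N, 3), codebook: (K, 3) -> pairwise distances: (N, K)
    d = np.linalg.norm(pixels[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

pixels = np.array([[250, 5, 5], [10, 10, 10]], dtype=float)
tokens = tokenize(pixels, codebook)   # near-red -> 1, near-black -> 0
```

After this step the image is a sequence of integers, and the transformer treats it exactly like a text sequence.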
Sampling. Generation is inherently sequential:
- Initialise $\mathbf{x}$ as empty.
- For $i = 1, \ldots, N$: sample $x_i \sim p_\theta(\cdot \mid x_{\lt i})$.
This costs $N$ network evaluations per image: a $32 \times 32$ RGB image needs $3072$ forward passes, making real-time generation infeasible at high resolutions. Caching computations (the KV cache for transformers) reduces the cost of each step but not the number of steps.
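The sampling loop above can be sketched directly. A dummy uniform conditional stands in for the trained network (names are illustrative); the point is the structure: one network call per pixel, each conditioned on everything sampled so far.

```python
import numpy as np

rng = np.random.default_rng(0)
N, LEVELS = 16, 256                 # a 4x4 single-channel "image"

def conditional(prefix):
    """Stand-in for p_theta(. | x_{<i}); a real model would use the prefix."""
    return np.full(LEVELS, 1.0 / LEVELS)

image = []
for i in range(N):                  # N sequential forward passes, no parallelism
    probs = conditional(image)
    image.append(int(rng.choice(LEVELS, p=probs)))
# After the loop, `image` holds N sampled intensities in raster order.
```

Because step $i$ cannot begin until step $i-1$ finishes, the $O(N)$ wall-clock cost is irreducible for a pure pixel-level model.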
Strengths and weaknesses.
| Strength | Weakness |
|---|---|
| Exact log-likelihoods, principled training | $O(N)$ slow sampling |
| No mode collapse (unlike GANs) | Raster ordering breaks 2-D translation invariance |
| Strong density estimation benchmarks | Sample quality below diffusion at equal compute |
Legacy. The autoregressive paradigm dominates language and survives in images via two routes. First, VQ-VAE + transformer approaches (DALL-E 1, Parti) compress images into discrete latent tokens and then model those tokens autoregressively, sidestepping the resolution bottleneck. Second, modern multimodal models such as Chameleon and GPT-4o operate on interleaved text and image tokens with the same autoregressive objective, making the technique central to next-generation systems even as pure pixel-level models have faded.
Related terms: Language Model, Transformer, Cross-Entropy Loss, GPT
Discussed in:
- Chapter 11: CNNs, Generative Models