14.3 Autoregressive image models

The chain rule of probability lets us write any joint distribution as a product of conditionals. For a vector of $D$ dimensions:

$$p(x_1, x_2, \ldots, x_D) = \prod_{i=1}^D p(x_i \mid x_1, \ldots, x_{i-1})$$

This factorisation is exact: no approximation, no bound, no adversarial play. If we can model each conditional with a neural network, we have a tractable explicit-density model.
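As a concrete check, the identity can be verified numerically on a toy two-pixel "image". The sketch below (PyTorch, with arbitrary made-up probabilities) recovers the joint distribution exactly from the marginal and the conditional:

```python
import torch

# Toy verification of the chain rule on a two-pixel "image" with
# four intensity levels: p(x1, x2) = p(x1) * p(x2 | x1).
joint = torch.rand(4, 4)          # joint[a, b] = p(x1 = a, x2 = b)
joint /= joint.sum()              # normalise to a valid distribution

p_x1 = joint.sum(dim=1)                  # marginal p(x1)
p_x2_given_x1 = joint / p_x1[:, None]    # conditional p(x2 | x1)

reconstructed = p_x1[:, None] * p_x2_given_x1
assert torch.allclose(reconstructed, joint)   # exact: no approximation
```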

For text this is the mainstream approach (the Transformer decoder). For images, the dimensions are pixels, and an ordering must be imposed (typically raster scan, top-left to bottom-right, channel by channel).

PixelRNN

PixelRNN (van den Oord et al., 2016) uses an LSTM that scans the image row by row. Each pixel's distribution is conditioned on all pixels above it and to its left. For an 8-bit image this is a 256-way classification problem at each pixel: the network predicts a categorical distribution over intensities. Training maximises the log-likelihood:

$$\mathcal{L}(\theta) = \sum_{i=1}^D \log p_\theta(x_i \mid x_{\lt i})$$

PixelRNN was state-of-the-art when introduced in 2016, achieving lower (better) bits-per-dimension than earlier likelihood-based models on benchmarks such as CIFAR-10 and ImageNet.
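In code, this objective is ordinary cross-entropy over the 256 intensity classes, summed over all sub-pixels; dividing by $\ln 2$ and the number of dimensions converts nats to the bits-per-dimension metric. A minimal sketch (PyTorch; the logits layout here is an assumption, not a fixed convention):

```python
import math
import torch
import torch.nn.functional as F

def bits_per_dim(logits, x):
    """Negative log-likelihood in bits per dimension.

    logits: (B, 256, C, H, W) unnormalised scores, one 256-way
            categorical per sub-pixel (layout assumed here).
    x:      (B, C, H, W) integer intensities in 0..255 (long dtype).
    """
    nll_nats = F.cross_entropy(logits, x, reduction="sum")  # total NLL in nats
    return nll_nats / (math.log(2) * x.numel())             # nats -> bits/dim
```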

PixelCNN

PixelCNN replaces the LSTM with a stack of masked convolutions. The mask zeroes out connections from future pixels, ensuring that the receptive field of pixel $i$ contains only $x_{\lt i}$. This makes training fully parallel: every conditional is computed in one forward pass.

A single masked convolution suffices for the first layer (mask type A, which also excludes the centre pixel); deeper layers use mask type B, which no longer needs to exclude the centre pixel because, by construction, information about $x_i$ has not yet entered the network. Subsequent improvements (Gated PixelCNN, PixelCNN++) refined the architecture with gated activations, mixture-of-logistics output distributions, and conditioning on global features.
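A minimal masked convolution can be written by zeroing part of an ordinary convolution kernel before each forward pass. The sketch below (PyTorch) implements both mask types; the class name and structure are illustrative, not taken from any particular codebase:

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel sees only pixels above, or to the left in the
    same row (raster-scan order). Mask type 'A' also hides the centre
    pixel (first layer); type 'B' allows it (deeper layers)."""

    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.ones(1, 1, kH, kW)
        # zero the centre pixel (type A only) and everything to its right
        mask[:, :, kH // 2, kW // 2 + (mask_type == "B"):] = 0
        # zero every row below the centre
        mask[:, :, kH // 2 + 1:, :] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask   # re-apply mask before every pass
        return super().forward(x)
```

Note that this spatial mask ignores the ordering between colour channels within a pixel; the original PixelCNN additionally masks channel-to-channel connections to respect the full raster-scan, channel-by-channel ordering.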

Pros and cons

Autoregressive image models are statistically well-behaved: they give exact log-likelihoods, train stably with cross-entropy, and cover all modes of the data. They are also slow at sampling: generating a $256\times 256$ RGB image requires $256\times 256\times 3 = 196{,}608$ sequential network evaluations, one per sub-pixel. And they struggle to capture long-range structure: a model whose effective receptive field is only, say, $7\times 7$ cannot easily enforce that the left and right halves of a face match.
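The sequential cost is visible in the sampling loop itself: every sub-pixel needs its own forward pass, because its distribution depends on the values just sampled. A sketch (PyTorch; `model` is assumed to return logits shaped (B, 256, C, H, W)):

```python
import torch

@torch.no_grad()
def sample(model, B=1, C=3, H=32, W=32, device="cpu"):
    """Raster-scan sampling: one full forward pass per sub-pixel,
    i.e. C*H*W sequential network evaluations per image."""
    img = torch.zeros(B, C, H, W, device=device)
    for y in range(H):
        for x in range(W):
            for c in range(C):
                logits = model(img)[:, :, c, y, x]     # (B, 256)
                probs = logits.softmax(dim=-1)
                pixel = torch.multinomial(probs, 1)    # sample an intensity
                img[:, c, y, x] = pixel.squeeze(-1).float() / 255.0
    return img
```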

Autoregressive image generation has largely been displaced by latent-variable and diffusion approaches for high-resolution synthesis, but the chain-rule factorisation remains the dominant approach for language generation, and for any domain where exact likelihood is required (compression, anomaly detection).
