14.7 Normalising flows
A normalising flow is a generative model built from a sequence of invertible transformations applied to a simple base distribution, typically an isotropic Gaussian $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The plain-English picture is straightforward. You start with noise that you can sample and evaluate trivially. You then push that noise through a stack of carefully designed neural-network layers, each of which is a bijection: every input maps to exactly one output, and the mapping can be inverted in closed form. The composition of these layers warps the easy base distribution into something that resembles the data (faces, audio waveforms, particle-physics events, molecular geometries) while keeping the mathematics tractable enough that the density of any sample can be written down exactly.
Two properties make flows distinctive. First, because every transformation is invertible, the change-of-variables formula gives an exact log-likelihood. There is no evidence lower bound to slacken (as in the VAE), no minimax saddle to chase (as in the GAN), no partition function to estimate (as in energy-based models). You can train by maximum likelihood directly and the gradient is unbiased. Second, a single forward pass is enough to sample from the model, and a single pass through the inverse map is enough to evaluate density. For coupling-based flows the two operations are the same network run in opposite directions, so they cost the same; autoregressive flows, as we will see, trade one direction against the other.
The price of these guarantees is architectural. A general $d \times d$ Jacobian determinant costs $O(d^3)$, which is hopeless for images. The whole research programme on flows is therefore the search for transformations that are simultaneously expressive, easily inverted, and have a Jacobian whose determinant can be read off in $O(d)$ time. This is what §14.5 (the GAN), this section (the flow), and §14.9 (diffusion) all share at the level of motivation: they each learn to sample from a complicated data distribution. Flows alone deliver exact likelihood and fast sampling at the same time. The compromise is that they cannot squash or discard information, which costs them dearly when the data lie on a low-dimensional manifold inside a high-dimensional pixel space, the regime where GANs and diffusion now dominate.
Change-of-variables formula
The starting point is one of the oldest results in measure theory. If $\mathbf{x} = f(\mathbf{z})$ for a smooth invertible $f: \mathbb{R}^d \to \mathbb{R}^d$, the density of $\mathbf{x}$ is the density of $\mathbf{z}$ at the corresponding point, scaled by how much $f$ stretches or compresses volume locally. That stretch is the absolute determinant of the Jacobian $\mathbf{J}_f$. In symbols,
$$p_X(\mathbf{x}) = p_Z(\mathbf{z}) \left|\det \mathbf{J}_f(\mathbf{z})\right|^{-1}, \qquad \mathbf{z} = f^{-1}(\mathbf{x}).$$
Taking logs gives the form used in training,
$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) - \log\left|\det \mathbf{J}_f(\mathbf{z})\right|.$$
Read this as a budget. The first term rewards $\mathbf{z}$ for being likely under the base distribution, pushing the inverted data point towards the centre of the Gaussian. The second term is the volume accounting: stretching means $|\det \mathbf{J}_f| > 1$, which subtracts from the log-likelihood; compressing means $|\det \mathbf{J}_f| < 1$, which adds to it. A well-trained flow learns to compress around the data manifold and stretch in directions of low probability.
Composition is the second key fact. If $f = f_K \circ f_{K-1} \circ \cdots \circ f_1$ is a stack of $K$ invertible layers, the chain rule gives a product of Jacobians, and the log-determinant of a product is the sum of log-determinants:
$$\log\left|\det \mathbf{J}_f\right| = \sum_{k=1}^{K} \log\left|\det \mathbf{J}_{f_k}\right|.$$
Each layer contributes independently. This additivity is what makes deep flows feasible. You design one layer whose Jacobian is cheap, and then you just stack them. There is no quadratic blow-up in the cost of likelihood evaluation, only a linear one in the number of layers.
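To make the budget concrete, here is a minimal numeric sketch: a stack of toy 1-D affine layers with illustrative, hand-picked parameters, pushed onto a standard Gaussian. Because the composed map is itself affine, the flow's log-likelihood can be checked against a closed-form Gaussian density.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D flow: each layer is f_k(z) = a_k * z + b_k, so each layer's
# Jacobian "determinant" is just a_k. Parameters are illustrative.
layers = [(2.0, 1.0), (0.5, -3.0), (1.5, 0.25)]  # (a_k, b_k)

def flow_log_prob(x):
    """log p_X(x) = log p_Z(z) - sum_k log|a_k|, with z = f^{-1}(x)."""
    z, log_det = x, 0.0
    for a, b in reversed(layers):   # invert the stack, last layer first
        z = (z - b) / a             # exact inverse of one affine layer
        log_det += np.log(abs(a))   # log-determinants add across layers
    return norm.logpdf(z) - log_det

# The composition is x = A*z + B, so X ~ N(B, A^2) in closed form.
A = np.prod([a for a, _ in layers])
B = 0.0
for a, b in layers:
    B = a * B + b

x = 1.7
assert np.isclose(flow_log_prob(x), norm.logpdf(x, loc=B, scale=abs(A)))
```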
The remaining question is engineering. A general invertible neural network does not have a cheap Jacobian; a fully connected layer with a $d \times d$ weight matrix forces an $O(d^3)$ determinant for the linear part alone, before any nonlinearity. The next two subsections describe the dominant trick for reducing that cost to $O(d)$: making the Jacobian triangular by construction. A triangular matrix has a determinant equal to the product of its diagonal entries, so the log-determinant is just the sum of $d$ scalar logarithms, read off the diagonal in linear time. Coupling layers and autoregressive layers are two different ways of arranging the same triangular structure.
Coupling layers (RealNVP)
Coupling layers, introduced by Dinh, Krueger and Bengio in NICE (2014) and extended to the affine form by Dinh, Sohl-Dickstein and Bengio in RealNVP (2017), give the cleanest construction. Split the input into two halves $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)$, for example along the channel dimension of an image tensor. The layer leaves the first half untouched and transforms the second half by an affine map whose parameters are arbitrary functions of the first half:
$$\mathbf{y}_1 = \mathbf{x}_1, \qquad \mathbf{y}_2 = \mathbf{x}_2 \odot \exp\bigl(s(\mathbf{x}_1)\bigr) + t(\mathbf{x}_1),$$
where $\odot$ denotes elementwise multiplication, and $s$ and $t$ are arbitrary neural networks, typically a residual ConvNet, mapping $\mathbf{x}_1$ to scale and translation parameters of the same dimension as $\mathbf{x}_2$. The layer is invertible by inspection. Given the output $(\mathbf{y}_1, \mathbf{y}_2)$, recover $\mathbf{x}_1 = \mathbf{y}_1$ and then
$$\mathbf{x}_2 = \bigl(\mathbf{y}_2 - t(\mathbf{y}_1)\bigr) \oslash \exp\bigl(s(\mathbf{y}_1)\bigr),$$
where $\oslash$ is elementwise division. Crucially, $s$ and $t$ are evaluated at $\mathbf{y}_1$, which equals $\mathbf{x}_1$, so neither network needs to be itself invertible. They can be as expressive as any standard deep network.
The Jacobian has the block-triangular form
$$\mathbf{J} = \begin{pmatrix} \mathbf{I} & \mathbf{0} \\ \dfrac{\partial \mathbf{y}_2}{\partial \mathbf{x}_1} & \mathrm{diag}\bigl(\exp(s(\mathbf{x}_1))\bigr) \end{pmatrix}.$$
The determinant of a block-triangular matrix is the product of the determinants of the diagonal blocks, so $\det \mathbf{J} = \prod_i \exp(s_i(\mathbf{x}_1))$, and the log-determinant is the simple sum $\sum_i s_i(\mathbf{x}_1)$. The off-diagonal block $\partial \mathbf{y}_2 / \partial \mathbf{x}_1$, which captures how the scale and shift networks respond to changes in $\mathbf{x}_1$, never enters the determinant, though it contributes to gradients during training, which standard automatic differentiation handles correctly.
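In code the whole layer is short. The sketch below, assuming flat input vectors and a small MLP standing in for the arbitrary networks $s$ and $t$ (the architecture and sizes are illustrative), shows the forward pass with its log-determinant and the closed-form inverse.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Sketch of one RealNVP-style affine coupling layer on flat vectors;
    image flows would use ConvNets and spatial/channel masking instead."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        # Stand-in for the arbitrary networks s and t: maps x1 to a
        # scale and a shift for x2, concatenated along the last axis.
        self.st_net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.st_net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(s) + t       # affine transform of x2
        log_det = s.sum(dim=1)           # log|det J| = sum_i s_i(x1)
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.st_net(y1).chunk(2, dim=1)  # same nets, same inputs
        x2 = (y2 - t) * torch.exp(-s)           # exact inverse, no solver
        return torch.cat([y1, x2], dim=1)
```

Both directions evaluate `st_net` once and at the same argument, which is why neither $s$ nor $t$ ever needs to be invertible.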
A single coupling layer leaves half of its inputs frozen, which is too restrictive on its own. The cure is to alternate the partition between layers, so that variables that were untouched by one layer become the active half in the next. RealNVP uses checkerboard masking on spatial dimensions in early layers and channelwise masking after squeeze operations. Glow (Kingma & Dhariwal, 2018) replaced the fixed permutation between coupling layers with an invertible $1 \times 1$ convolution, a learnable channel mixing whose log-determinant is the log-determinant of a $C \times C$ matrix, cheap to compute. Glow also added ActNorm, a per-channel affine normalisation initialised so that the activations on the first batch have zero mean and unit variance, and pushed flows to $256 \times 256$ celebrity faces with linear arithmetic on attribute vectors in latent space, the same kind of "smiling minus neutral" semantics that GANs had been showing for years.
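The cost of the invertible $1 \times 1$ convolution deserves one line of arithmetic. It applies the same $C \times C$ matrix $\mathbf{W}$ at every spatial position of an $H \times W$ feature map, so

$$\log\left|\det \mathbf{J}\right| = H \cdot W \cdot \log\left|\det \mathbf{W}\right|,$$

an $O(C^3)$ computation whose cost does not grow with image resolution.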
The clean separation between the trick that buys cheap Jacobians and the trick that buys expressiveness is what makes coupling-based flows attractive teaching examples. The Jacobian is triangular by construction; the network inside $s$ and $t$ can be anything. Stacking many such layers, alternating which half is updated, gives an arbitrarily deep flow whose log-likelihood evaluation and sampling cost the same, and both are a single forward pass.
Autoregressive flows
Autoregressive flows reach the same triangular Jacobian by a different route. Order the dimensions $1, 2, \ldots, d$, and let each output coordinate depend only on the previous input coordinates. The Masked Autoregressive Flow (MAF) of Papamakarios, Pavlakou and Murray (2017) uses
$$y_i = \frac{x_i - \mu_i(\mathbf{x}_{\lt i})}{\sigma_i(\mathbf{x}_{\lt i})},$$
where $\mu_i$ and $\sigma_i$ are scalar functions of all coordinates with index strictly less than $i$. Implementing $\mu$ and $\sigma$ with a MADE-style masked MLP gives all $d$ pairs in a single forward pass. The Jacobian is lower triangular because $\partial y_i / \partial x_j = 0$ for $j > i$, and its determinant is $\prod_i 1/\sigma_i$. Log-likelihood evaluation is therefore one parallel forward pass, fast. Sampling, however, must be done sequentially, because $y_1$ depends on $x_1$, $y_2$ depends on $x_1$ and $x_2$, and so on; you cannot generate $x_i$ until $x_{\lt i}$ is fixed. MAF is fast at density evaluation and slow at sampling.
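The duality is easiest to see in code. In the sketch below, the masked network is collapsed into a hypothetical `cond` function that returns $\mu$ and $\log\sigma$ for all dimensions in one pass while letting position $i$ depend only on inputs with index strictly less than $i$ (a MADE-style masked MLP would provide exactly this property).

```python
import torch

def density_pass(x, cond):
    """MAF density direction: one parallel pass over all dimensions."""
    mu, log_sigma = cond(x)                # masked net: mu_i, log sigma_i
    z = (x - mu) * torch.exp(-log_sigma)   # y_i = (x_i - mu_i) / sigma_i
    log_det = -log_sigma.sum(dim=1)        # log|det J| = -sum_i log sigma_i
    return z, log_det

def sample(z, cond):
    """MAF sampling direction: d sequential steps, x_i needs x_{<i}."""
    x = torch.zeros_like(z)
    for i in range(z.shape[1]):
        mu, log_sigma = cond(x)            # only the x_{<i} entries matter
        x[:, i] = mu[:, i] + z[:, i] * torch.exp(log_sigma[:, i])
    return x
```

The parallel pass is why MAF density evaluation is cheap; the sequential loop is why its sampling is not.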
The Inverse Autoregressive Flow (IAF) of Kingma et al. (2016) reverses the roles. The same affine structure is used, but parametrised so that $x_i$ is computed from $y_i$ and the previous $\mathbf{y}_{\lt i}$ rather than the other way round. Sampling is now one parallel pass: given a base sample $\mathbf{z}$, all $\mathbf{x}_i$ can be produced in parallel. Density evaluation on a fresh data point, however, requires inverting the autoregression sequentially. IAF is fast at sampling and slow at density.
This duality is not just an academic curiosity. Each direction has a natural use. IAFs are the standard tool for flexible variational posteriors $q(\mathbf{z} \mid \mathbf{x})$ in VAEs, where you draw one sample per data point during training and never need to evaluate density on arbitrary $\mathbf{z}$. MAFs are the tool of choice for density estimation, where you fit a model on a dataset and then evaluate likelihoods of held-out points in parallel. Neural Spline Flows (Durkan et al., 2019) replace the affine transform with a piecewise rational-quadratic spline whose monotonicity guarantees invertibility while giving each dimension much greater expressiveness, at the cost of a slightly more elaborate parameter head. WaveNet's autoregressive structure has the same flavour but uses a discrete mixture-of-logistics output and is not, strictly, a normalising flow on a continuous density.
The shared idea across coupling and autoregressive flows is that both impose a strict ordering on dependencies (coupling splits, autoregression sorts), and both end up with a triangular Jacobian whose determinant collapses to a sum of scalar logs.
Continuous flows (Neural ODE)
Chen, Rubanova, Bettencourt and Duvenaud (2018) replaced the discrete stack of layers with a continuous-time ordinary differential equation,
$$\frac{d\mathbf{x}(t)}{dt} = v_\theta\bigl(\mathbf{x}(t), t\bigr), \qquad \mathbf{x}(0) = \mathbf{z}, \quad \mathbf{x}(1) = \mathbf{x}.$$
The map from $\mathbf{z}$ to $\mathbf{x}$ is the solution of this ODE, computed by a numerical integrator such as Runge-Kutta. Because the trajectory is continuous, the change-of-variables formula takes its instantaneous form,
$$\frac{d \log p\bigl(\mathbf{x}(t)\bigr)}{dt} = -\mathrm{tr}\!\left(\frac{\partial v_\theta}{\partial \mathbf{x}}\right),$$
and the total log-determinant becomes the time integral of the trace of the Jacobian. The trace can be evaluated stochastically by Hutchinson's estimator: sample $\boldsymbol{\epsilon}$ with zero mean and identity covariance and use $\mathrm{tr}(\mathbf{A}) = \mathbb{E}[\boldsymbol{\epsilon}^\top \mathbf{A} \boldsymbol{\epsilon}]$, which costs one vector-Jacobian product per sample, the same price as one ordinary backpropagation step. The architectural constraint disappears: $v_\theta$ can be any neural network, with no triangular structure required.
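A minimal PyTorch sketch of the estimator, assuming `v_fn` maps a batch of states to a batch of velocities at some fixed time (the function name and signature are illustrative):

```python
import torch

def hutchinson_trace(v_fn, x, num_samples=1):
    """Estimate tr(dv/dx) per example via E[eps^T (dv/dx) eps]."""
    x = x.detach().requires_grad_(True)
    out = v_fn(x)                            # v_theta(x(t), t) at fixed t
    est = 0.0
    for _ in range(num_samples):
        eps = torch.randn_like(x)            # E[eps eps^T] = I
        # eps^T J via one reverse-mode pass: a vector-Jacobian product,
        # the same price as one ordinary backpropagation step.
        vjp, = torch.autograd.grad(out, x, grad_outputs=eps,
                                   retain_graph=True)
        est = est + (vjp * eps).sum(dim=-1)  # eps^T J eps
    return est / num_samples
```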
The price is computational. Each training step integrates an ODE over $[0, 1]$, taking many small steps with adaptive step-size control to keep the integrator's local error in check, and gradients flow through every step. Inference, too, requires solving an ODE. This is much slower in wall-clock terms than a fixed stack of coupling layers. Continuous flows are also vulnerable to stiffness; if $v_\theta$ produces large local Lipschitz constants, the integrator must take tiny steps to remain stable.
Flow matching (Lipman et al., 2023) cured most of these pains and, in doing so, dissolved much of the boundary between flows and diffusion. Instead of integrating the ODE during training, flow matching defines a target velocity field, a simple closed-form interpolation between $\mathbf{z}$ and $\mathbf{x}$, and regresses $v_\theta$ on it directly with an MSE loss. Training is now as cheap as denoising regression: one forward pass, one mean-squared-error gradient. Sampling still requires ODE integration, but with a target field that has been chosen to be straight, the integrator can take large steps. Stable Diffusion 3, Flux and several recent video models are flow-matching models in everything but name. The continuous flow lineage, originally a slow research curiosity, has thereby become one of the dominant generative paradigms of the mid-2020s.
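A sketch of the simplest linear-interpolation variant (the rectified-flow / conditional-OT special case rather than the full construction of Lipman et al.), with an Euler sampler whose step count is illustrative:

```python
import torch

def flow_matching_loss(v_theta, x):
    """One training step: regress v_theta on the constant velocity of
    the straight path from a base sample z to a data sample x."""
    z = torch.randn_like(x)                 # base sample
    t = torch.rand(x.shape[0], 1)           # one random time per example
    x_t = (1 - t) * z + t * x               # point on the straight path
    target = x - z                          # its (constant) velocity
    return ((v_theta(x_t, t) - target) ** 2).mean()

@torch.no_grad()
def sample(v_theta, z, steps=8):
    """Integrate dx/dt = v_theta from t=0 to t=1 with plain Euler steps;
    a near-straight learned field tolerates very few of them."""
    x, dt = z, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * v_theta(x, t)
    return x
```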
Where flows are used
The unique selling point of a normalising flow is exact density. Anywhere a downstream task needs $\log p(\mathbf{x})$ rather than samples, a flow is the natural tool.
- Density estimation and anomaly detection: when the question is "how unusual is this point?", for cybersecurity logs, fraud detection, or out-of-distribution detection in deployed models, a flow gives a direct, calibrated likelihood. GANs and diffusion models give samples, not densities.
- Variational posteriors: IAFs let a VAE replace the diagonal-Gaussian posterior with something far more flexible, tightening the evidence lower bound without breaking the reparameterisation trick.
- Lossless compression: an invertible neural network is, by construction, a bijection between data and latent codes. Combined with arithmetic coding, this gives lossless compressors that beat PNG and FLAC on benchmark sets.
- Particle physics and lattice QCD: Monte Carlo simulations in fundamental physics need to draw samples from distributions known up to a normalising constant, with controlled sample weights. Flows used as proposal distributions in Markov chain Monte Carlo dramatically reduce autocorrelation times. The MIT–DeepMind collaboration on lattice gauge theory is the canonical example.
- Molecular conformations and protein structures: equivariant flows (the Boltzmann generators of Noé et al., 2019) draw independent samples whose density is known exactly, so they can be reweighted to the molecular Boltzmann distribution, something otherwise hopeless without long molecular-dynamics runs.
What flows are not used for, mostly, is open-ended image and video generation in production. The invertibility constraint is restrictive. A flow cannot squash information away, so it must model nuisance variation (backgrounds, sensor noise, JPEG artefacts) with the same care it gives to the foreground content people actually want. GANs and diffusion can route around this by working in a learned compressed latent space (latent diffusion, §14.12) and so achieve much higher visual fidelity per unit of compute. Discrete coupling and autoregressive flows have largely been displaced from frontier image models. The one surviving lineage is continuous flows reborn as flow matching, which now sits at the heart of state-of-the-art text-to-image and text-to-video systems and is converging with score-based diffusion (§14.13) into a common framework.
What you should take away
- A normalising flow is a sequence of invertible neural-network layers applied to a simple base distribution. The change-of-variables formula gives an exact log-likelihood with no lower-bound slack and no partition function to estimate.
- Practical flows make the Jacobian triangular by construction, reducing the determinant to a product of $d$ diagonal entries, $O(d)$ time. Coupling layers and autoregressive layers are two ways of imposing this structure.
- Coupling layers (RealNVP, Glow) split the input into two halves, leave one half fixed and apply an affine transform to the other. Both forward and inverse cost a single neural-network forward pass, so sampling and density evaluation are equally cheap.
- Autoregressive flows trade the two directions: MAFs are fast at density and slow at sampling, IAFs the reverse. Pick the direction by use case: IAFs for variational posteriors, MAFs for density estimation.
- Continuous flows replace the layer stack with a Neural ODE; flow matching trains them by direct regression on a target velocity field and is now the dominant continuous-time generative recipe. The lineage that once produced RealNVP has, via flow matching, become inseparable from modern diffusion.