Normalising Flow, Glossary, Textbook of AI

Normalising flows are a family of generative models that learn an exact, tractable density by transforming a simple base distribution $p_Z(\mathbf{z})$ (typically $\mathcal{N}(\mathbf{0}, \mathbf{I})$) into a complex data distribution $p_X(\mathbf{x})$ via an invertible, differentiable map $\mathbf{x} = f_\theta(\mathbf{z})$. Introduced for variational inference by Rezende and Mohamed (2015), they were extended to high-dimensional density estimation by RealNVP (Dinh et al., 2017) and Glow (Kingma & Dhariwal, 2018).

Change-of-variables formula. For any bijective, differentiable $f$,

$$p_X(\mathbf{x}) = p_Z(f^{-1}(\mathbf{x})) \cdot \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right|,$$

or equivalently in log-form,

$$\log p_X(\mathbf{x}) = \log p_Z(\mathbf{z}) + \log \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right|, \quad \mathbf{z} = f^{-1}(\mathbf{x}).$$
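As a quick sanity check of the log-form, the sketch below applies it to a one-dimensional affine flow $x = az + b$ with a standard normal base, where the answer is known in closed form to be $\mathcal{N}(b, a^2)$. The function names and the values of `a` and `b` are illustrative assumptions, not part of any library.

```python
import math

def log_standard_normal(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_log_prob(x, a, b):
    """log p_X(x) for the flow x = f(z) = a * z + b with z ~ N(0, 1)."""
    z = (x - b) / a                  # z = f^{-1}(x)
    log_det_inv = -math.log(abs(a))  # log |d f^{-1} / dx| = log |1 / a|
    return log_standard_normal(z) + log_det_inv

def normal_log_prob(x, mean, std):
    """Closed-form log-density of N(mean, std^2), used only as a reference."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

a, b = 2.0, -1.0  # illustrative flow parameters
for x in (-3.0, 0.0, 1.5):
    # The change-of-variables result should match the known Gaussian density.
    assert abs(affine_flow_log_prob(x, a, b) - normal_log_prob(x, b, abs(a))) < 1e-12
```

The only flow-specific work is the inverse map and the log-determinant term; the base density is reused unchanged.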

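Composing flows is straightforward: if $f = f_K \circ \cdots \circ f_1$, the inverse Jacobian factorises by the chain rule, so the log-determinant is a sum over layers, $\log \left| \det \frac{\partial f^{-1}}{\partial \mathbf{x}} \right| = \sum_{k=1}^{K} \log |\det J_{f_k^{-1}}|$, where each Jacobian is evaluated at the intermediate point reached by inverting the later layers $f_K, \ldots, f_{k+1}$.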
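The additive decomposition can be illustrated with two one-dimensional layers, $f_1(z) = az + b$ followed by $f_2(h) = e^{h}$, whose composition yields a log-normal distribution with a known density. This is a minimal sketch; the helper names are hypothetical and chosen only for this example.

```python
import math

def affine_inverse(h, a, b):
    """Invert f_1(z) = a * z + b; also return log|det| of the inverse Jacobian."""
    return (h - b) / a, -math.log(abs(a))

def exp_inverse(x):
    """Invert f_2(h) = exp(h) for x > 0; also return log|det| of the inverse Jacobian."""
    return math.log(x), -math.log(x)

def composed_log_prob(x, a, b):
    """log p_X(x) for x = f_2(f_1(z)) with z ~ N(0, 1)."""
    h, log_det_2 = exp_inverse(x)           # undo the last layer first
    z, log_det_1 = affine_inverse(h, a, b)  # then the earlier layer
    log_base = -0.5 * (z * z + math.log(2.0 * math.pi))
    return log_base + log_det_1 + log_det_2  # per-layer terms simply add

def log_normal_log_prob(x, mu, sigma):
    """Closed-form log-density of LogNormal(mu, sigma^2), used only as a reference."""
    return (-math.log(x) - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
            - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2))

a, b = 0.8, 0.3  # illustrative parameters
for x in (0.5, 1.0, 3.0):
    assert abs(composed_log_prob(x, a, b) - log_normal_log_prob(x, b, abs(a))) < 1e-12
```

Note that each layer's term is evaluated at the point produced by inverting the layers after it, which is why the inverses are applied in reverse order.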

Training. Maximum likelihood: minimise the negative log-likelihood of the data,

$$\mathcal{L}(\theta) = -\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\!\left[ \log p_Z(f_\theta^{-1}(\mathbf{x})) + \log\!\left|\det \frac{\partial f_\theta^{-1}}{\partial \mathbf{x}}\right| \right].$$

Unlike GAN training there is no min-max game, and unlike diffusion models there is no noise schedule to tune: a single likelihood objective is optimised directly.
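To make the objective concrete, here is a minimal sketch that fits a one-dimensional affine flow $x = e^{\log a} z + b$ by plain gradient descent on the mean negative log-likelihood. The synthetic data, learning rate, step count, and the hand-derived gradients are assumptions made for illustration; practical flows use automatic differentiation and far richer maps.

```python
import math
import random

def fit_affine_flow(data, lr=0.05, steps=3000):
    """Fit x = exp(log_a) * z + b to data by gradient descent on the mean NLL."""
    log_a, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        inv_a = math.exp(-log_a)
        zs = [(x - b) * inv_a for x in data]  # z = f^{-1}(x)
        # Per-sample NLL, up to a constant: 0.5 * z**2 + log_a
        grad_log_a = 1.0 - sum(z * z for z in zs) / n
        grad_b = -inv_a * sum(zs) / n
        log_a -= lr * grad_log_a
        b -= lr * grad_b
    return math.exp(log_a), b

random.seed(0)
data = [random.gauss(3.0, 0.5) for _ in range(500)]  # synthetic training set
a_hat, b_hat = fit_affine_flow(data)

# For this simple flow the maximum-likelihood solution is known:
# b is the sample mean and a is the (population) sample standard deviation.
sample_mean = sum(data) / len(data)
sample_std = math.sqrt(sum((x - sample_mean) ** 2 for x in data) / len(data))
print(a_hat, sample_std)
print(b_hat, sample_mean)
```

The loop has no discriminator and no noise levels; the only quantities involved are the inverse map and its log-determinant.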

The architectural challenge. $f_\theta$ must be expressive yet invertible with a tractable Jacobian. A general $d \times d$ Jacobian determinant costs $O(d^3)$ to evaluate, which is prohibitive for images, where $d$ is the number of pixel values. Flow architectures therefore use layers whose Jacobians are structured (for example triangular or diagonal), so that the determinant is cheap to compute.
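The cost gap can be seen in a small sketch: a general log-determinant needs elimination over the whole matrix, whereas a triangular Jacobian only needs its diagonal. Both functions below are illustrative helpers written for this example, not library routines.

```python
import math

def logabsdet_general(matrix):
    """log|det| by Gaussian elimination with partial pivoting: O(d^3) work."""
    a = [row[:] for row in matrix]  # work on a copy
    d = len(a)
    total = 0.0
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return float("-inf")  # singular matrix
        a[col], a[pivot] = a[pivot], a[col]
        total += math.log(abs(a[col][col]))
        for r in range(col + 1, d):
            factor = a[r][col] / a[col][col]
            for c in range(col, d):
                a[r][c] -= factor * a[col][c]
    return total

def logabsdet_triangular(matrix):
    """log|det| of a triangular matrix: sum of log|diagonal|, O(d) work."""
    return sum(math.log(abs(matrix[i][i])) for i in range(len(matrix)))

lower = [[2.0, 0.0, 0.0],
         [1.0, 0.5, 0.0],
         [-3.0, 4.0, 3.0]]

# Both routes agree on a triangular matrix; only the cost differs.
assert abs(logabsdet_general(lower) - logabsdet_triangular(lower)) < 1e-12
```

The off-diagonal entries of `lower` do not affect the result, which is why a layer can have rich dependencies below the diagonal at no cost to the determinant.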

Coupling layers (RealNVP). Split $\mathbf{z} = (\mathbf{z}_a, \mathbf{z}_b)$ and define

$$\mathbf{x}_a = \mathbf{z}_a, \qquad \mathbf{x}_b = \mathbf{z}_b \odot \exp(s(\mathbf{z}_a)) + t(\mathbf{z}_a),$$

where $s, t$ are arbitrary neural networks. The Jacobian is triangular with diagonal $(\mathbf{1}, \exp(s(\mathbf{z}_a)))$, giving

$$\log |\det J| = \sum_i s_i(\mathbf{z}_a),$$

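computable in $O(d)$. This is the log-determinant of the forward map $\mathbf{z} \mapsto \mathbf{x}$; the inverse map that appears in the likelihood has the negated value. Inversion solves the same expression for $\mathbf{z}_b$: $\mathbf{z}_a = \mathbf{x}_a$ and $\mathbf{z}_b = (\mathbf{x}_b - t(\mathbf{x}_a)) \odot \exp(-s(\mathbf{x}_a))$, so $s$ and $t$ themselves never need to be inverted.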
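A single affine coupling layer can be sketched in a few lines. Here `s_net` and `t_net` are hypothetical stand-ins for the neural networks $s$ and $t$ (any functions of $\mathbf{z}_a$ would do), and the input is split into two halves of size two purely for illustration.

```python
import math

def s_net(z_a):
    """Hypothetical scale network; any function of z_a is allowed."""
    return [math.tanh(z_a[0] - z_a[1]), 0.5 * math.tanh(z_a[0] * z_a[1])]

def t_net(z_a):
    """Hypothetical translation network."""
    return [z_a[0] + 2.0 * z_a[1], math.sin(z_a[0])]

def coupling_forward(z):
    """z -> x. Returns x and log|det J| of the forward map."""
    z_a, z_b = z[:2], z[2:]
    s, t = s_net(z_a), t_net(z_a)
    x_b = [zb * math.exp(si) + ti for zb, si, ti in zip(z_b, s, t)]
    return z_a + x_b, sum(s)

def coupling_inverse(x):
    """x -> z. Returns z and log|det J| of the inverse map."""
    x_a, x_b = x[:2], x[2:]
    # x_a equals z_a, so s and t are simply re-evaluated, never inverted.
    s, t = s_net(x_a), t_net(x_a)
    z_b = [(xb - ti) * math.exp(-si) for xb, si, ti in zip(x_b, s, t)]
    return x_a + z_b, -sum(s)

z = [0.3, -1.2, 0.7, 2.0]
x, log_det = coupling_forward(z)
z_rec, log_det_inv = coupling_inverse(x)

assert all(abs(u - v) < 1e-12 for u, v in zip(z, z_rec))  # exact round trip
assert abs(log_det + log_det_inv) < 1e-12                 # inverse log-det is the negation
```

Because the first half passes through unchanged, practical models alternate which half is transformed from layer to layer so that every dimension is eventually updated.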

Other flow families.

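Autoregressive flows (MAF, IAF). Each output dimension depends only on the preceding ones, so the Jacobian is triangular. One direction of the map runs in a single parallel pass while the other needs $d$ sequential steps: MAF evaluates densities in parallel but samples sequentially, and IAF does the reverse.
Continuous flows (Neural ODEs, FFJORD). Define $\mathbf{x}(t)$ via $d\mathbf{x}/dt = g_\theta(\mathbf{x}, t)$; the log-density then evolves as $d \log p(\mathbf{x}(t))/dt = -\operatorname{tr}\!\left(\partial g_\theta / \partial \mathbf{x}\right)$, so the log-determinant becomes a time integral of the Jacobian trace, which FFJORD estimates with Hutchinson's stochastic trace estimator.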
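Invertible $1 \times 1$ convolutions (Glow). Glow inserts learned invertible $1 \times 1$ convolutions between coupling layers to mix channels; they can be parameterised via an LU decomposition so that the log-determinant is cheap to compute.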
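The asymmetry between the two directions of an autoregressive flow can be sketched with an affine, MAF-style layer. The `conditioner` below is a hypothetical stand-in for a masked network that maps the preceding coordinates to a shift and log-scale; the code is written with plain loops for clarity, and the comments mark which loop could be parallelised.

```python
import math

def conditioner(prefix):
    """Hypothetical stand-in for a masked network: x_{<i} -> (mu_i, alpha_i)."""
    total = sum(prefix)
    return 0.5 * total, math.tanh(0.3 * total)

def to_latent(x):
    """Density direction x -> z. Every term depends only on the known x,
    so the d iterations are independent and could run in parallel."""
    z, log_det_inv = [], 0.0
    for i in range(len(x)):
        mu, alpha = conditioner(x[:i])
        z.append((x[i] - mu) * math.exp(-alpha))
        log_det_inv -= alpha  # triangular Jacobian: sum of log-diagonal terms
    return z, log_det_inv

def sample_from_latent(z):
    """Sampling direction z -> x. x_i needs x_{<i}, so the loop is sequential."""
    x = []
    for i in range(len(z)):
        mu, alpha = conditioner(x)  # uses the coordinates generated so far
        x.append(z[i] * math.exp(alpha) + mu)
    return x

def log_prob(x):
    """Exact log-density under a standard normal base distribution."""
    z, log_det_inv = to_latent(x)
    log_base = sum(-0.5 * (zi * zi + math.log(2.0 * math.pi)) for zi in z)
    return log_base + log_det_inv

z = [0.4, -1.0, 0.25]
x = sample_from_latent(z)
z_rec, _ = to_latent(x)
assert all(abs(u - v) < 1e-12 for u, v in zip(z, z_rec))  # the two directions invert each other
```

Swapping which direction is conditioned on known values gives the IAF-style variant, where sampling is the parallel direction and density evaluation of arbitrary data is sequential.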

Properties.

Exact log-likelihoods for evaluation and density estimation.
One-shot, parallel sampling: $\mathbf{x} = f_\theta(\mathbf{z})$ in a single forward pass.
The latent space is the same dimension as data, which limits compression.
Sample quality on images generally lags that of diffusion models and large autoregressive transformers, but flows remain attractive for tasks that need exact densities (anomaly detection, scientific simulation, conditional generation).

Normalising flows are also used in physics (lattice QCD) and reinforcement learning (expressive policies), and continuous flows are closely related to the probability-flow ODE view of score-based diffusion models.

Related terms: KL Divergence, Gaussian Distribution

Discussed in: Chapter 11: CNNs, Generative Models
