Neural Radiance Fields (NeRF) are a method for representing 3D scenes as the parameters of a small multilayer perceptron, introduced in the ECCV 2020 paper "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis" by Ben Mildenhall, Pratul Srinivasan, Matthew Tancik, Jonathan Barron, Ravi Ramamoorthi and Ren Ng. NeRF launched the modern era of neural 3D scene representation and received a Best Paper Honorable Mention at ECCV 2020.
Representation. A NeRF is an MLP $F_\theta$ that maps a 5D coordinate, a 3D position $\mathbf{x} = (x, y, z)$ together with a 2D viewing direction $\mathbf{d} = (\theta, \phi)$, to an RGB colour $\mathbf{c}$ and a volume density $\sigma$:
$$F_\theta : (\mathbf{x}, \mathbf{d}) \mapsto (\mathbf{c}, \sigma).$$
The density $\sigma$ depends only on $\mathbf{x}$; the colour $\mathbf{c}$ depends on both, allowing view-dependent effects like specularities. The MLP is small (8 hidden layers, 256 units) but the input is positionally encoded with sinusoidal features at multiple frequencies:
$$\gamma(p) = \left( \sin(2^0 \pi p), \cos(2^0 \pi p), \ldots, \sin(2^{L-1} \pi p), \cos(2^{L-1} \pi p) \right)$$
which lets the MLP fit high-frequency detail despite its small size, echoing the sinusoidal position encodings already used in transformers.
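To make this concrete, here is a minimal PyTorch sketch of the positional encoding and the density/colour split. It is a sketch only: the paper's 8-layer trunk with a skip connection is shortened, the head sizes are simplified, and the viewing direction is taken as a 3D unit vector (as in the released code) rather than the angles $(\theta, \phi)$.

```python
import math

import torch
import torch.nn as nn


def positional_encoding(p: torch.Tensor, L: int) -> torch.Tensor:
    """gamma(p): sin/cos features at frequencies 2^0 pi ... 2^(L-1) pi,
    applied to each coordinate (sines and cosines grouped, not interleaved)."""
    freqs = (2.0 ** torch.arange(L, device=p.device)) * math.pi   # (L,)
    angles = p[..., None] * freqs                                 # (..., dim, L)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)


class TinyNeRF(nn.Module):
    """Reduced F_theta: density from position only, colour from position and direction."""

    def __init__(self, L_x: int = 10, L_d: int = 4, width: int = 256):
        super().__init__()
        self.L_x, self.L_d = L_x, L_d
        in_x, in_d = 3 * 2 * L_x, 3 * 2 * L_d
        self.trunk = nn.Sequential(                # paper: 8 layers of 256 units
            nn.Linear(in_x, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        self.sigma_head = nn.Linear(width, 1)      # sees position features only
        self.colour_head = nn.Sequential(          # also sees the encoded direction
            nn.Linear(width + in_d, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid(),   # RGB in [0, 1]
        )

    def forward(self, x: torch.Tensor, d: torch.Tensor):
        h = self.trunk(positional_encoding(x, self.L_x))
        sigma = torch.relu(self.sigma_head(h)).squeeze(-1)   # density >= 0
        c = self.colour_head(torch.cat([h, positional_encoding(d, self.L_d)], dim=-1))
        return c, sigma
```

The structural point is that `sigma_head` never sees the viewing direction, so geometry cannot change with viewpoint; only the colour head can express specular effects.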
Volume rendering. To render a pixel, a camera ray $\mathbf{r}(t) = \mathbf{o} + t \mathbf{d}$ is sampled at $N$ depths $t_i$; the expected colour is
$$\hat{C}(\mathbf{r}) = \sum_{i=1}^{N} T_i (1 - e^{-\sigma_i \delta_i}) \mathbf{c}_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$$
where $\delta_i = t_{i+1} - t_i$ and $T_i$ is the transmittance from the camera to sample $i$. This is the standard absorption–emission volume rendering integral, discretised by quadrature.
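This quadrature translates directly into a few tensor operations. A sketch, assuming batched per-ray densities, colours and depths from the network; padding the final interval with a large value follows common open-source NeRF implementations:

```python
import torch


def render_rays(sigma: torch.Tensor, colour: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Quadrature of the absorption-emission integral.

    sigma:  (R, N)     densities at N samples per ray
    colour: (R, N, 3)  RGB at each sample
    t:      (R, N)     sample depths along each ray
    Returns (R, 3) expected pixel colours C_hat.
    """
    delta = t[..., 1:] - t[..., :-1]                          # delta_i = t_{i+1} - t_i
    delta = torch.cat([delta, torch.full_like(delta[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma * delta)                   # 1 - e^{-sigma_i delta_i}
    # T_i = exp(-sum_{j<i} sigma_j delta_j), computed as a shifted cumulative
    # product so that T_1 = 1 at the first sample.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = trans * alpha                                   # per-sample contribution
    return (weights[..., None] * colour).sum(dim=-2)          # sum_i T_i alpha_i c_i
```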
Training. Given a set of posed photographs of a scene, NeRF is trained by minimising the squared error between rendered and ground-truth pixels:
$$\mathcal{L} = \sum_{\mathbf{r}} \left\| \hat{C}(\mathbf{r}) - C^*(\mathbf{r}) \right\|_2^2.$$
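A sketch of one optimisation step under these definitions, assuming the `TinyNeRF` and `render_rays` helpers sketched above; ray generation from camera poses, the stratified jitter within depth bins, and the paper's hierarchical coarse-to-fine sampling are all omitted:

```python
import torch


def train_step(model, optimiser, rays_o, rays_d, target_rgb,
               near=2.0, far=6.0, n_samples=64):
    """One gradient step on the photometric loss for a batch of R rays.

    rays_o, rays_d: (R, 3) ray origins and (unit) directions
    target_rgb:     (R, 3) ground-truth pixel colours C*
    """
    R = rays_o.shape[0]
    # Evenly spaced depths t_i in [near, far] (the paper jitters these within bins).
    t = torch.linspace(near, far, n_samples, device=rays_o.device).expand(R, -1)
    pts = rays_o[:, None, :] + t[..., None] * rays_d[:, None, :]   # r(t) = o + t d, (R, N, 3)
    dirs = rays_d[:, None, :].expand_as(pts)                       # same direction at every sample
    colour, sigma = model(pts.reshape(-1, 3), dirs.reshape(-1, 3))
    pred = render_rays(sigma.reshape(R, -1), colour.reshape(R, -1, 3), t)
    loss = ((pred - target_rgb) ** 2).sum(dim=-1).mean()           # squared error per ray, averaged
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```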
Training on a single scene takes 1–2 days on a single GPU and yields a $\sim$5 MB MLP that renders novel views in $\sim$30 seconds per frame.
Limitations and successors. Vanilla NeRF is slow to train and slow to render. InstantNGP accelerated training from days to seconds–minutes with a multiresolution hash-grid encoding; Mip-NeRF reduced aliasing by rendering cones with an integrated positional encoding; NeRF in the Wild handled unconstrained photo collections; 3D Gaussian Splatting eventually displaced NeRF in many real-time applications by replacing the MLP with explicit 3D Gaussians.
Significance. NeRF demonstrated that implicit neural representations could match or exceed traditional 3D reconstruction (multi-view stereo, structure from motion) in quality, while being differentiable end-to-end. This insight has been exported far beyond view synthesis, into robotics (object models), medical imaging (CT reconstruction), and generative 3D (DreamFusion, Magic3D).
Related terms: Gaussian Splatting, InstantNGP
Discussed in:
- Chapter 11: CNNs, 3D Scene Representation