The Neural Tangent Kernel (NTK) was introduced by Jacot, Gabriel and Hongler (2018) to describe the training dynamics of wide neural networks. For a network $f(x; \theta)$ with parameters $\theta$, the NTK at parameters $\theta$ is
$$\Theta(x, x'; \theta) = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta).$$
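As a concrete illustration, here is a minimal sketch (Python with NumPy; the variable names are our own, not from the original paper) of the empirical NTK for a two-layer ReLU network in NTK parameterisation:

```python
# Empirical NTK of a two-layer ReLU network in NTK parameterisation:
# O(1) weight entries, with the 1/sqrt(width) scalings in the forward pass.
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4096                        # input dimension, hidden width

def init_params():
    return rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_f(x, params):
    """Flattened gradient of f(x) = w2 . relu(W1 x / sqrt(d)) / sqrt(n)."""
    W1, w2 = params
    pre = W1 @ x / np.sqrt(d)
    g2 = np.maximum(pre, 0.0) / np.sqrt(n)             # df/dw2
    g1 = np.outer(w2 * (pre > 0), x) / np.sqrt(n * d)  # df/dW1 via the ReLU gate
    return np.concatenate([g1.ravel(), g2])

def empirical_ntk(x, xp, params):
    return grad_f(x, params) @ grad_f(xp, params)

params = init_params()
x, xp = rng.standard_normal(d), rng.standard_normal(d)
print(empirical_ntk(x, xp, params))  # concentrates around Theta_inf(x, xp) as n grows
```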
Under NTK parameterisation, where weights $W^{(\ell)}_{ij}$ are scaled by $1/\sqrt{n_\ell}$ with $n_\ell$ the width of layer $\ell$, two remarkable facts hold as the widths $n_\ell \to \infty$:
Convergence at initialisation. $\Theta(x, x'; \theta_0) \to \Theta_\infty(x, x')$ in probability, where $\Theta_\infty$ is a deterministic kernel computable in closed form layer-by-layer.
Constancy during training. For finite training time and sufficiently wide networks, $\Theta(x, x'; \theta_t) \approx \Theta(x, x'; \theta_0)$ throughout gradient descent.
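The constancy claim can be checked numerically in an illustrative way, reusing `grad_f` and `empirical_ntk` from the sketch above: run a few passes of per-example gradient descent on a tiny regression set and measure how far one kernel entry moves; the drift shrinks as the width `n` grows.

```python
# Illustrative constancy check (depends on the sketch above).
X = [rng.standard_normal(d) for _ in range(8)]
y = rng.standard_normal(8)

W1, w2 = init_params()
k0 = empirical_ntk(X[0], X[1], (W1, w2))
for _ in range(100):
    for xi, yi in zip(X, y):
        # per-example gradient of the squared loss (f - y)^2 / 2
        fi = w2 @ np.maximum(W1 @ xi / np.sqrt(d), 0.0) / np.sqrt(n)
        g = grad_f(xi, (W1, w2)) * (fi - yi)
        W1 -= 0.1 * g[: n * d].reshape(n, d)
        w2 -= 0.1 * g[n * d :]
kt = empirical_ntk(X[0], X[1], (W1, w2))
print(f"relative kernel drift: {abs(kt - k0) / abs(k0):.2e}")  # small for large n
```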
Linearised dynamics. Together these imply that gradient flow on the squared loss yields
$$\frac{d f(x; \theta_t)}{dt} = -\Theta_\infty(x, X) \big(f(X; \theta_t) - y\big),$$
a linear ODE in function space. The solution is
$$f(x; \theta_t) = \Theta_\infty(x, X) \Theta_\infty(X, X)^{-1} \big(I - e^{-\Theta_\infty(X, X) t}\big) y$$
plus a residual from initialisation. As $t \to \infty$ the network converges to kernel regression with kernel $\Theta_\infty$:
$$f^*(x) = \Theta_\infty(x, X) \Theta_\infty(X, X)^{-1} y.$$
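A minimal sketch of these dynamics, assuming some kernel function `k(x, x')` is available (the ReLU recursion below is one choice): diagonalising $\Theta_\infty(X, X)$ gives both the finite-time solution and its $t \to \infty$ kernel-regression limit, taking zero output at initialisation so the residual term vanishes.

```python
# Linearised-dynamics prediction at time t, and its t -> infinity limit
# (kernel regression). Assumes a kernel function k(x, x') and zero output
# at initialisation, so the residual term above vanishes.
import numpy as np

def ntk_predict(k, X, y, x_test, t=np.inf):
    K = np.array([[k(a, b) for b in X] for a in X])   # Theta_inf(X, X)
    k_star = np.array([k(x_test, b) for b in X])      # Theta_inf(x, X)
    lam, U = np.linalg.eigh(K)                        # K = U diag(lam) U^T
    # K^{-1}(I - e^{-K t}) applied in the eigenbasis; at t = inf it is K^{-1}
    decay = np.ones_like(lam) if np.isinf(t) else 1.0 - np.exp(-lam * t)
    return k_star @ (U @ ((decay / lam) * (U.T @ y)))
```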
Closed form. For a fully-connected ReLU network of depth $D$, the NTK can be computed by a recursive formula involving the angle $\theta^{(\ell)}(x, x')$ between hidden representations. The recursion alternates between an arc-cosine kernel update for the activation and a sum that accumulates contributions from each layer's gradient.
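A sketch of that recursion for a bias-free ReLU network, assuming the standard He-style variance normalisation (conventions for input scaling and biases vary across papers):

```python
import numpy as np

def relu_ntk(x, xp, depth):
    """Infinite-width NTK of a depth-`depth` fully-connected ReLU network
    (no biases, He-style variance), via the arc-cosine recursion."""
    sig_xx, sig_pp = x @ x, xp @ xp   # diagonals are preserved layer to layer
    sig_xp = x @ xp                   # Sigma^(0)(x, x')
    theta_k = sig_xp                  # Theta^(0) = Sigma^(0)
    for _ in range(depth):
        cos_t = np.clip(sig_xp / np.sqrt(sig_xx * sig_pp), -1.0, 1.0)
        t = np.arccos(cos_t)          # angle between hidden representations
        # arc-cosine kernel update for the ReLU activation
        sig_xp = np.sqrt(sig_xx * sig_pp) * (np.sin(t) + (np.pi - t) * cos_t) / np.pi
        sig_dot = (np.pi - t) / np.pi  # derivative kernel E[1{u>0} 1{v>0}] * 2
        # accumulate this layer's gradient contribution
        theta_k = theta_k * sig_dot + sig_xp
    return theta_k
```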
Why it matters. The NTK gave the first rigorous analytical handle on training dynamics of deep networks. It explains:
- Convergence. Gradient descent on overparameterised networks converges to zero training loss at a linear rate, because $\Theta_\infty(X, X)$ is positive definite for distinct inputs.
- Generalisation. In the NTK regime, generalisation reduces to that of kernel regression, which is well understood.
- Spectral bias. The eigenvalues of $\Theta_\infty(X, X)$ govern which functions are learned first; low-frequency components dominate, explaining empirical observations that networks fit smooth functions before high-frequency noise.
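To make the spectral-bias point concrete, here is a toy demonstration reusing `relu_ntk` from the sketch above on unit-circle inputs: under linearised dynamics the residual along eigenvector $u_i$ of $\Theta_\infty(X, X)$ decays as $e^{-\lambda_i t}$, so top eigendirections (typically low-frequency) are fit first.

```python
# Toy spectral-bias demonstration (depends on relu_ntk above).
import numpy as np

phi = np.linspace(0, 2 * np.pi, 64, endpoint=False)
X = np.stack([np.cos(phi), np.sin(phi)], axis=1)    # unit-circle inputs
K = np.array([[relu_ntk(a, b, depth=3) for b in X] for a in X])
lam, U = np.linalg.eigh(K)
y = np.sin(phi) + 0.3 * np.sin(10 * phi)            # low + high frequency

for t in [0.1, 1.0, 10.0]:
    coeff = np.exp(-lam * t) * (U.T @ y)            # e^{-K t} y in the eigenbasis
    print(f"t={t:5.1f}  residual norm = {np.linalg.norm(coeff):.3f}")
```

The low-frequency component of `y` is captured at small `t`, while the high-frequency component persists until much later.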
Limitations. The NTK regime corresponds to lazy training, where features barely move from initialisation. Real networks operate in a feature-learning regime where representations evolve substantially. The Maximal Update ($\mu$P) parameterisation of Yang and Hu (2021), which generalises the mean-field limit of two-layer networks, instead retains feature learning at infinite width, giving a different and arguably more relevant theory. Empirically, finite-width networks outperform kernel regression with their own NTK on most tasks, indicating that feature learning provides genuine inductive bias beyond kernel methods.
Despite these limitations, the NTK remains a cornerstone of deep learning theory and the starting point for most rigorous results on training dynamics.
Related terms: Gradient Descent, Universal Approximation Theorem, Implicit Regularisation, Double Descent
Discussed in:
- Chapter 6: ML Fundamentals, Theory of Deep Learning