ResNet (Residual Network), introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun in 2015, is a CNN architecture that uses residual connections (identity shortcuts that bypass each block of layers) to enable training of very deep networks. The fundamental building block is the residual block:
$$y = F(x; \{W_i\}) + x$$
where $F$ is a small stack of layers (typically 2 or 3 conv-norm-ReLU layers) and the $+x$ term is the identity shortcut. Implementing the block amounts to computing $F(x)$, adding $x$, and applying a final ReLU.
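A minimal PyTorch sketch of this compute-$F$, add-$x$, apply-ReLU pattern (the block below keeps the channel count fixed for simplicity and is illustrative rather than the exact torchvision implementation; a projection shortcut would be needed when shapes change):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Basic two-layer residual block: y = ReLU(F(x) + x)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # F(x): conv -> BN -> ReLU -> conv -> BN
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # identity shortcut, then final ReLU
        return self.relu(residual + x)
```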
Why this works: the network learns the residual $F(x) = H(x) - x$ rather than the full mapping $H(x)$, which is often easier, particularly when $H$ is close to the identity. The identity shortcut also provides an unobstructed gradient path during backpropagation:
$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L}\!\left(1 + \frac{\partial F}{\partial x_l}\right)$$
The "$1+$" term means gradients flow back through the identity even if the residual gradient $\partial F / \partial x_l$ is small, eliminating the vanishing-gradient problem that limited pre-ResNet networks to ~30 layers.
ResNet variants at different depths:
- ResNet-18, -34: basic two-layer residual blocks.
- ResNet-50, -101, -152: bottleneck blocks (1×1 conv reduces channels, 3×3 conv processes, 1×1 conv expands back); see the sketch after this list.
- ResNet-200, -1001: extreme depths.
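A sketch of the bottleneck variant (the 4× expansion factor and projection shortcut follow the ResNet-50 convention; strides for downsampling are omitted, and the code is illustrative rather than the torchvision implementation):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block: 1x1 reduce -> 3x3 process -> 1x1 expand."""

    expansion = 4  # output channels = mid_channels * expansion

    def __init__(self, in_channels: int, mid_channels: int):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.branch = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),              # 1x1 reduce
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1, bias=False),  # 3x3 process
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False),             # 1x1 expand
            nn.BatchNorm2d(out_channels),
        )
        # Projection shortcut when input/output channel counts differ, identity otherwise.
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.branch(x) + self.shortcut(x))
```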
An ensemble of ResNets (up to 152 layers deep) won the ImageNet (ILSVRC) 2015 classification task with 3.57% top-5 error (below the commonly cited human estimate of ~5%).
Pre-activation ResNet (He et al., 2016) moves the normalisation and activation in front of the weight layers inside each residual branch, giving (in simplified form):
$$y = x + F(\mathrm{ReLU}(\mathrm{BN}(x)))$$
This gives even cleaner gradient flow and trains more reliably at extreme depth.
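A sketch of the pre-activation ordering (channel counts fixed for brevity, illustrative rather than the paper's full code); note that nothing follows the addition, which is what keeps the identity path clean:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual block: BN and ReLU precede each convolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(self.relu(self.bn1(x)))    # BN -> ReLU -> conv
        out = self.conv2(self.relu(self.bn2(out)))  # BN -> ReLU -> conv
        return x + out  # pure identity path: no ReLU after the addition
```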
Residual connections are now ubiquitous beyond CNNs: every Transformer block has two of them (one around attention, one around the FFN). The 2015 ResNet paper is one of the most-cited papers in computer science. Kaiming He has since produced further influential work such as MAE (2021), and residual blocks remain core components of later architectures such as the U-Nets used in DDPM.
Related terms: Residual Connection, Convolutional Neural Network, Kaiming He, Vanishing Gradient Problem, Transformer
Discussed in:
- Chapter 11: CNNs, CNNs in Vision