Recipe for the canonical ResNet-50 image classifier on ImageNet-1k (1.28M training images, 1000 classes). Optimises the cross-entropy loss
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{c=1}^{1000} y_{i,c}\log p_{i,c}$$
where $p_{i,c} = \mathrm{softmax}(f_\theta(x_i))_c$ and $y_{i,c}$ is the one-hot label. Reproduces He et al. (2016) at ~76.0% top-1 / 92.9% top-5.
Data pipeline.
- ImageNet-1k from ILSVRC2012 (~140GB). Decode JPEGs with Pillow-SIMD or DALI for throughput; on 8 GPUs, the data loader, not the GPU, is usually the bottleneck.
- Training augmentation per image: RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)) → RandomHorizontalFlip(p=0.5) → ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4) (optional).
- Normalise with mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225].
- Validation: Resize(256) → CenterCrop(224) → normalise.
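For reference, a minimal torchvision version of these pipelines (a DALI pipeline is analogous; drop ColorJitter for the plain-augmentation baseline):

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training: random crop/scale + flip (+ optional colour jitter), then normalise.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # optional
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Validation: deterministic resize + centre crop, then normalise.
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```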
Model. Standard ResNet with 50 layers: stem (7×7 conv, stride 2 → maxpool) →
4 stages of bottleneck blocks [3, 4, 6, 3] → global avg pool → linear(2048→1000).
Each bottleneck:
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_c, c, stride=1):
        super().__init__()
        # 1x1 reduce -> 3x3 (possibly strided) -> 1x1 expand, BN after each conv
        self.conv1 = nn.Conv2d(in_c, c, 1, bias=False); self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, stride, 1, bias=False); self.bn2 = nn.BatchNorm2d(c)
        self.conv3 = nn.Conv2d(c, c * 4, 1, bias=False); self.bn3 = nn.BatchNorm2d(c * 4)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut whenever the spatial size or channel count changes
        self.shortcut = (nn.Sequential(
            nn.Conv2d(in_c, c * 4, 1, stride, bias=False), nn.BatchNorm2d(c * 4)
        ) if stride != 1 or in_c != c * 4 else nn.Identity())

    def forward(self, x):
        r = self.shortcut(x)
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.relu(self.bn2(self.conv2(x)))
        x = self.bn3(self.conv3(x))
        return self.relu(x + r)
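One way the full network could be assembled from this block (a minimal stand-in for torchvision's resnet50, reusing the Bottleneck class and nn import above; stage widths and strides follow the [3, 4, 6, 3] layout):

```python
class ResNet50(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stem: 7x7/2 conv -> BN -> ReLU -> 3x3/2 max pool
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # Four stages of bottlenecks: [3, 4, 6, 3] blocks, widths 64/128/256/512
        self.stages = nn.Sequential(
            self._make_stage(64, 64, 3, stride=1),
            self._make_stage(256, 128, 4, stride=2),
            self._make_stage(512, 256, 6, stride=2),
            self._make_stage(1024, 512, 3, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    @staticmethod
    def _make_stage(in_c, c, n_blocks, stride):
        # First block downsamples (and projects); the rest keep shape.
        blocks = [Bottleneck(in_c, c, stride)]
        blocks += [Bottleneck(c * 4, c) for _ in range(n_blocks - 1)]
        return nn.Sequential(*blocks)

    def forward(self, x):
        x = self.pool(self.stages(self.stem(x)))
        return self.fc(x.flatten(1))
```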
Initialise the last BN in each residual block with $\gamma=0$ ("zero-init residual"); this gives a clean +0.2% top-1.
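One way to apply this, assuming the Bottleneck module above (torchvision's resnet50 exposes the same idea via zero_init_residual=True):

```python
def zero_init_residual(model: nn.Module) -> None:
    # Start each residual branch as the identity by zeroing the last BN's scale.
    for m in model.modules():
        if isinstance(m, Bottleneck):
            nn.init.zeros_(m.bn3.weight)
```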
Optimiser and schedule.
- SGD with Nesterov momentum $\mu=0.9$, weight decay $10^{-4}$ (no decay on BN $\gamma,\beta$ or biases).
- Linear warmup from 0 to 0.1 over the first 5 epochs, then step decay ×0.1 at epochs 30, 60, 80. Total 90 epochs.
- With batch size 256, base LR is 0.1. Use linear scaling: peak LR $= 0.1 \cdot B/256$.
- Mixed precision (torch.amp.autocast(dtype=torch.float16) + GradScaler).
- 8 GPUs, distributed via DistributedDataParallel, batch 32/GPU (global batch 256).
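The loop below calls two helpers, no_wd_groups and warmup_step_lr, that are not defined elsewhere in this recipe; one plausible implementation of each, matching the signatures used below (the 5005 steps/epoch default assumes 1.28M images at global batch 256):

```python
def no_wd_groups(model, weight_decay=1e-4):
    # Two parameter groups: decay on conv/linear weights only,
    # no decay on BatchNorm gamma/beta or biases (anything with ndim <= 1).
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.ndim <= 1 else decay).append(p)
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]


def warmup_step_lr(epoch, step, base, warmup_epochs, steps, steps_per_epoch=5005):
    # Fractional progress through training, in epochs.
    t = epoch + step / steps_per_epoch
    if t < warmup_epochs:
        return base * t / warmup_epochs               # linear warmup 0 -> base
    return base * 0.1 ** sum(t >= m for m in steps)   # x0.1 at epochs 30, 60, 80
```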
Training loop.
import torch
import torch.nn.functional as F

# Assumes torch.distributed is initialised, local_rank is set, and the loaders use
# a DistributedSampler. resnet50() is e.g. torchvision.models.resnet50 (or the sketch above).
model = resnet50().cuda()
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
opt = torch.optim.SGD(no_wd_groups(model), lr=0.1, momentum=0.9, nesterov=True,
                      weight_decay=1e-4)
scaler = torch.amp.GradScaler()

for epoch in range(90):
    train_sampler.set_epoch(epoch)   # reshuffle shards across ranks each epoch
    model.train()
    for step, (x, y) in enumerate(train_loader):
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        lr = warmup_step_lr(epoch, step, base=0.1, warmup_epochs=5,
                            steps=[30, 60, 80])
        for g in opt.param_groups:
            g["lr"] = lr
        opt.zero_grad(set_to_none=True)
        with torch.amp.autocast("cuda", dtype=torch.float16):
            logits = model(x)
            loss = F.cross_entropy(logits, y, label_smoothing=0.1)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
    validate(model, val_loader)      # report top-1, top-5 every epoch
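validate is referenced but not shown; a minimal sketch (under DDP, run it on rank 0 over the full validation set or all-reduce the counters):

```python
import torch

@torch.no_grad()
def validate(model, loader):
    model.eval()
    top1 = top5 = n = 0
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        logits = model(x)
        _, pred = logits.topk(5, dim=1)        # (B, 5) highest-scoring classes
        correct = pred.eq(y.unsqueeze(1))
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        n += y.size(0)
    print(f"top-1 {100 * top1 / n:.2f}%  top-5 {100 * top5 / n:.2f}%")
```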
Compute estimate. ResNet-50 forward+backward at 224×224 is roughly 8.2 GFLOPs/image.
90 epochs × 1.28M images × 8.2 GFLOPs ≈ $9.5\!\times\!10^{17}$ FLOPs. Achieved throughput
sits far below the fp16 tensor-core peak (125 TFLOPS per V100) because ResNet-50 is
memory- and input-pipeline-bound: expect roughly ~16 wall-clock hours on 8×V100, and
~6 hours on 8×A100 with an efficient data loader.
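The back-of-the-envelope arithmetic, using only the figures quoted above:

```python
# Total training FLOPs and the sustained rate implied by ~16 h on 8xV100.
total_flops = 90 * 1.28e6 * 8.2e9          # ~9.4e17 FLOPs for the full run
sustained = total_flops / (16 * 3600)      # ~1.6e13 FLOP/s across 8 GPUs
print(f"{total_flops:.2e} FLOPs, ~{sustained / 1e12:.0f} TFLOPS sustained")
```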
Pitfalls. BatchNorm with a very small per-GPU batch (≤8) hurts accuracy; use SyncBN
(see the one-liner below) or keep the batch ≥32/GPU. Forgetting to disable weight decay
on BN params loses ~0.5% top-1. Setting the RandomResizedCrop scale lower bound below
0.08 over-regularises; raising it above ~0.7 leaves too little augmentation and also hurts.
Validating only every N epochs hides LR-step regressions; validate every epoch.
Mixed precision can underflow gradients early in training; keep GradScaler enabled.
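Converting to SyncBN when the per-GPU batch must stay small is a one-liner (standard PyTorch call, applied before wrapping the model in DistributedDataParallel):

```python
# Replace every BatchNorm2d with a synchronised version that reduces
# batch statistics across all DDP ranks.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```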
Related terms: Convolutional Neural Network, ResNet, Gradient Descent, Mixed Precision Training
Discussed in:
- Chapter 8: Unsupervised Learning, Convolutional Networks
- Chapter 10: Training & Optimisation, Training Optimisation