Recipe for the canonical ResNet-50 image classifier on ImageNet-1k (1.28M training images, 1000 classes). Optimises the cross-entropy loss
$$\mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{c=1}^{1000} y_{i,c}\log p_{i,c}$$
where $p_{i,c} = \mathrm{softmax}(f_\theta(x_i))_c$ and $y_{i,c}$ is the one-hot label. Reproduces He et al. (2016) at ~76.0% top-1 / 92.9% top-5.
Data pipeline.
- ImageNet-1k from ILSVRC2012 (~140GB). Decode JPEGs with Pillow-SIMD or DALI for throughput; on 8 GPUs, the data loader, not the GPU, is usually the bottleneck.
- Training augmentation per image: RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)) → RandomHorizontalFlip(p=0.5) → ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4) (optional).
- Normalise with mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225].
- Validation: Resize(256) → CenterCrop(224) → normalise.
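For reference, a minimal torchvision version of these pipelines (a DALI pipeline is analogous; drop ColorJitter for the plain-augmentation baseline):

```python
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

# Training: random crop/scale + flip (+ optional colour jitter), then normalise.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),  # optional
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Validation: deterministic resize + centre crop, then normalise.
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```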
Model. Standard ResNet with 50 layers: stem (7×7 conv, stride 2 → maxpool) →
4 stages of bottleneck blocks [3, 4, 6, 3] → global avg pool → linear(2048→1000).
Each bottleneck:
import torch.nn as nn

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, in_c, c, stride=1):
        super().__init__()
        # 1x1 reduce -> 3x3 (possibly strided) -> 1x1 expand, BN after each conv
        self.conv1 = nn.Conv2d(in_c, c, 1, bias=False); self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, stride, 1, bias=False); self.bn2 = nn.BatchNorm2d(c)
        self.conv3 = nn.Conv2d(c, c * 4, 1, bias=False); self.bn3 = nn.BatchNorm2d(c * 4)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut whenever the spatial size or channel count changes
        self.shortcut = (nn.Sequential(
            nn.Conv2d(in_c, c * 4, 1, stride, bias=False), nn.BatchNorm2d(c * 4)
        ) if stride != 1 or in_c != c * 4 else nn.Identity())

    def forward(self, x):
        r = self.shortcut(x)
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.relu(self.bn2(self.conv2(x)))
        x = self.bn3(self.conv3(x))
        return self.relu(x + r)
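One way the full network could be assembled from this block (a minimal stand-in for torchvision's resnet50, reusing the Bottleneck class and nn import above; stage widths and strides follow the [3, 4, 6, 3] layout):

```python
class ResNet50(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stem: 7x7/2 conv -> BN -> ReLU -> 3x3/2 max pool
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        # Four stages of bottlenecks: [3, 4, 6, 3] blocks, widths 64/128/256/512
        self.stages = nn.Sequential(
            self._make_stage(64, 64, 3, stride=1),
            self._make_stage(256, 128, 4, stride=2),
            self._make_stage(512, 256, 6, stride=2),
            self._make_stage(1024, 512, 3, stride=2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    @staticmethod
    def _make_stage(in_c, c, n_blocks, stride):
        # First block downsamples (and projects); the rest keep shape.
        blocks = [Bottleneck(in_c, c, stride)]
        blocks += [Bottleneck(c * 4, c) for _ in range(n_blocks - 1)]
        return nn.Sequential(*blocks)

    def forward(self, x):
        x = self.pool(self.stages(self.stem(x)))
        return self.fc(x.flatten(1))
```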
Initialise the last BN in each residual block with $\gamma=0$ ("zero-init residual"); this gives a clean +0.2% top-1.
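One way to apply this, assuming the Bottleneck module above (torchvision's resnet50 exposes the same idea via zero_init_residual=True):

```python
def zero_init_residual(model: nn.Module) -> None:
    # Start each residual branch as the identity by zeroing the last BN's scale.
    for m in model.modules():
        if isinstance(m, Bottleneck):
            nn.init.zeros_(m.bn3.weight)
```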
Optimiser and schedule.
- SGD with Nesterov momentum $\mu=0.9$, weight decay $10^{-4}$ (no decay on BN $\gamma,\beta$ or biases).
- Linear warmup from 0 to 0.1 over the first 5 epochs, then step decay ×0.1 at epochs 30, 60, 80. Total 90 epochs.
- With batch size 256, base LR is 0.1. Use linear scaling: peak LR $= 0.1 \cdot B/256$.
- Mixed precision (torch.amp.autocast(dtype=torch.float16) + GradScaler).
- 8 GPUs, distributed via DistributedDataParallel, batch 32/GPU (global batch 256).
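The loop below calls two helpers, no_wd_groups and warmup_step_lr, that are not defined elsewhere in this recipe; one plausible implementation of each, matching the signatures used below (the 5005 steps/epoch default assumes 1.28M images at global batch 256):

```python
def no_wd_groups(model, weight_decay=1e-4):
    # Two parameter groups: decay on conv/linear weights only,
    # no decay on BatchNorm gamma/beta or biases (anything with ndim <= 1).
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.ndim <= 1 else decay).append(p)
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]


def warmup_step_lr(epoch, step, base, warmup_epochs, steps, steps_per_epoch=5005):
    # Fractional progress through training, in epochs.
    t = epoch + step / steps_per_epoch
    if t < warmup_epochs:
        return base * t / warmup_epochs               # linear warmup 0 -> base
    return base * 0.1 ** sum(t >= m for m in steps)   # x0.1 at epochs 30, 60, 80
```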
Training loop.
import torch
import torch.nn.functional as F

# Assumes torch.distributed is initialised, local_rank is set, and the loaders use
# a DistributedSampler. resnet50() is e.g. torchvision.models.resnet50 (or the sketch above).
model = resnet50().cuda()
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
opt = torch.optim.SGD(no_wd_groups(model), lr=0.1, momentum=0.9, nesterov=True,
                      weight_decay=1e-4)
scaler = torch.amp.GradScaler()

for epoch in range(90):
    train_sampler.set_epoch(epoch)   # reshuffle shards across ranks each epoch
    model.train()
    for step, (x, y) in enumerate(train_loader):
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        lr = warmup_step_lr(epoch, step, base=0.1, warmup_epochs=5,
                            steps=[30, 60, 80])
        for g in opt.param_groups:
            g["lr"] = lr
        opt.zero_grad(set_to_none=True)
        with torch.amp.autocast("cuda", dtype=torch.float16):
            logits = model(x)
            loss = F.cross_entropy(logits, y, label_smoothing=0.1)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
    validate(model, val_loader)      # report top-1, top-5 every epoch
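validate is referenced but not shown; a minimal sketch (under DDP, run it on rank 0 over the full validation set or all-reduce the counters):

```python
import torch

@torch.no_grad()
def validate(model, loader):
    model.eval()
    top1 = top5 = n = 0
    for x, y in loader:
        x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
        logits = model(x)
        _, pred = logits.topk(5, dim=1)        # (B, 5) highest-scoring classes
        correct = pred.eq(y.unsqueeze(1))
        top1 += correct[:, 0].sum().item()
        top5 += correct.any(dim=1).sum().item()
        n += y.size(0)
    print(f"top-1 {100 * top1 / n:.2f}%  top-5 {100 * top5 / n:.2f}%")
```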
Compute estimate. ResNet-50 forward+backward at 224×224 is roughly 8.2 GFLOPs/image.
90 epochs × 1.28M images × 8.2 GFLOPs ≈ $9.5\!\times\!10^{17}$ FLOPs. Achieved throughput
sits far below the fp16 tensor-core peak (125 TFLOPS per V100) because ResNet-50 is
memory- and input-pipeline-bound: expect roughly ~16 wall-clock hours on 8×V100, and
~6 hours on 8×A100 with an efficient data loader.
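The back-of-the-envelope arithmetic, using only the figures quoted above:

```python
# Total training FLOPs and the sustained rate implied by ~16 h on 8xV100.
total_flops = 90 * 1.28e6 * 8.2e9          # ~9.4e17 FLOPs for the full run
sustained = total_flops / (16 * 3600)      # ~1.6e13 FLOP/s across 8 GPUs
print(f"{total_flops:.2e} FLOPs, ~{sustained / 1e12:.0f} TFLOPS sustained")
```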
Pitfalls. BatchNorm with a very small per-GPU batch (≤8) hurts accuracy; use SyncBN
(see the one-liner below) or keep the batch ≥32/GPU. Forgetting to disable weight decay
on BN params loses ~0.5% top-1. Setting the RandomResizedCrop scale lower bound below
0.08 over-regularises; raising it above ~0.7 leaves too little augmentation and also hurts.
Validating only every N epochs hides LR-step regressions; validate every epoch.
Mixed precision can underflow gradients early in training; keep GradScaler enabled.
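Converting to SyncBN when the per-GPU batch must stay small is a one-liner (standard PyTorch call, applied before wrapping the model in DistributedDataParallel):

```python
# Replace every BatchNorm2d with a synchronised version that reduces
# batch statistics across all DDP ranks.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```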
Related terms: Convolutional Neural Network, ResNet, Gradient Descent, Mixed Precision Training
Discussed in:
- Chapter 8: Unsupervised Learning, Convolutional Networks
- Chapter 10: Training & Optimisation, Training Optimisation