Exercises
1. Compute the output spatial size of a convolutional layer with input size 224×224, kernel size 7×7, padding 3, stride 2, and dilation 1. Show your working.
2. Repeat Exercise 1 with dilation 2 and explain why the answer differs.
3. A 3×3 convolution has $C_{\text{in}} = 64$ input channels and $C_{\text{out}} = 128$ output channels. How many learnable parameters does the layer have, including biases?
4. Compute by hand the output of convolving the 4×4 input $$X = \begin{pmatrix} 0 & 1 & 2 & 3 \\ 4 & 5 & 6 & 7 \\ 8 & 9 & 10 & 11 \\ 12 & 13 & 14 & 15 \end{pmatrix}$$ with a 2×2 averaging kernel $\frac{1}{4}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, stride 1, no padding. Verify that the output is 3×3.
5. Show that two 3×3 convolutions stacked have the same theoretical receptive field as one 5×5 convolution. Compare the parameter counts (assume $C$ input and output channels throughout).
6. Three 3×3 convolutions stacked have what theoretical receptive field? How many parameters does this stack have, compared with a single 7×7 convolution of the same input/output channel counts?
7. A CNN has the following layers in sequence: conv 5×5 stride 1, pool 2×2 stride 2, conv 3×3 stride 1, pool 2×2 stride 2. Compute the receptive field of one neuron in the output relative to the input image.
8. Explain the difference between a convolution and a cross-correlation. Why is the distinction practically irrelevant in deep learning?
9. Compute the floating-point operation count (multiply-accumulates) for a 3×3 convolution with 64 input and 128 output channels operating on a $56 \times 56$ feature map. Express your answer in MFLOPs.
10. A depthwise separable convolution with kernel size 3, $C_{\text{in}} = 64$ input channels, $C_{\text{out}} = 128$ output channels, applied to a $56 \times 56$ feature map: compute the MFLOPs and the parameter count, and compare with the standard convolution from Exercise 9.
11. State the formula for the gradient of a convolutional layer's output with respect to its input. Why does it have a similar form to the forward pass?
12. Backpropagate through max pooling: explain how gradients flow when a 4×4 input is pooled with a 2×2 window of stride 2 and one of the four windows had its maximum at position (1, 0).
13. Backpropagate through average pooling: derive the gradient passed back through a 2×2 average pool of stride 2.
14. Why does global average pooling reduce overfitting more effectively than a flatten-and-fully-connected head with the same number of output units?
15. Explain the degradation problem that motivated ResNet. Why is it not the same as overfitting?
16. Derive $\partial y / \partial x$ for a residual block $y = \mathcal{F}(x) + x$. Explain why this expression prevents gradient vanishing.
17. Explain why batch normalisation enables much higher learning rates than the same network without it.
18. What are the three sources of normalisation statistics that batch normalisation, layer normalisation, and group normalisation use? In which contexts is each preferred?
19. Compute the parameter count of ResNet-50, given that the residual blocks at each stage have channel widths 256, 512, 1024, and 2048, with 3, 4, 6, and 3 bottleneck blocks per stage respectively, the stem is a 7×7 conv with 64 output channels, and the head is a linear layer to 1{,}000 classes. (Approximate to within 5%; you may neglect bias terms.)
20. Trace through one forward pass of an Inception module: input $C_{\text{in}} = 256$ channels, four parallel branches with output widths 64, 128, 32, 32. Compute the channel count of the concatenated output.
21. Why is an Inception module roughly an order of magnitude cheaper than a single 5×5 convolution of comparable receptive field? Quantify the saving for $C_{\text{in}} = C_{\text{out}} = 256$.
22. Explain compound scaling in EfficientNet. Why does scaling depth, width, and resolution together work better than scaling any one alone?
23. Explain why the focal loss improves single-stage detection. Compute the focal-loss multiplier for $p_t = 0.1, 0.5, 0.9$ with $\gamma = 2$.
24. Sketch the architecture of Faster R-CNN. Identify the role of the Region Proposal Network and the RoI pooling layer.
25. YOLO predicts $(S \times S \times (5B + C))$ outputs per image. Explain each term and compute the output tensor size for $S = 13, B = 5, C = 80$.
26. Define mean average precision (mAP). What does the COCO mAP@[0.5:0.95] metric add to the simpler PASCAL VOC mAP@0.5?
27. Explain why U-Net's long skip connections are essential for high-resolution segmentation. What goes wrong without them?
28. Atrous Spatial Pyramid Pooling (ASPP) uses parallel dilated convolutions at rates 6, 12, 18. Compute the receptive-field size of a 3×3 kernel with dilation 18.
29. Distinguish semantic segmentation, instance segmentation, and panoptic segmentation. Give a concrete example of an image and the output of each.
30. Outline a transfer-learning recipe for a 200-image-per-class medical X-ray classification task starting from an ImageNet-pretrained ResNet-50. Justify each design choice.
31. Explain why CLIP enables zero-shot image classification, and contrast this with the supervised feature-extraction approach of Section 11.7.
32. In the CIFAR-10 ResNet of Section 11.8, suppose we replace BatchNorm2d with GroupNorm (groups = 8). What changes about training, and in what circumstances would GroupNorm be preferred?
33. Modify the CIFAR-10 ResNet to use bottleneck blocks (1×1, 3×3, 1×1) instead of basic blocks. Compute the new parameter count for $n = 3$ blocks per stage with channel expansion factor 4.
34. Discuss the trade-offs between Vision Transformers and CNNs for a problem with 5{,}000 labelled training images. Which is likely to win, and why?
Solution sketches
Solution 1. Using $N_{\text{out}} = \lfloor (N + 2p - d(K-1) - 1)/s \rfloor + 1 = \lfloor (224 + 6 - 6 - 1)/2 \rfloor + 1 = \lfloor 223/2 \rfloor + 1 = 111 + 1 = 112$. Output is 112×112.
Solution 2. With $d = 2$: $N_{\text{out}} = \lfloor (224 + 6 - 2 \cdot 6 - 1)/2 \rfloor + 1 = \lfloor 217/2 \rfloor + 1 = 109$. Output is 109×109. Dilation enlarges the kernel's span to $1 + d(K-1) = 13$ pixels (it still has only 7 taps, just spread apart), so it consumes more of the input than the undilated kernel.
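Both sizes can be checked in a few lines of PyTorch (a minimal sketch; the channel counts are arbitrary, since only the spatial dimensions matter here):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # dummy 224x224 input

for dilation, expected in [(1, 112), (2, 109)]:
    conv = nn.Conv2d(3, 8, kernel_size=7, stride=2, padding=3, dilation=dilation)
    y = conv(x)
    print(dilation, tuple(y.shape[-2:]), expected)
    # dilation=1 -> (112, 112); dilation=2 -> (109, 109)
```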
Solution 4. Each output is the average of a $2 \times 2$ window. $Y[0,0] = (0+1+4+5)/4 = 2.5$. Continuing: $Y = \begin{pmatrix} 2.5 & 3.5 & 4.5 \\ 6.5 & 7.5 & 8.5 \\ 10.5 & 11.5 & 12.5 \end{pmatrix}$.
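The same arithmetic is easy to verify with `torch.nn.functional.conv2d` (a minimal sketch; note that `conv2d` actually computes a cross-correlation, which is irrelevant here because the averaging kernel is symmetric, cf. Exercise 8):

```python
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)  # the 4x4 input 0..15
w = torch.full((1, 1, 2, 2), 0.25)                              # 2x2 averaging kernel
print(F.conv2d(x, w))
# tensor([[[[ 2.5,  3.5,  4.5],
#           [ 6.5,  7.5,  8.5],
#           [10.5, 11.5, 12.5]]]])
```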
Solution 5. Two 3×3 convolutions: receptive field $3 + (3-1) = 5$. Parameters: $2 \cdot 3 \cdot 3 \cdot C \cdot C = 18 C^2$. One 5×5 convolution: $25 C^2$. The stack saves $7 C^2$ parameters and adds an extra non-linearity in the middle.
Solution 6. Three 3×3 convolutions: receptive field $3 + 2 + 2 = 7$. Parameters: $27 C^2$. One 7×7: $49 C^2$. The stack saves $22 C^2$ parameters and inserts two extra non-linearities.
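The parameter counts in Solutions 5 and 6 can be confirmed by instantiating the layers and counting weights (a sketch with an arbitrary $C = 64$ and biases disabled, to match the $C^2$ expressions):

```python
import torch.nn as nn

C = 64  # arbitrary channel width

def n_params(*layers):
    return sum(p.numel() for layer in layers for p in layer.parameters())

conv3 = lambda: nn.Conv2d(C, C, 3, padding=1, bias=False)
print(n_params(conv3(), conv3()))                 # 18*C^2 = 73728
print(n_params(nn.Conv2d(C, C, 5, bias=False)))   # 25*C^2 = 102400
print(n_params(conv3(), conv3(), conv3()))        # 27*C^2 = 110592
print(n_params(nn.Conv2d(C, C, 7, bias=False)))   # 49*C^2 = 200704
```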
Solution 9. $\text{MACs} = H \cdot W \cdot K^2 \cdot C_{\text{in}} \cdot C_{\text{out}} = 56 \cdot 56 \cdot 9 \cdot 64 \cdot 128 \approx 2.31 \times 10^8 = 231$ MFLOPs (counting each multiply-accumulate as one FLOP).
Solution 10. Depthwise: $56 \cdot 56 \cdot 9 \cdot 64 = 1.81 \times 10^6$ MACs. Pointwise: $56 \cdot 56 \cdot 64 \cdot 128 = 25.7 \times 10^6$ MACs. Total $\approx 27.5$ MFLOPs, an 8.4× saving over the 231 MFLOPs of the standard convolution. Parameter count: depthwise has $9 \cdot 64 = 576$ weights, pointwise has $64 \cdot 128 = 8{,}192$, total 8{,}768 versus the standard layer's $9 \cdot 64 \cdot 128 = 73{,}728$.
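Because a stride-1, bias-free convolution performs exactly (parameters × output positions) MACs, both Solutions 9 and 10 can be reproduced from the layer definitions (a sketch assuming "same" padding so the feature map stays 56×56):

```python
import torch.nn as nn

H = W = 56
C_in, C_out = 64, 128

def n_params(m):
    return sum(p.numel() for p in m.parameters())

standard  = nn.Conv2d(C_in, C_out, 3, padding=1, bias=False)
depthwise = nn.Conv2d(C_in, C_in, 3, padding=1, groups=C_in, bias=False)
pointwise = nn.Conv2d(C_in, C_out, 1, bias=False)

p_std = n_params(standard)                         # 73728
p_sep = n_params(depthwise) + n_params(pointwise)  # 576 + 8192 = 8768
print(p_std, p_sep)
print(p_std * H * W / 1e6, p_sep * H * W / 1e6)    # ~231.2 vs ~27.5 MMACs
```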
Solution 16. For $y = \mathcal{F}(x) + x$, $\partial y / \partial x = \partial \mathcal{F}/\partial x + I$. The identity $I$ provides an unconditional path for gradient flow, so the upstream gradient can never be fully attenuated by the multiplicative effect of small Jacobian terms in $\partial \mathcal{F}/\partial x$. Stacking $L$ residual blocks gives $\partial y_L / \partial x_0 = \prod_\ell (\partial \mathcal{F}_\ell / \partial x_{\ell-1} + I)$, which expands into a sum that includes the bare identity, guaranteeing a nonzero contribution.
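A tiny autograd experiment makes the identity path concrete: even when the residual branch's weights are scaled towards zero, the gradient reaching the input stays close to one rather than vanishing (a sketch using a hypothetical one-layer branch, not the book's block):

```python
import torch
import torch.nn as nn

x = torch.randn(4, requires_grad=True)
branch = nn.Linear(4, 4)          # stands in for F(x)
with torch.no_grad():
    branch.weight.mul_(1e-3)      # nearly-dead residual branch
    branch.bias.zero_()

y = branch(x) + x                 # y = F(x) + x
y.sum().backward()
print(x.grad)                     # each entry is ~1: the identity term dominates
```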
Solution 23. Focal-loss multipliers $(1 - p_t)^\gamma$ with $\gamma = 2$: at $p_t = 0.1$ the multiplier is $0.81$; at $p_t = 0.5$ it is $0.25$; at $p_t = 0.9$ it is $0.01$. The well-classified example's multiplier is roughly eighty times smaller than the hard example's (and its full loss is smaller still, once the $-\log p_t$ term is included), so the gradient is dominated by hard examples.
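The multipliers, and their effect on the per-example loss, are a one-liner to reproduce (a minimal sketch of the focal weighting only, not a full detection loss):

```python
import math

gamma = 2.0
for p_t in (0.1, 0.5, 0.9):
    weight = (1 - p_t) ** gamma       # focal down-weighting factor
    loss = weight * -math.log(p_t)    # focal loss for a single example
    print(f"p_t={p_t}: weight={weight:.2f}, loss={loss:.4f}")
# weights: 0.81, 0.25, 0.01
```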
Solution 25. For each of the $S \times S$ grid cells the network predicts $B$ bounding boxes (each described by 4 coordinates plus 1 confidence = 5 numbers) and $C$ class probabilities. Total: $13 \cdot 13 \cdot (5 \cdot 5 + 80) = 169 \cdot 105 = 17{,}745$ numbers per image.
Solution 26. The PASCAL VOC metric uses one IoU threshold (0.5). COCO averages over ten IoU thresholds (0.50, 0.55, …, 0.95), rewarding methods that produce tight boxes rather than merely well-positioned loose ones. This makes COCO substantially harder.
Solution 28. A 3×3 kernel with dilation 18 covers a $1 + 18 \cdot (3 - 1) = 37$-pixel-wide region.
Solution 30. (i) Replace the 1{,}000-class head with a fresh 2-class linear layer. (ii) Freeze the backbone and train only the head for 5–10 epochs at $\text{lr} = 10^{-3}$. (iii) Unfreeze the backbone and continue at $\text{lr} = 10^{-4}$ with weight decay $10^{-4}$, augmentation matching X-ray characteristics (random crops, mild rotation, no horizontal flip if anatomy is asymmetric), and a cosine schedule for 50 epochs. (iv) Use class-balanced sampling if the disease class is rare. The small per-class data argues against from-scratch training; X-ray statistics differ enough from natural images that fine-tuning all layers is preferable to freezing most of them; a small learning rate avoids destroying ImageNet features that capture useful low-level structure (edges, textures).
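Steps (i)–(iii) translate into a few lines of PyTorch (a sketch assuming a recent torchvision and a two-class label set; data loading, augmentation, and the training loops are omitted):

```python
import torch
import torch.nn as nn
from torchvision import models

# (i) ImageNet-pretrained backbone, fresh 2-class head
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 2)

# (ii) freeze the backbone, train only the head at lr = 1e-3
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
head_opt = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)

# (iii) then unfreeze everything and fine-tune at lr = 1e-4 with a cosine schedule
for p in model.parameters():
    p.requires_grad = True
full_opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(full_opt, T_max=50)
```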
Solution 32. GroupNorm normalises within each example, so its statistics do not depend on the batch. This makes training stable at very small batch sizes (e.g. detection or segmentation systems running with batch 1 or 2 per GPU), where BatchNorm's batch statistics are too noisy. It also avoids the train/eval discrepancy that BN introduces (running averages versus per-batch statistics). On large-batch image classification, BN typically still wins by a small margin.
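In code the swap is a single line wherever the block builds its normalisation layer (a sketch; `make_norm` is a hypothetical helper, not necessarily how Section 11.8's code is organised; the group count must divide the channel count):

```python
import torch.nn as nn

def make_norm(channels, kind="batch"):
    # BatchNorm2d: statistics per channel, computed across the batch
    # GroupNorm:   statistics per group of channels, computed within each example
    if kind == "batch":
        return nn.BatchNorm2d(channels)
    return nn.GroupNorm(num_groups=8, num_channels=channels)

norm1 = make_norm(64, kind="group")  # batch-size-independent statistics
```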
Solution 34. With only 5{,}000 labelled training images, a CNN with ImageNet pretraining will almost certainly outperform a ViT trained from scratch and is likely to outperform a ViT pretrained only on the same 5{,}000 images. The CNN's convolutional inductive bias (translation equivariance, locality) provides strong sample efficiency. ViTs win when pretraining data is abundant, typically tens of millions of images, because they have weaker inductive biases and need data to compensate. For this problem, fine-tuning a ResNet-50 or ConvNeXt-T from ImageNet is the recommended approach; the ViT alternative would require strong pretraining (e.g. CLIP or MAE on a large corpus) to be competitive.