11.5 Object detection

Image classification asks a single question: what is the dominant object in this picture? Object detection asks a harder pair: what objects are present, and where is each one? The output is no longer a single label but a list, one entry per object, each entry consisting of a class name and a bounding box, a rectangle described by four numbers, typically the coordinates of its top-left corner together with its width and height. This shift from "one label per image" to "many labels per image, each tied to a region" is what makes detection the workhorse of practical computer vision. Self-driving cars need to know where every pedestrian, cyclist, traffic light and parked van is, frame by frame. Medical imaging needs to localise tumours, fractures, microcalcifications, polyps. Retail uses detection to count stock on shelves and to monitor customer flow. Security systems detect faces, vehicles and number plates. Agriculture detects ripe fruit, weeds and livestock from drone imagery. None of these tasks reduces to "this picture is mostly a dog"; all of them require a model that can produce a structured list of objects with locations.

Detection sits between classification (label per image) and segmentation (label per pixel) in both granularity and difficulty.

Symbols Used Here
$\text{IoU}$: intersection over union, the overlap ratio between two boxes
$\text{mAP}$: mean average precision, the area under the precision–recall curve averaged over classes
$B$: a bounding box, typically $(x, y, w, h)$ or $(x_1, y_1, x_2, y_2)$; in the YOLO discussion, the number of boxes predicted per grid cell
$C$: the class probability assigned by the detector to a candidate region; in the YOLO discussion, the number of classes
$N$: the number of object queries (DETR) or anchors per location (anchor-based detectors)

Two-stage detectors: the R-CNN family

The first deep-learning detection system to beat traditional sliding-window pipelines was R-CNN (Girshick, Donahue, Darrell and Malik, Rich feature hierarchies for accurate object detection, CVPR 2014). It split the problem into two stages and bolted a CNN onto the second:

  1. Region proposals. A class-agnostic algorithm, selective search (Uijlings et al., 2013), produced about 2,000 candidate boxes per image by hierarchically grouping super-pixels. The proposals were not learned; they were a fixed pre-processing step intended to over-cover wherever objects might plausibly be.
  2. Classification and regression. Each proposal was warped to 227×227 pixels and pushed independently through an AlexNet pretrained on ImageNet, producing a 4,096-dimensional feature. A bank of one-versus-rest linear SVMs classified each feature, and a class-specific linear regressor adjusted the box.

R-CNN was accurate, roughly 54% mean average precision on PASCAL VOC 2010, far ahead of pre-deep methods, but agonisingly slow: 2,000 forward passes per image meant about 47 seconds on a GPU and far longer on CPU. It also could not be trained end-to-end, because the region proposals were external and the SVMs were separate from the CNN.

Fast R-CNN (Girshick, ICCV 2015) fixed the inefficiency. The backbone CNN now ran once over the whole image, producing a single feature map. For each proposal, a region-of-interest (RoI) pooling layer cropped and resampled the relevant patch of the feature map to a fixed spatial size (e.g. 7×7), so that a shared classifier and box regressor could process every region at fixed cost. Classification and regression shared a multitask loss, and the entire network beyond the proposals was trained jointly. Inference dropped to about 0.3 seconds per image.
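
The RoI pooling step is easy to see in code. The sketch below uses torchvision's `roi_pool` operator on a random feature map with two made-up proposals; the feature-map size, channel count and `spatial_scale` are illustrative rather than values taken from the paper.

```python
import torch
from torchvision.ops import roi_pool

# One shared feature map for the whole image: batch 1, 256 channels,
# 50x50 spatial grid (e.g. an 800x800 image downsampled by 16).
features = torch.randn(1, 256, 50, 50)

# Two region proposals in image coordinates, each given as
# (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([
    [0.,  40.,  60., 300., 420.],
    [0., 500., 100., 780., 260.],
])

# Crop and resample each proposal to a fixed 7x7 grid.
# spatial_scale maps image coordinates onto the 50x50 feature map.
pooled = roi_pool(features, proposals, output_size=(7, 7), spatial_scale=1 / 16)

print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -- fixed cost per region
```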

The remaining bottleneck was selective search itself, which still ate two seconds per image. Faster R-CNN (Ren, He, Girshick, Sun, NeurIPS 2015) replaced it with a learned Region Proposal Network (RPN). The RPN is a small CNN that slides over the shared feature map and predicts, at each spatial position, an "objectness" score plus four box-adjustment numbers for each of a fixed set of anchor boxes, pre-defined templates of various scales and aspect ratios. The RPN and the detection head share the same backbone, so the whole system is trained end-to-end, and inference speed climbs to 5–7 frames per second on a contemporary GPU. Faster R-CNN remains the canonical two-stage detector and the baseline against which most newer detectors are still compared.
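
The anchor grid itself is simple to construct. The following sketch lays a bank of anchors over a feature map in plain NumPy; the stride, scales and aspect ratios follow the commonly quoted Faster R-CNN defaults, but they are design choices rather than anything forced by the architecture.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate (x1, y1, x2, y2) anchors centred on every feature-map cell."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            # Anchor centre in image coordinates.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep the anchor area near scale**2 while varying aspect ratio.
                    w = scale * np.sqrt(1.0 / ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = make_anchors(feat_h=50, feat_w=50)
print(anchors.shape)  # (50 * 50 * 9, 4) = (22500, 4) candidate boxes per image
```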

Mask R-CNN (He, Gkioxari, Dollár, Girshick, ICCV 2017) adds a third small head per RoI that predicts a binary mask, producing instance segmentation alongside boxes. The architecture is otherwise Faster R-CNN with a refined RoI operator (RoIAlign, which uses bilinear interpolation rather than coarse quantisation when sampling the feature map). We shall return to Mask R-CNN in §11.6.

Single-stage detectors: YOLO, SSD and successors

Two-stage detectors are accurate but architecturally fussy: the proposal step, the RoI pool, the separate heads. YOLO (Redmon, Divvala, Girshick, Farhadi, You Only Look Once, CVPR 2016) reframed detection as a single regression. Divide the image into an $S \times S$ grid; let each cell predict $B$ bounding boxes (each with four box coordinates plus a confidence score) and one class distribution over $C$ classes. The network is one CNN whose output is a tensor of shape $(S \times S \times (5B + C))$, all produced in one forward pass. There is no proposal stage, no per-region feature extraction, no RoI pooling. On the original 2016 hardware (a Titan X) YOLO ran at 45 frames per second, fast enough for live video; a "fast YOLO" variant reached 155 fps. The price was accuracy on small objects and in dense crowds, where the coarse grid limited how many distinct detections could be produced from one region.
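
To make the output layout concrete, the sketch below builds a stand-in $S \times S \times (5B + C)$ tensor with the original YOLO configuration ($S = 7$, $B = 2$, $C = 20$) and decodes one grid cell into candidate boxes. The coordinate handling is simplified relative to the paper's exact parameterisation.

```python
import numpy as np

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes (original YOLO)
output = np.random.rand(S, S, 5 * B + C)  # stand-in for one forward pass

def decode_cell(cell_pred, row, col, img_size=448):
    """Turn one grid cell's raw prediction into boxes in pixel coordinates.
    Simplified: treats (x, y) as offsets within the cell and (w, h) as
    fractions of the whole image."""
    boxes = []
    for b in range(B):
        x, y, w, h, conf = cell_pred[5 * b: 5 * b + 5]
        cx = (col + x) / S * img_size      # box centre in pixels
        cy = (row + y) / S * img_size
        bw, bh = w * img_size, h * img_size
        boxes.append((cx - bw / 2, cy - bh / 2, bw, bh, conf))
    class_probs = cell_pred[5 * B:]        # one class distribution per cell
    return boxes, class_probs.argmax()

boxes, cls = decode_cell(output[3, 4], row=3, col=4)
print(len(boxes), cls)  # 2 candidate boxes and one class index from a single cell
```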

YOLO has been iterated relentlessly. YOLOv2 (2016) added batch norm, anchor boxes and a higher-resolution classifier. YOLOv3 (2018) introduced multi-scale prediction over three feature levels and the Darknet-53 backbone. YOLOv4 (Bochkovskiy, Wang, Liao, 2020) brought CSPNet and a long catalogue of training tricks (Mosaic augmentation, MixUp, CIoU loss, self-adversarial training). YOLOv5 (Ultralytics, 2020) was a PyTorch port that became the most-used version in industry. YOLOv7 (2022), YOLOv8 (2023), YOLOv10 (2024) and YOLOv11 (2024) continued to refine the backbone and the loss and increasingly favoured anchor-free heads, with further versions (YOLOv12, YOLOv13) appearing through 2025–2026. The label "YOLO" is now less an architecture than a family of fast single-stage detectors aimed at real-time deployment, and it is what most industrial computer-vision pipelines actually run.

SSD (Liu, Anguelov, Erhan, Szegedy, Reed, Fu, Berg, Single Shot MultiBox Detector, ECCV 2016) made a complementary choice. Rather than predicting from one final feature map, SSD predicts from several feature maps at different depths in the backbone, high-resolution early maps for small objects, low-resolution deep maps for large objects, using a fixed bank of anchor boxes per location at every scale. SSD fitted neatly between Faster R-CNN and YOLO on the speed-accuracy curve and popularised the idea that scale should be handled inside the detector rather than by image pyramids.

RetinaNet (Lin, Goyal, Girshick, He, Dollár, Focal Loss for Dense Object Detection, ICCV 2017) tackled the chronic weakness of single-stage detectors: foreground-background imbalance. With tens of thousands of anchors per image and only a handful of objects, the cross-entropy loss is dominated by easy negatives, swamping the gradient on rare positives. The focal loss

$$ \mathcal{L}_{\text{focal}}(p_t) = -\alpha_t (1 - p_t)^\gamma \log p_t, $$

down-weights well-classified examples by $(1 - p_t)^\gamma$. With $\gamma = 2$, an example with predicted probability $p_t = 0.9$ contributes only $0.01\times$ as much loss as ordinary cross-entropy would assign. RetinaNet was the first single-stage detector to match two-stage accuracy while keeping single-stage speed, and the focal loss now appears across the wider deep-learning toolbox whenever a long-tailed distribution threatens to dominate training.
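
The focal loss is a few lines of PyTorch. The sketch below implements the binary (sigmoid) form used by RetinaNet; torchvision also ships an equivalent `sigmoid_focal_loss`, so this is purely illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss for binary (sigmoid) classification, as in RetinaNet.
    logits and targets have the same shape; targets are 0 or 1."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balance weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# An easy negative (large negative logit) contributes almost nothing;
# a hard positive (logit near zero) keeps nearly its full cross-entropy.
logits = torch.tensor([-4.0, 0.1])
targets = torch.tensor([0.0, 1.0])
print(focal_loss(logits, targets))
```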

A subtler trend is the move to anchor-free heads. Anchor-based detectors require a designer to choose anchor scales and aspect ratios, and the choices materially affect accuracy. FCOS (Tian, Shen, Chen, He, ICCV 2019) predicts, at each spatial position, a 4-tuple $(l, t, r, b)$ giving distances to the four sides of the enclosing box, plus a "centre-ness" score that suppresses predictions far from object centres. CenterNet (Zhou, Wang, Krähenbühl, Objects as Points, 2019) treats each object as a single keypoint at its centre and regresses its size from that keypoint. Modern YOLO variants now adopt anchor-free or hybrid heads as a matter of course.
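
The anchor-free target encoding is equally compact. The sketch below computes the FCOS-style $(l, t, r, b)$ distances and the centre-ness score for one feature-map point and one ground-truth box; the coordinates are invented for illustration.

```python
import numpy as np

def fcos_targets(point, box):
    """Encode a ground-truth box as distances from a feature-map point.
    point: (px, py) in image coordinates; box: (x1, y1, x2, y2)."""
    px, py = point
    x1, y1, x2, y2 = box
    l, t, r, b = px - x1, py - y1, x2 - px, y2 - py
    # Centre-ness is 1 at the box centre and decays toward the edges,
    # suppressing low-quality predictions made far from the centre.
    centerness = np.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))
    return (l, t, r, b), centerness

print(fcos_targets(point=(120, 200), box=(80, 150, 220, 320)))
# Points near the centre get centre-ness near 1; off-centre points are down-weighted.
```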

DETR: detection as set prediction

DETR (Carion, Massa, Synnaeve, Usunier, Kirillov, Zagoruyko, End-to-End Object Detection with Transformers, ECCV 2020) reformulated detection as set prediction. A CNN backbone produces a feature map; a Transformer encoder–decoder consumes it together with a fixed number $N$ of learned object queries, vectors that act as slots, each of which will produce one prediction. Each query emits either an object (a class label and a box) or a "no object" token. The number of queries $N$ is set well above the maximum expected number of objects in any image; padding to $N$ removes the variable-length output that ordinary detectors must contend with.

The training loss is the conceptual core. Since the model predicts an unordered set of $N$ items and the ground truth is an unordered set of objects (also padded to $N$ with "no object"), the loss must be permutation-invariant. DETR uses the Hungarian algorithm to find the bipartite matching between predictions and targets that minimises the total assignment cost, where the cost combines a classification term and a box-regression term. Once the matching is fixed, a per-pair loss is applied. This single trick makes anchors and non-maximum suppression unnecessary: each ground-truth object has exactly one predictor, and duplicates are penalised by being matched to "no object".
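
A toy version of the matching step, using SciPy's `linear_sum_assignment`, is shown below. The assignment cost here is simplified to a negative class probability plus an L1 box distance, whereas DETR's actual cost also includes a generalised-IoU term; the numbers are random stand-ins.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# 4 predictions (queries), 2 ground-truth objects, 3 classes.
pred_probs = np.random.dirichlet(np.ones(3), size=4)  # per-query class probabilities
pred_boxes = np.random.rand(4, 4)                     # normalised (cx, cy, w, h)
gt_labels = np.array([0, 2])
gt_boxes = np.random.rand(2, 4)

# Cost of assigning prediction i to ground-truth j: low when the query gives
# the right class high probability and its box is close to the target box.
cls_cost = -pred_probs[:, gt_labels]                                  # shape (4, 2)
box_cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
cost = cls_cost + box_cost

rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
print(list(zip(rows, cols)))              # each ground truth gets exactly one predictor;
                                          # unmatched queries are trained toward "no object"
```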

DETR is conceptually clean but slow to converge; the original required 500 training epochs on COCO. Deformable DETR (Zhu et al., 2021) replaced full attention with sparse, deformable attention that samples a small number of feature locations per query, cutting training to about 50 epochs and improving small-object accuracy. DINO (Zhang et al., 2023) added contrastive denoising and mixed query selection, reaching state-of-the-art on COCO at 12–24 epochs. The DETR family is now the preferred starting point for open-vocabulary detection (Grounding DINO) and unified detection-segmentation systems that connect to text encoders and to the Segment Anything Model (SAM).

Evaluation: IoU, AP and mAP

Detection needs an evaluation protocol that handles boxes, classes and confidences simultaneously. The starting block is intersection over union:

$$ \mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}, $$

the area of overlap divided by the area of union. A predicted box counts as a true positive if its IoU with a ground-truth box of the same class exceeds a threshold (typically 0.5 on PASCAL VOC), and each ground-truth box can be matched at most once. Predictions left over are false positives; ground-truth boxes left unmatched are false negatives.
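
The formula translates directly into code for axis-aligned $(x_1, y_1, x_2, y_2)$ boxes:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # 2500 / 17500 ≈ 0.143
```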

For each class, sweep the confidence threshold from high to low, recording precision and recall at every threshold. The average precision (AP) is the area under this precision–recall curve. The mean average precision (mAP) averages AP across all classes. Worked example: suppose a class has 100 ground-truth boxes and the detector produces 80 candidates ranked by confidence. Walking down the ranked list we mark each as TP (IoU ≥ 0.5 with an unmatched ground truth) or FP, and at every step we compute precision = TP/(TP+FP) and recall = TP/100. The resulting curve might rise sharply at the top of the list (high precision when only the most confident predictions count), then bend down as we accept lower-confidence predictions. The area under this curve is the AP for that class. Average across all classes for mAP.
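
The same walk down the ranked list can be written as a short function. The sketch below computes a non-interpolated AP for one class from detections already marked TP or FP; COCO's official metric instead interpolates precision at 101 recall points, so its numbers differ slightly.

```python
import numpy as np

def average_precision(is_tp, num_gt):
    """is_tp: detections for one class, sorted by descending confidence and
    marked True (IoU >= threshold with an unmatched ground truth) or False.
    num_gt: number of ground-truth boxes for that class."""
    is_tp = np.asarray(is_tp)
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # Area under the precision-recall curve: each new true positive adds its
    # precision weighted by the recall step it contributes.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# 8 ranked detections, 5 of them true positives, against 10 ground-truth boxes.
print(average_precision([True, True, False, True, False, True, False, True], num_gt=10))
```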

COCO uses a stricter version: it averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, and reports separately for small, medium and large objects. The headline number "COCO mAP" therefore rewards both detection and tight localisation. As of 2024 the strongest detectors (DINO, Co-DETR) exceed 60% COCO mAP, with real-time models (RT-DETR, YOLOv11) in the mid-fifties, up from the low twenties that Faster R-CNN reached when COCO evaluation began in 2015. The progress is not pure architecture: better backbones (Swin, ConvNeXt), better augmentation (Mosaic, MixUp, scale-jitter), more pre-training data (Objects365 layered onto COCO) and longer training schedules each contributed.

Practical use

  • Self-driving and ADAS: Tesla, Waymo, Cruise and Mobileye all run real-time multi-class detectors at 30+ frames per second on dedicated automotive accelerators. The detector is one stage in a longer perception stack that also includes tracking, depth estimation and trajectory prediction.
  • Medical imaging: lesion localisation in mammography, polyp detection in colonoscopy video, pulmonary nodule detection in chest CT, microaneurysm detection in retinal images. Detection often feeds a downstream segmentation model (e.g. nnU-Net) for precise boundary delineation.
  • Retail and logistics: shelf stock counting, customer flow analytics, pallet-and-box detection on conveyors, defect detection in manufacturing. YOLO variants dominate this market because the latency budget is strict and the classes per task are small.
  • Security and surveillance: face detection (a precursor to face recognition), number-plate detection, person-and-vehicle detection in CCTV. RetinaFace and YOLO are the industry defaults.
  • Agriculture and ecology: ripe-fruit detection on harvesting robots, weed detection for precision spraying, livestock counts and welfare monitoring from drones, wildlife camera-trap analysis.
  • Sport and broadcast: ball and player tracking, automatic camera framing, foul detection. The data are dense and fast, and accuracy on small distant objects is the binding constraint.

The choice of detector follows the deployment budget. If it must run on a phone or on an embedded automotive chip at 60 fps, a small YOLO variant is the default. If it runs in a data centre with no latency constraint and accuracy is paramount, a DINO-class transformer detector is preferred. Two-stage Faster R-CNN with FPN remains the reference baseline for new research and for problems where labelled data are scarce, because its inductive biases (anchors, RoI pool, separate heads) generalise robustly from modest datasets.
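
For the reference baseline, a pretrained Faster R-CNN with FPN can be run in a few lines of torchvision; the weight-enum name below follows recent torchvision releases, so check it against the installed version.

```python
import torch
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    FasterRCNN_ResNet50_FPN_Weights,
)

# Reference two-stage baseline: Faster R-CNN with a ResNet-50 FPN backbone,
# pretrained on COCO.
weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
model = fasterrcnn_resnet50_fpn(weights=weights)
model.eval()

# The model takes a list of 3xHxW float tensors and returns, per image,
# a dict of 'boxes' (x1, y1, x2, y2), 'labels' and 'scores'.
image = torch.rand(3, 480, 640)
with torch.no_grad():
    predictions = model([image])[0]

keep = predictions["scores"] > 0.5  # simple confidence cut-off
print(predictions["boxes"][keep], predictions["labels"][keep])
```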

What you should take away

  1. Detection outputs structured lists: each detected object has a class label and a bounding box. The shift from "one label per image" to "many labels per image with locations" is what makes practical computer vision possible.
  2. The architectural arc has gone from two-stage to single-stage to set prediction: R-CNN → Fast R-CNN → Faster R-CNN → YOLO → RetinaNet → DETR → DINO. Two-stage detectors remain the accuracy reference; single-stage detectors dominate real-time deployment; transformer set-prediction detectors lead the modern leaderboards.
  3. IoU is the spatial currency: it underpins the matching of predictions to ground truth, the post-processing (NMS), the box-regression loss (CIoU) and the evaluation metric (mAP). Master IoU and most of the rest follows.
  4. mAP is the standard benchmark: precision–recall area averaged across classes, with COCO further averaging over IoU thresholds 0.5 to 0.95. Headline progress in detection has been measured almost entirely by COCO mAP since 2014.
  5. Choose detectors by deployment constraint: small-YOLO for edge devices and real-time pipelines, Faster R-CNN with FPN as a robust research baseline, DETR-family models when accuracy on a server beats every other consideration. The architectural family matters less than matching backbone, training data and augmentation to the deployment regime.
