Glossary

Object Detection

Object Detection combines classification and localisation: for each detected object in an image, the model must output both a class label and a bounding box—a tight rectangle enclosing the object. This is substantially harder than classification because the number, size, and aspect ratio of objects vary, and the model must handle occlusion, clutter, and multiple instances of the same class.

The field split early into two-stage and single-stage detectors. R-CNN (2014) and its successor Fast R-CNN classify candidate regions produced by an external proposal algorithm (selective search); Faster R-CNN replaced this with a learned Region Proposal Network, then classifies and refines each proposal. YOLO (You Only Look Once) introduced single-stage detection: divide the image into a grid and predict bounding boxes and classes in a single forward pass, achieving real-time speed. SSD likewise runs in a single stage but predicts boxes at multiple feature-map scales, improving accuracy on small objects. More recently, DETR (DEtection TRansformer) recast detection as a set prediction problem using a transformer encoder-decoder.
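The single-stage idea—each grid cell predicting a box relative to its own position—can be sketched as follows. This is an illustrative toy, not any particular model's layout: the prediction tuple `(tx, ty, w, h, conf)` and the function name `decode_cell` are assumptions, with offsets relative to the cell and sizes as fractions of the image.

```python
def decode_cell(row, col, pred, grid_size, img_w, img_h):
    """Map one grid cell's raw prediction to image-space box corners.

    pred = (tx, ty, w, h, conf): tx, ty are the box centre's offset
    within the cell (0..1); w, h are box dimensions as fractions of
    the whole image. Layout is hypothetical, for illustration only.
    """
    tx, ty, w, h, conf = pred
    cell_w, cell_h = img_w / grid_size, img_h / grid_size
    cx = (col + tx) * cell_w          # box centre, image coordinates
    cy = (row + ty) * cell_h
    bw, bh = w * img_w, h * img_h     # box size in pixels
    # Return corner form (x1, y1, x2, y2) plus the confidence score.
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2, conf)
```

For a 7×7 grid on a 448×448 image, a prediction of `(0.5, 0.5, 0.2, 0.3, 0.9)` in cell (3, 2) decodes to a box centred at (160, 224), 89.6 pixels wide and 134.4 pixels tall. A real detector produces one such tuple (often several, one per anchor) for every cell in one forward pass.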

Common components include anchor boxes (predefined shapes serving as reference templates), non-maximum suppression (removing redundant overlapping predictions), and intersection over union (IoU) for measuring overlap. Evaluation uses mean Average Precision (mAP): the area under the precision–recall curve (Average Precision) computed per class at a given IoU threshold, then averaged across classes. The COCO benchmark further averages mAP over IoU thresholds from 0.5 to 0.95 in steps of 0.05. Object detection powers autonomous driving, surveillance, medical imaging, retail analytics, and robotics.
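IoU and non-maximum suppression are simple enough to sketch directly. The following is a minimal reference implementation, assuming boxes in `(x1, y1, x2, y2)` corner form; production detectors use vectorised versions of the same logic.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-
    scoring remaining box and discard any box whose IoU with it
    exceeds iou_thresh. Returns indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

For example, given two heavily overlapping detections of the same object and one distant detection, `nms([(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.9, 0.8, 0.7])` keeps the first and third boxes and suppresses the lower-scoring duplicate.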

Related terms: Convolutional Neural Network, Semantic Segmentation

Also defined in: Textbook of AI, Textbook of Medical AI