MS COCO (Microsoft Common Objects in Context; Lin, Maire, Belongie et al., ECCV 2014, arXiv:1405.0312) is the foundational benchmark for object detection, instance segmentation, keypoint detection and image captioning. Released by Microsoft in 2014 and maintained at https://cocodataset.org, it has served as the reference dataset for detection and vision-language work for over a decade.
Composition
COCO contains roughly 330,000 images, of which more than 200,000 are labelled. Annotations cover:
- 80 object categories (person, bicycle, dog, chair, etc.) with 2.5 million instance-segmentation masks.
- 91 stuff categories (sky, grass, road, water) added in COCO-Stuff.
- 17 keypoints per person for pose estimation, on roughly 250,000 person instances.
- 5 natural-language captions per image, written by Mechanical Turk workers under guidelines requiring at least eight words and accurate scene description.
Images were sourced from Flickr searches for everyday objects in non-iconic configurations, deliberately distinct from ImageNet's iconic, centred photographs. The dataset was released in stages (the 2014 release, the 2015 test set, the 2017 re-split), and those train/val/test splits remain in active use; most current work trains on the 2017 split (118K train / 5K val images).
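As a concrete view of this annotation layout, here is a minimal loading sketch using the official pycocotools API; the annotation path is a placeholder for a local copy of the 2017 release.

```python
from pycocotools.coco import COCO

# Placeholder path for a local copy of the 2017 instance annotations.
coco = COCO("annotations/instances_val2017.json")

# The 80 object categories (person, bicycle, dog, chair, ...).
cats = coco.loadCats(coco.getCatIds())
print(len(cats), "categories, e.g.:", [c["name"] for c in cats[:5]])

# All instance annotations for one image: each record carries a bounding
# box, a segmentation (polygon or RLE), and a category id.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
for ann in anns:
    mask = coco.annToMask(ann)  # binary instance-segmentation mask
    name = coco.loadCats(ann["category_id"])[0]["name"]
    print(name, ann["bbox"], "mask area:", mask.sum())
```

The keypoint and caption annotations use the same JSON schema and load through the same API, just from different annotation files.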
Models benchmarked on COCO
COCO is the canonical detection benchmark. R-CNN and its successors (Fast and Faster R-CNN), YOLO (all versions through YOLOv9), SSD, RetinaNet, Mask R-CNN, DETR, Deformable DETR, DINO and Grounding DINO are all benchmarked on it, and COCO AP is the headline number for modern detectors. The standard metric is mean Average Precision averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05 (denoted mAP or simply AP), computed on the COCO test-dev split via the official evaluation server.
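The same toolkit implements this metric locally. A minimal sketch, assuming detections have been exported to the standard results-JSON format (a list of {"image_id", "category_id", "bbox", "score"} records); the file names here are placeholders:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")  # ground truth (placeholder path)
coco_dt = coco_gt.loadRes("my_detections.json")       # hypothetical results file

ev = COCOeval(coco_gt, coco_dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()  # prints the 12 standard numbers: AP, AP50, AP75, AP_S/M/L, AR...
print("mAP@[.50:.95] =", ev.stats[0])
```

Note that official test-dev numbers come from the evaluation server, since the test annotations are withheld; a local COCOeval run like this applies to val2017.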
For captioning, COCO Captions powered the development of NIC (Vinyals et al. 2015), Show, Attend and Tell, Bottom-Up Top-Down attention, and the captioning evaluations of CLIP, BLIP, BLIP-2, Flamingo and modern multimodal LLMs.
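The caption annotations read through the same pycocotools API; a short sketch retrieving the captions for one validation image (the path is again a placeholder):

```python
from pycocotools.coco import COCO

caps = COCO("annotations/captions_val2017.json")  # placeholder path
img_id = caps.getImgIds()[0]
for ann in caps.loadAnns(caps.getAnnIds(imgIds=img_id)):
    print(ann["caption"])  # typically five human-written captions per image
```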
Licensing
COCO images are sourced from Flickr under a variety of Creative Commons licences; the annotations are released under CC BY 4.0. The annotations are freely usable for research and commercial purposes, but each image retains its original Flickr licence, some variants of which are non-commercial, so image-level terms should be checked before commercial use.
Limitations
COCO is dated: the 80 categories, fixed in 2014, under-represent fine-grained distinctions (there is a single "cell phone" class, and cars are not subdivided into types). Captions collected in 2014 reflect the implicit biases of annotators at the time. The test set has been de facto memorised by some recent models, raising contamination concerns. Newer detection benchmarks such as LVIS (more than 1,200 long-tail categories), Open Images Detection and Objects365 are now standard for state-of-the-art reporting, but COCO remains the universal reference benchmark because of its decade of comparison curves.
Related terms: ImageNet, Open Images, CLIP
Discussed in:
- Chapter 9: Neural Networks, Computer Vision