Semantic Segmentation provides pixel-level understanding by assigning a class label to every pixel in an image. The output is a label map of the same spatial resolution as the input, in which each pixel carries its predicted category (road, car, pedestrian, sky). This dense prediction is essential for applications requiring precise spatial understanding: autonomous driving, medical imaging (tumour delineation), augmented reality, and satellite imagery analysis.
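A minimal sketch of the output format described above: a segmentation network produces a per-class score at every pixel, and the label map is the per-pixel argmax over those scores. The shapes and the random scores here are illustrative, not taken from any real model.

```python
import numpy as np

# Hypothetical network output: a (num_classes, H, W) tensor of per-pixel
# class scores. 4 classes could stand for road, car, pedestrian, sky.
rng = np.random.default_rng(0)
num_classes, height, width = 4, 6, 8
logits = rng.standard_normal((num_classes, height, width))

# The label map has the same spatial resolution as the input:
# one class id per pixel, chosen by argmax over the class axis.
label_map = logits.argmax(axis=0)            # shape (H, W)

assert label_map.shape == (height, width)
assert label_map.min() >= 0 and label_map.max() < num_classes
```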
The foundational architecture is the fully convolutional network (FCN) of Long, Shelhamer, and Darrell (2015). They replaced classification networks' dense layers with 1×1 convolutions and added upsampling (transposed convolution) layers that restore spatial resolution. U-Net, developed for biomedical imaging, extended this with a symmetric encoder-decoder structure and long skip connections that concatenate encoder features with decoder features at matching resolutions. U-Net and its variants dominate medical image segmentation.
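The FCN conversion can be sketched numerically: a 1×1 convolution is just a linear map applied independently at every spatial position, so a classifier's dense layer can be reapplied convolutionally to feature maps of any size, then upsampled back to the input resolution. The nearest-neighbour upsampling below is a stand-in for the learned transposed convolution; all shapes are illustrative.

```python
import numpy as np

# Illustrative shapes: 16 feature channels, 4 output classes.
rng = np.random.default_rng(1)
channels, num_classes, H, W = 16, 4, 7, 9
features = rng.standard_normal((channels, H, W))
weights = rng.standard_normal((num_classes, channels))  # a former dense layer

# A 1x1 convolution is a matrix multiply over the channel axis at each
# pixel, yielding a coarse (num_classes, H, W) score map.
scores = np.tensordot(weights, features, axes=([1], [0]))
assert scores.shape == (num_classes, H, W)

# Upsample the score map by a factor of 2 (nearest-neighbour here,
# standing in for a learned transposed convolution).
upsampled = scores.repeat(2, axis=1).repeat(2, axis=2)
assert upsampled.shape == (num_classes, 2 * H, 2 * W)
```

U-Net's long skip connections would additionally concatenate encoder features onto the decoder path at each resolution before upsampling further.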
DeepLab uses dilated (atrous) convolutions and an Atrous Spatial Pyramid Pooling (ASPP) module to capture multi-scale context without losing spatial resolution. Instance segmentation (e.g. Mask R-CNN) goes further, distinguishing individual instances of the same class; panoptic segmentation unifies the semantic and instance tasks. The standard evaluation metric is mean Intersection over Union (mIoU): the per-class overlap between predicted and ground-truth regions divided by their union, averaged over classes. Transformer-based methods like SegFormer and Mask2Former increasingly rival or surpass pure CNN approaches, echoing the broader trend of attention supplementing or replacing convolution in computer vision.
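The mIoU metric can be computed directly from two label maps. A toy sketch on 2×4 maps with two classes:

```python
import numpy as np

# Toy predicted and ground-truth label maps (class ids 0 and 1).
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])
gt   = np.array([[0, 0, 0, 1],
                 [0, 1, 1, 1]])

# Per-class IoU: |prediction AND truth| / |prediction OR truth|.
ious = []
for c in [0, 1]:
    inter = np.logical_and(pred == c, gt == c).sum()
    union = np.logical_or(pred == c, gt == c).sum()
    ious.append(inter / union)

# Mean over classes: (3/4 + 4/5) / 2 = 0.775
miou = sum(ious) / len(ious)
```

In practice the intersections and unions are accumulated over a whole validation set before the per-class division, so that small images do not dominate the average.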
Related terms: Convolutional Neural Network, Object Detection
Discussed in:
- Chapter 11: CNNs — Semantic Segmentation
Also defined in: Textbook of AI