11.6 Semantic segmentation

Object detection, the subject of §11.5, draws a rectangle around each thing of interest and stamps a class label on it. That is plenty for many tasks (counting cars, locating faces, telling a robot that there is a mug somewhere in this rough region), but it is not enough when the shape of the object matters. A self-driving car needs to know exactly where the road ends and the kerb begins, not a bounding box that contains both. A surgeon planning a resection needs the exact outline of the tumour, not a rectangle that also contains liver, stomach and a couple of vessels. A satellite analyst measuring deforestation needs to know which pixels are forest and which are not. In all of these cases the appropriate output is not a box but a per-pixel class map: a label for every single pixel in the image.

This is the task of semantic segmentation. Given an input image $\mathbf{X} \in \mathbb{R}^{H \times W \times 3}$, produce an output $\mathbf{Y} \in \{1, \dots, K\}^{H \times W}$ in which every pixel carries one of $K$ class labels. The output has the same spatial resolution as the input. It is a dense prediction problem rather than a sparse one, and it demands a different kind of architecture from the box-and-label classifiers of the previous section.

The applications fall into a few familiar buckets. Autonomous driving uses semantic segmentation to label road, lane marking, pavement, vehicle, pedestrian, cyclist, traffic sign and sky, typically as one input to a perception stack that fuses LiDAR and radar. Medical imaging uses it to delineate organs, tumours, lesions and vessels in CT, MRI, ultrasound and microscopy, both to support diagnosis and to plan radiotherapy or surgery. Satellite and aerial imagery uses it for land-cover classification, building footprint extraction and crop-type mapping. Photography and creative tools use it for background removal, portrait-mode bokeh, sky replacement and object-aware editing. The architectures behind all of these have converged on a small family of designs: an encoder-decoder with skip connections (the U-Net lineage), a high-resolution backbone with dilated convolutions (the DeepLab lineage), and, increasingly, transformer-based models that treat segmentation as mask classification (Mask2Former, SAM).

Symbols Used Here
$\mathbf{X}$: input image, shape $(H, W, 3)$
$\mathbf{Y}$: segmentation map, shape $(H, W)$, each entry in $\{1, \dots, K\}$
$K$: number of classes

U-Net architecture

The single most influential segmentation architecture is U-Net, introduced by Olaf Ronneberger, Philipp Fischer and Thomas Brox in 2015 for biomedical image segmentation. A decade on it remains the default starting point for medical imaging and has reappeared, almost unchanged, as the noise-prediction backbone of latent diffusion models. The shape of the network, symmetric downsampling and upsampling paths joined by skip connections, gives the architecture its name.

The encoder is a stack of convolution-pool blocks that progressively halves the spatial resolution while doubling the channel count. The original U-Net used four such blocks: input → 64 channels at full resolution → 128 channels at half resolution → 256 channels at a quarter → 512 channels at an eighth → 1024 channels in a bottleneck at one-sixteenth. Each block is two 3×3 convolutions with ReLU activations followed by a 2×2 max pool. As we go deeper the receptive field grows and the features become more semantic: small filters in the first block respond to edges, while the large effective regions of the bottleneck respond to whole anatomical structures. Spatial precision, however, is lost along the way; by the time we reach the bottleneck a single feature vector represents a 16×16 patch of the input.

The decoder mirrors the encoder. Each decoder block performs a 2×2 transposed convolution that doubles the spatial resolution, then concatenates the upsampled feature map with the matching encoder feature map of the same spatial size, and finally runs two 3×3 convolutions to mix the two. After four such blocks the feature map is back at full resolution. A final 1×1 convolution maps the channel dimension to $K$ class logits and a per-pixel softmax produces the segmentation probabilities.

The crucial element is the skip connection at each scale. Without it, the decoder must reconstruct the location of every boundary from a heavily downsampled bottleneck feature, a hopeless task, because the spatial detail has been thrown away. The skip restores that detail by handing the high-resolution encoder features directly to the decoder, where they are concatenated alongside the upsampled deeper features. The decoder then has both: rich semantic context from below, and crisp spatial detail from across. The result is sharp object boundaries even for thin structures such as cell membranes or vessel walls.
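The encoder-decoder-skip structure described above maps almost line for line onto code. The following is a minimal sketch in PyTorch (an assumption; the document specifies no framework) of a two-level U-Net, a deliberately shrunken stand-in for the original four-level design. Channel counts follow the text, but unlike the original paper this sketch uses padded convolutions so that input and output resolutions match exactly.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in each U-Net block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net: enough to show encoder, bottleneck, decoder and skips."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(in_ch, 64)         # full resolution
        self.enc2 = double_conv(64, 128)           # half resolution
        self.bottleneck = double_conv(128, 256)    # quarter resolution
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = double_conv(256, 128)          # 128 upsampled + 128 from skip
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = double_conv(128, 64)           # 64 upsampled + 64 from skip
        self.head = nn.Conv2d(64, num_classes, 1)  # 1x1 conv -> per-pixel class logits

    def forward(self, x):
        e1 = self.enc1(x)                          # skip at full resolution
        e2 = self.enc2(self.pool(e1))              # skip at half resolution
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))  # upsample, concat skip, mix
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.head(d1)                       # (N, K, H, W) logits

logits = TinyUNet()(torch.randn(1, 3, 256, 256))   # -> torch.Size([1, 2, 256, 256])
```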

U-Net was originally trained on a tiny dataset, 30 electron-microscopy images of a Drosophila larva from the ISBI EM segmentation challenge, and it also won the ISBI 2015 cell tracking challenge by a large margin. Its sample efficiency, which is what made it practical for biomedical work where labelled data is always scarce, is a direct consequence of the skip connections: the decoder does not have to learn a generic upsampler from scratch, only to refine encoder features it already trusts. The architecture has since been extended to volumetric data (3D U-Net), nested skip pathways (U-Net++), gated skips (Attention U-Net) and transformer encoders (TransUNet, Swin-UNet). The self-configuring nnU-Net pipeline of Isensee and colleagues has been at the core of winning entries in a long list of medical-imaging challenges with what is essentially the original 2015 design, automatically tuned per dataset.

Loss functions

The default per-pixel loss is cross-entropy, averaged over all pixels:

$$ \mathcal{L}_{\text{CE}} = -\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{K} y_{ijk} \log \hat{p}_{ijk}, $$

where $y_{ijk}$ is the one-hot true label and $\hat{p}_{ijk}$ the softmax probability for class $k$ at pixel $(i,j)$. This is identical to ordinary classification, simply repeated at every spatial location.
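In a deep learning framework such as PyTorch (assumed here), the per-pixel cross-entropy needs no special code: the standard classification loss already averages over every spatial position when given an $(N, K, H, W)$ logit map and an $(N, H, W)$ integer label map. A minimal illustration with random tensors:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5, 128, 128)          # (N, K, H, W) raw scores from the network
labels = torch.randint(0, 5, (4, 128, 128))   # (N, H, W) integer class index per pixel

# cross_entropy applies softmax over the class dimension and averages over batch and pixels
loss = F.cross_entropy(logits, labels)
```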

Cross-entropy works well for balanced problems but fails badly when the foreground class is rare: a tumour that occupies 1% of the pixels, a thin vessel, a lane marking against a road. The loss is dominated by the trivial background class and the network learns to predict "background everywhere" before it learns to find the rare foreground. The standard remedy is Dice loss, derived from the Sørensen–Dice coefficient, which measures set overlap and is therefore invariant to class frequency:

$$ \mathcal{L}_{\text{Dice}} = 1 - \frac{2 |A \cap B|}{|A| + |B|} = 1 - \frac{2 \sum_i p_i g_i}{\sum_i p_i^2 + \sum_i g_i^2 + \epsilon}, $$

where $A$ is the predicted mask, $B$ the ground-truth mask, $p_i$ the predicted probability, and $g_i \in \{0, 1\}$ the ground-truth indicator; for hard 0/1 predictions $p_i^2 = p_i$, so the soft sum form reduces exactly to the set expression. The numerator counts agreement; the denominator normalises by total area. A small $\epsilon$ prevents division by zero on empty masks.

Most state-of-the-art segmenters use a combined loss, typically Dice plus cross-entropy, in equal weights, which behaves like cross-entropy where there is plenty of data and like Dice where the foreground is rare. Focal loss, met in §11.5, is sometimes substituted for cross-entropy when the imbalance is extreme. Boundary-aware losses (boundary loss, Hausdorff loss) explicitly upweight pixels near the predicted or true boundary, recognising that a one-pixel error at a boundary is qualitatively worse than a one-pixel error in the middle of a uniform region.
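A soft Dice loss and the Dice + cross-entropy combination can be written in a few lines. The sketch below follows the squared-denominator form given above, for the binary (foreground/background) case; the equal weighting of the two terms is the common default mentioned in the text, not a universal rule, and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(probs, target, eps=1e-6):
    # probs: (N, H, W) foreground probabilities; target: (N, H, W) 0/1 ground truth
    p = probs.flatten(1)
    g = target.flatten(1).float()
    intersection = (p * g).sum(dim=1)
    denom = (p * p).sum(dim=1) + (g * g).sum(dim=1) + eps
    return (1 - 2 * intersection / denom).mean()

def dice_plus_ce(logits, labels):
    # logits: (N, 2, H, W) two-class logits; labels: (N, H, W) with values {0, 1}
    ce = F.cross_entropy(logits, labels)
    fg_probs = logits.softmax(dim=1)[:, 1]           # foreground channel probabilities
    return ce + soft_dice_loss(fg_probs, labels)     # equal weights, the common default
```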

Worked example: cell segmentation

A concrete application: counting cells in a fluorescence microscopy image. The input is a $1024 \times 1024$ greyscale frame in which roughly 200 cells appear as bright blobs against a dark background. We want a binary mask of cell vs. background, which we will then post-process to count and measure individual cells.

We assemble a training set of perhaps 50 fully labelled images. We crop random $256 \times 256$ patches and augment them with random rotations, flips and small intensity shifts. We instantiate a standard four-level U-Net with two output channels, cell and background, and train it for 100 epochs with Adam at learning rate $10^{-4}$, using the combined Dice + cross-entropy loss. After a few hours of training on a single GPU the validation Dice score sits above 0.9.
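A skeleton of that training recipe might look like the following; it reuses the TinyUNet and dice_plus_ce sketches from earlier in this section (the two-level toy network standing in for the four-level one described above), and the random stand-in tensors are hypothetical placeholders for a real dataset of augmented $256 \times 256$ patches.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data: in practice a Dataset of random 256x256 crops with
# rotation/flip/intensity augmentation would go here (hypothetical).
patches = torch.randn(64, 1, 256, 256)            # greyscale microscopy patches
masks = torch.randint(0, 2, (64, 256, 256))       # 0 = background, 1 = cell
loader = DataLoader(TensorDataset(patches, masks), batch_size=8, shuffle=True)

model = TinyUNet(in_ch=1, num_classes=2)          # from the earlier architecture sketch
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(100):
    for x, y in loader:
        loss = dice_plus_ce(model(x), y)          # combined loss from the earlier sketch
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```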

At inference we slide the trained U-Net over the full $1024 \times 1024$ input in overlapping tiles, average the predictions in the overlap regions to avoid seam artefacts, and threshold at 0.5 to produce a binary mask. The mask now has the shape of every cell, but if two cells touch, they merge into a single connected component. A second post-processing step is therefore needed to separate touching instances: the classical choice is the watershed transform seeded by local maxima of the distance transform, which carves the merged blob along the line where the cells almost meet. Widely used biology tools build on the same two-stage idea: Cellpose pairs a U-Net-style network with predicted flow fields that are followed to group pixels into cells, and StarDist has the network predict star-convex polygons directly, in each case replacing the plain watershed with a learned instance cue.
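The classical watershed post-processing step can be done with standard SciPy and scikit-image tooling. A sketch, assuming the binary mask has already been produced by thresholding the tile-averaged U-Net output (the min_distance value is an illustrative parameter, tuned per dataset in practice):

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def separate_touching_cells(binary_mask, min_distance=10):
    """Split a binary cell-vs-background mask into labelled instances."""
    # Distance to the nearest background pixel: peaks sit near cell centres.
    distance = ndi.distance_transform_edt(binary_mask)
    # One seed per local maximum of the distance map, restricted to the foreground.
    peak_coords = peak_local_max(distance, min_distance=min_distance,
                                 labels=binary_mask.astype(int))
    seeds = np.zeros_like(binary_mask, dtype=int)
    seeds[tuple(peak_coords.T)] = np.arange(1, len(peak_coords) + 1)
    # Watershed on the inverted distance map carves merged blobs apart at the narrow neck.
    instances = watershed(-distance, markers=seeds, mask=binary_mask)
    return instances  # 0 = background, 1..N = individual cells
```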

Modern architectures: Mask2Former and SAM

Two recent shifts have changed what state-of-the-art segmentation looks like. The first is the move from per-pixel classification to mask classification, exemplified by Mask2Former (Cheng et al., 2022). Rather than predicting a class for every pixel directly, Mask2Former predicts a small set of binary masks together with a class label for each, and assembles the final segmentation by selecting and combining them. The architecture is transformer-based: a backbone produces multi-scale features, a pixel decoder upsamples them, and a transformer decoder attends to the features through learned mask queries. The same network handles semantic, instance and panoptic segmentation by changing only the matching loss. Mask2Former and its sibling OneFormer dominate the leaderboards on Cityscapes, ADE20K and COCO-Panoptic.
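The mask-classification formulation is easy to state concretely: with $N$ predicted masks and a class distribution for each, per-pixel semantic scores are obtained by summing the masks weighted by their class probabilities, then taking an argmax. A sketch of that final assembly step (tensor names, shapes and the number of queries are illustrative, not the Mask2Former code, which also carries a "no object" class):

```python
import torch

N, K, H, W = 100, 19, 256, 512                     # queries, classes, spatial size
class_logits = torch.randn(N, K)                   # one class distribution per predicted mask
mask_logits = torch.randn(N, H, W)                 # one binary mask (as logits) per query

# Per-pixel class scores: sum over masks of (class probability x mask probability).
semantic = torch.einsum("nk,nhw->khw",
                        class_logits.softmax(dim=-1),
                        mask_logits.sigmoid())
segmentation = semantic.argmax(dim=0)              # (H, W) semantic label map
```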

The second shift is the rise of foundation models for segmentation. The Segment Anything Model (SAM), released by Meta in 2023, is the segmentation analogue of GPT-3. It was trained on SA-1B, a dataset of more than a billion masks across eleven million images, generated by a human-in-the-loop annotation engine that bootstrapped from a smaller seed set. SAM accepts an image plus a prompt, a point click, a bounding box, a rough mask, or, in some variants, a text description, and returns a high-quality mask for the indicated object. It has been trained on so much data that it can segment objects it has never been told to recognise, including ones that are entirely outside the categories of any classification benchmark. This is zero-shot segmentation: no fine-tuning, no per-task data collection, no labelling at all from the user beyond a click.
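Prompting SAM through the released segment_anything package looks roughly like the following; the checkpoint filename, the loaded image and the click coordinates are placeholders, and the exact API may differ between package versions.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # checkpoint path: placeholder
predictor = SamPredictor(sam)

# image: H x W x 3 uint8 RGB array, assumed already loaded
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),           # one foreground click at (x, y)
    point_labels=np.array([1]),                    # 1 = foreground, 0 = background
    multimask_output=True,                         # SAM proposes several candidate masks
)
best_mask = masks[scores.argmax()]                 # boolean H x W mask for the clicked object
```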

SAM is now the default backbone for any application that needs interactive or open-vocabulary segmentation: image editing tools, video annotation pipelines, robot-perception research, and downstream medical models such as MedSAM. In medical imaging, hybrids that pair nnU-Net-style task-specific training with SAM 2 or MedSAM 2 backbones (nnSAM and related plug-and-play designs) are an active line of work and sometimes outperform either component alone. SAM 2, released in 2024, extends the same idea to video, propagating prompts through time so that a single click on the first frame of a clip produces a temporally consistent mask for the whole sequence. The release of SAM has had the same effect on the segmentation community that the release of GPT-3 had on NLP: a generation of bespoke, dataset-specific models has been replaced by prompts to a single foundation model.

Instance and panoptic segmentation

Three closely related tasks are worth distinguishing.

Semantic segmentation labels every pixel with a class but does not separate instances. All cat pixels are labelled "cat", whether they belong to one cat or three.

Instance segmentation separates instances: "cat 1" and "cat 2" each have their own mask. Background classes such as "sky" or "road" are usually not segmented at all; instance segmentation produces masks only for things, the countable foreground objects. Mask R-CNN (He et al., 2017) is the canonical CNN approach: it extends Faster R-CNN with a per-region mask head, a small fully convolutional network that predicts a binary mask within each detected box.

Panoptic segmentation, introduced by Kirillov and colleagues in 2018, combines the two. Every pixel is assigned both a semantic class and, for thing classes, a specific instance ID. It distinguishes things (countable: cars, people, dogs) from stuff (amorphous: sky, road, grass) and produces a single coherent labelling that is neither purely semantic nor purely instance-based. It is the most complete form of dense visual understanding and has become the canonical evaluation for autonomous-driving perception stacks.

The progression from semantic to instance to panoptic corresponds to progressively richer questions: what is here? which instance is this? both, for every pixel? Modern transformer-based architectures such as Mask2Former handle all three with a single network, distinguished only by the loss.
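Mask R-CNN, mentioned above as the canonical CNN approach to instance segmentation, ships pretrained in torchvision, so trying it takes a few lines. A minimal inference sketch (the random image and the 0.5 score threshold are placeholders; a real pipeline would load a photograph and tune the threshold):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()   # COCO-pretrained instance segmenter
image = torch.rand(3, 480, 640)                           # placeholder RGB image in [0, 1]

with torch.no_grad():
    pred = model([image])[0]                               # one dict per input image

keep = pred["scores"] > 0.5                                # confidence threshold (a common choice)
instance_masks = pred["masks"][keep, 0] > 0.5              # (M, H, W) boolean mask per instance
instance_labels = pred["labels"][keep]                     # COCO class index per instance
```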

Where segmentation is used

Semantic segmentation has moved from research curiosity to production tool in every domain that handles images. Self-driving cars use it to produce road, lane, vehicle and pedestrian masks at video frame rates. Medical imaging uses it for organ contouring in radiotherapy planning, tumour delineation in oncology, lesion measurement in dermatology and vessel segmentation in ophthalmology, and increasingly through MedSAM, a SAM variant fine-tuned on medical data. Earth observation uses it for land-cover classification, building footprint extraction (Microsoft and Meta have both released global building datasets generated this way), flood mapping after natural disasters and crop-type mapping for agricultural monitoring. Photography and creative tools use it for background removal, portrait-mode depth effects, sky replacement and object-aware retouching; this is where most consumers encounter segmentation, generally without realising it. Robotics uses it for scene parsing and grasp planning. The barrier to entry has fallen so far (a U-Net trained on a few hundred images, or a few clicks in SAM) that semantic segmentation is now a standard tool in the working toolkit of any team that processes images.

What you should take away

  1. Semantic segmentation labels every pixel with a class. The output is a label map at the same resolution as the input, not a bounding box. Pixel-level detail is required wherever shape matters.
  2. U-Net is the default architecture. Its symmetric encoder–decoder shape, joined by skip connections at each scale, gives the decoder both deep semantic context and crisp spatial detail, and works well even on small datasets.
  3. Use Dice or Dice + cross-entropy when classes are imbalanced. Per-pixel cross-entropy is dominated by the background when the foreground is rare; Dice loss is invariant to class frequency.
  4. Mask classification and foundation models are the new state of the art. Mask2Former unifies semantic, instance and panoptic segmentation under a single transformer-based design; SAM offers prompt-driven, zero-shot segmentation backed by a billion-mask dataset.
  5. Semantic, instance and panoptic differ in what they separate. Semantic merges instances of the same class; instance separates them; panoptic does both, distinguishing things from stuff. The choice depends on the downstream task, not the algorithm.
