Glossary

MedSAM

MedSAM, introduced by Ma and colleagues in Nature Communications in 2024, is a foundation model for promptable medical-image segmentation built by adapting Meta's Segment Anything Model (SAM) to the medical domain. The motivation is straightforward: SAM, trained on 1.1 billion natural-image masks, generalises well to held-out objects but performs poorly on CT, MRI, ultrasound, X-ray, histopathology and microscopy images, because medical images are distributionally far from web photographs (low contrast, greyscale, modality-specific noise, 3D context, atypical aspect ratios).

The architecture inherits SAM's three-component design: a heavyweight image encoder (ViT-Base by default, ~90M parameters) that produces a 256-channel $64\times 64$ feature embedding; a lightweight prompt encoder that converts user prompts (points, bounding boxes, masks) into prompt tokens; and a mask decoder (a two-layer transformer with cross-attention) that fuses image and prompt embeddings to output a segmentation mask plus an estimated IoU score. The prompt encoder represents points and box corners as positional encodings summed with learned embeddings that carry the prompt semantics (e.g. foreground vs background point, which box corner), while mask prompts are embedded with a small convolutional network.
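The sketch below traces that three-stage flow using the module names from Meta's open-source segment_anything package, with which MedSAM's ViT-B checkpoint is compatible. The checkpoint path, input tensor and box coordinates are illustrative placeholders; real usage would apply MedSAM's own preprocessing rather than random data.

```python
import torch
from segment_anything import sam_model_registry

# Load a ViT-B SAM backbone with a MedSAM-style checkpoint (path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="medsam_vit_b.pth")
sam.eval()

image = torch.randn(1, 3, 1024, 1024)               # stand-in for a preprocessed 1024x1024, 3-channel slice
box = torch.tensor([[100.0, 150.0, 400.0, 480.0]])  # one bounding-box prompt (x0, y0, x1, y1)

with torch.no_grad():
    # 1. Heavyweight image encoder: run once per image, reusable across prompts.
    image_embedding = sam.image_encoder(image)       # (1, 256, 64, 64) feature embedding

    # 2. Lightweight prompt encoder: box corners become sparse prompt tokens.
    sparse_emb, dense_emb = sam.prompt_encoder(points=None, boxes=box, masks=None)

    # 3. Two-layer transformer mask decoder: fuses image and prompt embeddings.
    low_res_logits, iou_pred = sam.mask_decoder(
        image_embeddings=image_embedding,
        image_pe=sam.prompt_encoder.get_dense_pe(),
        sparse_prompt_embeddings=sparse_emb,
        dense_prompt_embeddings=dense_emb,
        multimask_output=False,
    )

print(low_res_logits.shape, iou_pred.shape)          # (1, 1, 256, 256) mask logits and a (1, 1) IoU estimate
```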

To create MedSAM the team curated MedSAM-1M, a corpus of over 1.5 million image–mask pairs spanning 10 imaging modalities and more than 30 cancer types, by aggregating dozens of public datasets and converting volumetric data to 2D slices. They then fine-tuned the image encoder and mask decoder (keeping the prompt encoder frozen) on bounding-box prompts only, a deliberate choice: in clinical practice radiologists already draw rough rectangles around lesions, whereas point prompts are ambiguous in 3D contexts. The training loss is an unweighted sum of Dice loss (robust to class imbalance) and cross-entropy loss, $\mathcal{L} = \mathcal{L}_{\text{Dice}} + \mathcal{L}_{\text{CE}}$, optimised with AdamW at $\eta = 10^{-4}$.
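As a rough illustration of that objective, the following sketch implements an unweighted Dice plus cross-entropy loss for binary masks in PyTorch. The tensor shapes and smoothing constant are assumptions for the example, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice loss on sigmoid probabilities; logits and target are (B, 1, H, W)."""
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum(dim=(1, 2, 3))
    denom = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

def segmentation_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Unweighted sum of the Dice and cross-entropy terms (binary case, so BCE with logits).
    return dice_loss(logits, target) + F.binary_cross_entropy_with_logits(logits, target)

# Optimiser as described above: AdamW with a 1e-4 learning rate.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```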

On 86 internal and external validation tasks MedSAM matched or exceeded modality-specific specialist networks, including nnU-Net trained on the same labelled data, when given the same bounding-box prompt at inference. Crucially, performance held up on completely unseen modalities (e.g. retinal OCT, dental X-ray), demonstrating zero-shot transfer. The trade-off is that MedSAM is interactive by design: it requires a human-supplied prompt rather than running fully automatically, which suits clinical workflows where a radiologist is already in the loop but rules out fully unsupervised pipelines.
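To make the interactive workflow concrete, here is a minimal prompted-inference sketch using the SamPredictor helper from the segment_anything package with a MedSAM ViT-B checkpoint. The image array, box coordinates and checkpoint path are placeholders, and the MedSAM repository ships its own preprocessing and inference script, so treat this as an illustration of the box-prompt loop rather than the reference pipeline.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="medsam_vit_b.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

# Stand-in for a preprocessed CT/MRI slice converted to 3-channel uint8.
image = np.zeros((512, 512, 3), dtype=np.uint8)
predictor.set_image(image)                       # runs the heavy image encoder once per image

# One rough rectangle around the lesion, as a clinician would draw it.
box = np.array([120, 140, 360, 400])             # (x0, y0, x1, y1) in pixel coordinates
masks, iou_scores, _ = predictor.predict(box=box, multimask_output=False)

print(masks.shape, iou_scores)                   # (1, 512, 512) boolean mask and its estimated IoU
```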

Subsequent work has explored MedSAM-2, which tracks structures through volumetric stacks as if they were video; SAM-Med3D, with native 3D positional encodings; and various adapter-based fine-tunes (LoRA-SAM, Med-SA) that update only a small fraction of parameters. MedSAM has become a popular base for downstream tooling: annotation accelerators, semi-automatic labelling, and human-in-the-loop active learning, even where final inference uses a smaller specialist network.

Related terms: U-Net, nnU-Net, Foundation Model, Vision Transformer
