Glossary

Med-PaLM

Med-PaLM is a family of medical large language models from Google Research, first announced by Singhal et al. in 2022 and extended in 2023 (Med-PaLM 2) and 2024 (Med-Gemini). The original Med-PaLM used PaLM 540B as its base model and applied instruction prompt tuning, a parameter-efficient adaptation that prepends a small set of soft prompt tokens, learned on a curated medical dataset, to specialise the model for clinical question answering without disturbing the underlying weights.
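The core idea of instruction prompt tuning can be seen in a few lines: the base model's embedding table stays frozen, and only a short block of learned vectors prepended to each input receives gradients. A minimal sketch, with toy dimensions that are not those of PaLM 540B and no training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions for illustration only.
VOCAB, D_MODEL, N_SOFT = 1000, 64, 8

token_embedding = rng.normal(size=(VOCAB, D_MODEL))  # frozen base-model table
soft_prompt = rng.normal(size=(N_SOFT, D_MODEL))     # the only trainable tensor

def embed_with_soft_prompt(token_ids):
    """Prepend learned soft-prompt vectors to frozen token embeddings.

    The frozen model then attends over N_SOFT extra positions; during
    tuning, gradients would flow only into `soft_prompt`.
    """
    tokens = token_embedding[token_ids]           # (seq, d_model), frozen
    return np.concatenate([soft_prompt, tokens])  # (N_SOFT + seq, d_model)

seq = embed_with_soft_prompt(np.array([3, 14, 159]))
print(seq.shape)  # (11, 64): 8 soft-prompt positions + 3 token positions
```

Because only `soft_prompt` is updated, the adaptation adds a tiny number of parameters relative to the base model, which is what makes the approach practical at 540B scale.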

The benchmark suite the team assembled, MultiMedQA, became the de facto evaluation for medical LLMs: it bundles MedQA (USMLE multiple-choice), MedMCQA (Indian AIIMS/NEET questions), PubMedQA (yes/no/maybe over abstracts), MMLU clinical subsets, LiveQA and MedicationQA (consumer questions), and a new HealthSearchQA set of 3,173 commonly searched medical queries. Critically, the authors built a rubric-based human evaluation, scored by clinicians along axes of factuality, harm, bias, scientific consensus, completeness and helpfulness, recognising that exact-match accuracy on multiple-choice questions is a poor proxy for clinical safety.
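The shape of such a rubric-based evaluation is simple to sketch: each clinician rates an answer along the named axes, and scores are aggregated per axis rather than collapsed into one accuracy number. The record structure and scores below are illustrative, not the paper's rubric or data:

```python
from statistics import mean

# Axes named in the MultiMedQA human evaluation; the 0/1 scoring
# scheme here is a simplification for illustration.
AXES = ("factuality", "harm", "bias", "scientific_consensus",
        "completeness", "helpfulness")

def aggregate(ratings):
    """Mean clinician score per axis across rated answers."""
    return {axis: mean(r[axis] for r in ratings) for axis in AXES}

ratings = [
    {"factuality": 1, "harm": 1, "bias": 1, "scientific_consensus": 1,
     "completeness": 0, "helpfulness": 1},
    {"factuality": 1, "harm": 0, "bias": 1, "scientific_consensus": 1,
     "completeness": 1, "helpfulness": 1},
]
summary = aggregate(ratings)  # e.g. harm averages 0.5 across the two raters
```

Keeping the axes separate is the point: an answer can be factually correct yet incomplete or potentially harmful, and a single accuracy score hides that distinction.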

Original Med-PaLM (Dec 2022) scored 67.6% on MedQA USMLE-style questions, the first LLM to clear the roughly 60% passing threshold. Eight months later, Med-PaLM 2, built on PaLM 2 with an ensemble-refinement prompting strategy (sample multiple chain-of-thought reasonings, then condition on them to produce a final answer) and a held-in dataset of curated medical demonstrations, reached 86.5% on MedQA, performance the authors characterised as expert-physician level on this particular benchmark. The improvement came not just from a stronger base model but from chain-of-thought prompting, self-consistency and ensemble refinement working together.
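The self-consistency part of this recipe is easy to sketch: sample several chain-of-thought completions at temperature > 0 and majority-vote the final answers. The sampler below is a deterministic stand-in for real LLM calls; ensemble refinement, noted in the docstring, replaces the vote with a second conditioning pass:

```python
from collections import Counter
from itertools import cycle

# Fake sampled answers standing in for temperature>0 LLM completions;
# a real system would call the model once per reasoning path.
_fake_paths = cycle(["B", "B", "A", "B", "C", "B"])

def sample_cot(question):
    """Stub: return the final answer of one sampled chain-of-thought."""
    return next(_fake_paths)

def self_consistency(question, n=11):
    """Self-consistency: sample n reasoning paths, majority-vote the answer.

    Ensemble refinement goes one step further: instead of a plain vote,
    the sampled reasonings are placed back into the prompt and the model
    conditions on them to write a single refined final answer.
    """
    votes = Counter(sample_cot(question) for _ in range(n))
    return votes.most_common(1)[0][0]

answer = self_consistency("Which option is correct?")  # "B" (7 of 11 votes)
```

The vote filters out reasoning paths that wander to a wrong conclusion, which is why self-consistency alone lifts multiple-choice accuracy even before refinement is added.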

Subsequent generations extended the family in three directions. Med-PaLM Multimodal added vision (interpreting chest X-rays, mammograms, dermatology photos and genomic plots) by adapting PaLM-E's perception encoder. AMIE (Articulate Medical Intelligence Explorer) layered on a self-play diagnostic-dialogue loop and outperformed primary-care physicians in randomised text-based OSCE-style consultations. Med-Gemini rebased the line on Gemini 1.5, integrating long-context retrieval and tool use over web search, FHIR EHRs and genomic databases.

Med-PaLM is significant for three reasons. First, it demonstrated that general foundation models could, with light medical adaptation, exceed bespoke clinical NLP systems built over decades. Second, it forced the field to take rubric-based clinician evaluation seriously rather than chasing benchmark accuracy. Third, it crystallised the regulatory and deployment debate around clinical LLMs: Google has not released Med-PaLM weights, instead offering it through restricted partner pilots, on the grounds that a model capable of producing plausible-sounding but unsafe medical advice cannot be openly distributed without provenance and oversight.

Related terms: Transformer, Foundation Model, Chain-of-Thought, RLHF
