Exercises
1. Distinguish the FDA's 510(k), De Novo and PMA pathways for clinical AI authorisation. Give one example of an AI device cleared under each.
2. Explain why the 2017 dermatology AI of Esteva and colleagues, despite matching dermatologist performance on its test set, would not by itself constitute adequate evidence for clinical deployment.
3. Describe the U-Net architecture of Ronneberger, Fischer and Brox (2015). Why are the skip connections between encoder and decoder essential?
4. The nnU-Net framework "self-configures" for new datasets. What configuration choices does it automate and what does it leave to the user?
5. Compare the Evoformer and structure-module components of AlphaFold 2. Which contributes most to the system's accuracy and why?
6. AlphaFold 3 generalised AlphaFold 2 to handle ligands and nucleic acids. Identify two technical changes required to support this.
7. Wong and colleagues (2021) found the Epic Sepsis Model identified only 7% of sepsis cases not already known to clinicians. Discuss what this implies for the value claim of the underlying ML model.
8. Define distribution shift and concept drift. Give one clinical example of each.
9. Explain the bias mechanism that Obermeyer and colleagues (2019) identified in the risk-stratification algorithm used by US health systems. What proxy variable was the source of the problem?
10. Summarise the trade-offs between Tesla's vision-only sensor architecture and Waymo's lidar-camera-radar architecture for autonomous driving.
11. Define the disengagement rate. Why is it an imperfect comparator across companies and operational design domains?
12. Compare behaviour cloning and reinforcement learning as approaches to robot policy learning. Under what conditions does each excel?
13. Diffusion policy (Chi et al., 2023) handles multimodal demonstration data better than a deterministic regression policy. Explain why.
14. RT-2 used a 55-billion-parameter vision-language model as its backbone, fine-tuned on robot data. What did this give the system that RT-1 (trained from scratch) did not have?
15. GraphCast (DeepMind, 2023) outperforms ECMWF's IFS on most weather metrics for medium-range forecasts. What features of the ML approach drive the advantage, and what concerns remain about replacing physics-based NWP with ML?
16. NeuralGCM is a hybrid model combining a differentiable atmospheric core with neural-network parameterisations. Why might this hybrid approach be preferred for climate (decadal-scale) prediction over pure ML?
17. GNoME predicted 2.2 million stable inorganic crystals; Berkeley's A-Lab synthesised 41 of 58 attempted candidates. What does this success rate tell us about the predictive accuracy of GNoME and about the synthesis bottleneck?
18. SWE-Bench Verified scores progressed from roughly 2% in 2023 to over 75% by 2025. What does an "AI software engineer" scoring 75% on SWE-Bench actually mean about its ability to do real software engineering work?
19. Describe the multistage architecture of a modern recommendation system (retrieval, ranking, re-ranking). Why is this decomposition used rather than a single end-to-end model?
20. The two-tower retrieval model encodes users and items separately into a shared embedding space. Why is this architecture critical for serving recommendations at scale?
21. Brynjolfsson, Li and Raymond (2023) found a 14% productivity gain from generative AI in contact-centre work, with the largest effect on novice workers. Propose two mechanisms that would explain this skill gradient.
22. Describe the legal failure mode in Mata v. Avianca (2023). What process change would have prevented it?
23. The EU AI Act classifies most medical-AI software as "high-risk". Explain what additional obligations this imposes beyond the EU MDR.
24. AlphaProof and AlphaGeometry 2 (DeepMind, 2024) achieved silver-medal IMO performance using a neural language model paired with a symbolic verifier. Why is this hybrid architecture better suited to formal proof than a pure neural language model?
25. Explain why Daron Acemoglu's macroeconomic estimates of AI's GDP impact (~0.93% per decade) are substantially smaller than the headline figures from productivity studies (14% in contact centres, 55% for some software tasks). Reconcile the two.
26. Why might LLMs be more disruptive to entry-level white-collar work than to senior roles? Use the concept of "task-level" versus "occupation-level" automation in your answer.
27. Audit a published clinical AI paper of your choice. Evaluate it against (a) external validation, (b) demographic subgroup analysis, (c) prospective evaluation, (d) failure-mode characterisation. How many of these does it report?
28. The author argues that the "best deployments are administrative". Critique this position. Identify a clinical use case where you believe non-administrative deployment is justified, and articulate what evidence would justify it.
29. Suno and Udio were sued by the major record labels in June 2024 for training on copyrighted recordings. Outline the fair-use arguments on each side. How does the early ruling in Bartz v. Anthropic (2025) bear on this?
30. The 2023 WGA and SAG-AFTRA strikes produced collective-bargaining agreements that explicitly restrict AI use. Identify two specific provisions of these agreements and explain their economic logic.
31. Identify three distribution shifts that could plausibly degrade the performance of an AI dermatology classifier deployed across NHS primary-care clinics: in scanner/camera, in patient population, in clinical workflow. For each, propose a mitigation.
32. Choose one of: weather, materials, drug discovery, mathematics. Argue, in 300 words, whether the role of AI in that domain is more accurately described as "transformative" or "incrementally accelerating".
33. Construct a hypothetical specification for the FDA submission of an AI-based diabetic retinopathy screening device. Cover (a) intended use, (b) target population, (c) study design, (d) primary endpoint, (e) post-market surveillance plan.
Selected solutions
Exercise 2. Esteva and colleagues evaluated on a held-out test set drawn from the same distribution as training (largely Stanford and Edinburgh databases of mostly lighter-skin-tone photographs in standardised lighting). Clinical deployment requires (a) external validation across diverse patient populations and clinical settings; (b) prospective evaluation of how the AI affects clinical decisions and patient outcomes, not merely classification accuracy in isolation; (c) demographic subgroup analysis, particularly across Fitzpatrick skin types I–VI; (d) integration testing in the actual clinical workflow; (e) post-market surveillance plan. The 2017 paper was an important demonstration of feasibility; it was not by itself sufficient evidence for deployment.
Exercise 3. U-Net is an encoder–decoder architecture: the encoder applies 3×3 convolutions with max pooling at each resolution, halving spatial dimensions and doubling feature channels. The decoder applies up-convolutions, doubling spatial dimensions and halving channels. The skip connections concatenate encoder feature maps to the corresponding decoder layers at the same resolution. The skip connections are essential because segmentation requires both high-level semantic information (what is in the image, provided by the deep encoder features) and fine spatial detail (where the boundaries are, preserved in the early encoder features). Without the skips, the upsampling decoder would have to reconstruct boundary detail from coarse semantic features alone, which is fundamentally underdetermined.
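The channel arithmetic of the skip pathway can be made concrete with a small bookkeeping sketch. The function name and defaults are illustrative (the classic configuration uses a base width of 64 and four pooling steps); it tracks only channel counts, ignoring the few-pixel spatial loss of unpadded 3×3 convolutions:

```python
def unet_channel_trace(base=64, depth=4):
    """Trace decoder channel counts in a U-Net.

    Encoder levels double channels at each pooling step; the decoder
    halves them with each up-convolution, then concatenation with the
    matching encoder (skip) feature map doubles them again before the
    convolution block reduces back to the skip's width.
    Returns a list of (upconv_ch, skip_ch, concat_ch, out_ch) per level.
    """
    enc = [base * 2**d for d in range(depth + 1)]   # e.g. [64, 128, 256, 512, 1024]
    ch = enc[-1]                                    # bottleneck channels
    trace = []
    for skip in reversed(enc[:-1]):
        up = ch // 2                                # up-conv halves channels
        concat = up + skip                          # skip concatenation doubles them
        ch = skip                                   # conv block reduces to skip width
        trace.append((up, skip, concat, ch))
    return trace
```

The concatenation column is where the decoder regains access to fine boundary detail: without it, each decoder level would see only the `upconv_ch` features derived from the coarse bottleneck.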
Exercise 5. The Evoformer processes a multiple-sequence-alignment (MSA) representation and a pair representation jointly, using both row-wise and column-wise attention on the MSA and triangle-multiplicative and triangle-attention updates on the pair representation. The structure module takes the Evoformer outputs and produces explicit 3D coordinates via invariant point attention. Ablations in the original Jumper et al. (2021) paper indicate that the Evoformer contributes the larger share of accuracy: the pair representation it produces, particularly the triangle updates which encode geometric consistency, is the source of AlphaFold 2's improvement over its predecessors. The structure module converts this representation to coordinates but the geometric reasoning happens upstream.
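The triangle-multiplicative idea can be sketched in a few lines once the gating, layer normalisation and output projection of the real Evoformer block are stripped away; everything here is a simplification for intuition, not the AlphaFold 2 implementation:

```python
import numpy as np

def triangle_multiply_outgoing(z, W_a, W_b):
    """Simplified triangle multiplicative update ("outgoing" edges).

    z is the pair representation of shape (N, N, c): one feature vector
    per residue pair (i, j). Each edge (i, j) is updated from edges
    (i, k) and (j, k) summed over all third residues k, so the update
    checks edge i-j against every triangle it participates in. This is
    what lets the network enforce geometric consistency (e.g. the
    triangle inequality on distances) in representation space.
    """
    a = z @ W_a                              # "left" edge features, (N, N, c)
    b = z @ W_b                              # "right" edge features, (N, N, c)
    return np.einsum('ikc,jkc->ijc', a, b)   # sum over the third residue k
```

The structure module then reads coordinates off this representation; as the solution notes, the geometric reasoning has already happened in updates like this one.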
Exercise 9. Obermeyer and colleagues (2019) found that the algorithm used healthcare cost as a proxy for healthcare need. Black patients with the same medical need (measured by chronic condition count, biomarker abnormalities) incurred lower healthcare costs than white patients with equivalent need, owing to access barriers, distrust of the medical system and other factors. The algorithm therefore systematically under-classified Black patients as needing high-risk care management. The fix, re-anchoring the prediction target on direct measures of medical need rather than cost, recovered a substantial fraction of the missing risk signal for Black patients. The general lesson is that proxy choice in clinical AI is a clinical safety decision, not a technical one.
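The proxy mechanism can be reproduced in a toy simulation (all distributions and numbers here are invented for illustration): give two groups identical need distributions, let one incur lower cost for the same need, then flag the top decile by cost, as a cost-trained model effectively does.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
group = rng.integers(0, 2, n)             # 0 = majority, 1 = under-served group
need = rng.gamma(2.0, 1.0, n)             # true medical need: same for both groups
access = np.where(group == 1, 0.6, 1.0)   # access barriers suppress utilisation
cost = need * access * rng.lognormal(0.0, 0.3, n)

# The "algorithm": flag the costliest 10% for care management
# (cost is the proxy target, exactly the design Obermeyer et al. studied).
flagged = cost >= np.quantile(cost, 0.90)

# Compare flagging rates at equal true need (top decile of need).
hi_need = need >= np.quantile(need, 0.90)
rate_majority = flagged[hi_need & (group == 0)].mean()
rate_underserved = flagged[hi_need & (group == 1)].mean()
```

With identical need, the under-served group is flagged far less often, purely because the proxy (cost) is deflated by access barriers; re-anchoring the target on `need` removes the gap by construction.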
Exercise 11. The disengagement rate measures how often a human safety driver assumes manual control of an autonomous vehicle, reported either as disengagements per mile or, inversely, as miles per disengagement. It is an imperfect comparator because: (a) companies define "disengagement" differently: some count any human action, others only safety-relevant interventions; (b) the metric depends heavily on the operational design domain: a disengagement on a highway in clear weather is not comparable to one on a residential street in rain; (c) it counts disengagements but not near-misses that did not require human intervention; (d) selection effects: companies test where their systems work best, biasing the comparison; (e) reporting requirements differ across jurisdictions, making cross-comparison difficult. Despite these issues, the order-of-magnitude gap between Waymo (over 20,000 miles per disengagement) and consumer Tesla FSD (roughly 100–1,000 miles per disengagement) is meaningful.
Exercise 13. Demonstration data for manipulation tasks is often multimodal: there are several valid ways to grasp a cup, several valid trajectories to a goal, several valid orderings of subtasks. A deterministic regression policy trained to minimise mean-squared-error on this data converges to the mean of the demonstrations, which can lie in an invalid region (an "average" of two valid grasps may be no grasp at all). Diffusion policy models the distribution of demonstration actions via denoising diffusion, sampling action sequences rather than predicting a single point. At inference, sampling produces one of the demonstrated modes rather than their mean, preserving the multimodality.
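A one-dimensional toy makes the mean-collapse argument concrete. The numbers are invented, and resampling the empirical distribution stands in for a trained diffusion model's sampler; the point is only the contrast between the MSE-optimal constant and a sampled mode:

```python
import numpy as np

rng = np.random.default_rng(0)
# Demonstrations: half the experts grasp from the left (action near -1),
# half from the right (action near +1). Both modes are valid.
actions = np.concatenate([rng.normal(-1, 0.05, 500),
                          rng.normal(+1, 0.05, 500)])

# A deterministic MSE regressor converges to the conditional mean...
mse_optimal = actions.mean()     # ~0.0: between the modes, not a valid grasp

# ...whereas a generative policy draws from the demonstration distribution
# and lands on one of the modes.
sample = rng.choice(actions)
```

The MSE-optimal action sits near 0, in the invalid gap between the grasps; the sample lands near one of the demonstrated modes, which is the behaviour diffusion policy preserves.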
Exercise 16. Pure ML weather models trained on historical reanalysis data have no guarantee of physical conservation properties (mass, momentum, energy) and tend to drift or fail when extrapolated beyond the timescales seen in training. Climate prediction requires multi-decade integrations during which physical conservation matters and during which the system must respond correctly to forcings (CO₂ levels, solar variation) outside the training distribution. NeuralGCM's hybrid approach, a differentiable physics-based dynamical core (which guarantees the conservation laws by construction) with neural-network parameterisations of sub-grid-scale processes (where ML provides accuracy improvements over hand-tuned schemes), combines the trustworthiness of physics with the accuracy of ML, and remains stable for climate-timescale runs. Pure ML approaches to climate prediction remain an open research area.
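The "conservation by construction" pattern can be sketched in one dimension. This is a stand-in for the NeuralGCM design, not its implementation: the conservative core here is just a periodic shift (which exactly preserves the domain total, as a real dynamical core preserves mass), and `nn_tendency` is any learned correction.

```python
import numpy as np

def hybrid_step(state, dt, nn_tendency):
    """One toy hybrid timestep: conservative transport plus a learned
    sub-grid tendency whose spatial mean is removed, so the ML term
    cannot create or destroy the conserved quantity no matter what
    the network outputs."""
    core = np.roll(state, 1)          # conservative "dynamical core" step
    corr = nn_tendency(core)          # learned sub-grid correction
    corr = corr - corr.mean()         # project onto mean-zero: conservation holds
    return core + dt * corr
```

However badly the network extrapolates outside its training distribution, the domain total is invariant under `hybrid_step`; a pure ML model has no such structural guarantee over a multi-decade integration.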
Exercise 18. A score of 75% on SWE-Bench Verified means the system can generate a passing patch for 75% of a curated set of solvable real-world GitHub issues. It does not mean the system can do 75% of real software engineering work. SWE-Bench issues are filtered for solvability with limited cross-file scope, are accompanied by clear test signals, and cover bug fixing more than feature design. Real software engineering involves extensive scoping (figuring out what to build), debugging without a clear test signal, navigating evolving requirements, security and operational considerations, code review and human collaboration. Productivity studies suggest current AI software engineers operate as pair-programming assistants for experienced developers and as full task automation only for narrow, well-specified work. The benchmark progress is real but does not generalise straightforwardly to occupational replacement.
Exercise 20. A two-tower architecture encodes users and items separately into a shared embedding space, $\mathbf{u} = f_U(\text{user features})$ and $\mathbf{v} = f_V(\text{item features})$, with relevance score $s(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v}$ (or cosine similarity). The architecture is critical at scale because item embeddings $\mathbf{v}$ can be precomputed for the entire catalogue (often hundreds of millions of items) and indexed in an approximate nearest-neighbour data structure (FAISS, ScaNN, HNSW). At serving time, only the user tower must run; given $\mathbf{u}$, ANN search retrieves the top-$k$ items in sublinear time (roughly $O(\log N)$ for graph-based indexes such as HNSW). A cross-encoder model that takes both user and item as input would require computing $N$ cross-encoder forward passes per query, which is infeasible at that scale. The two-tower architecture trades a small accuracy loss (the dot-product score is less expressive than a cross-encoder) for a vast efficiency gain that makes scale feasible.
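A minimal sketch of the serving path, with brute-force search standing in for the ANN index and invented shapes and names throughout:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items = 32, 10_000

# Item-tower outputs: computed offline for the whole catalogue, then indexed.
item_emb = rng.normal(size=(n_items, d))
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def retrieve(user_emb, k=10):
    """Serving path: one user-tower forward pass produced `user_emb`;
    the rest is a similarity search over precomputed item embeddings.
    Brute force here; FAISS/ScaNN/HNSW make this sublinear in production."""
    scores = item_emb @ user_emb                 # dot-product relevance
    top = np.argpartition(-scores, k)[:k]        # unordered top-k candidates
    return top[np.argsort(-scores[top])]         # exact ordering within the top-k
```

Note what is *not* recomputed per query: the 10,000 item embeddings. That asymmetry, not the scoring function itself, is what makes the architecture scale.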
Exercise 22. In Mata v. Avianca, lawyer Steven Schwartz submitted a brief in federal court citing six cases generated by ChatGPT. The cases did not exist; ChatGPT had hallucinated plausible-sounding case names, citations and quotations. Schwartz had not verified the citations. The court fined Schwartz, his co-counsel and their firm $5,000, and the case became the canonical AI-hallucination cautionary tale. Process changes that would have prevented it: (a) mandatory citation verification in an authoritative database (Westlaw, LexisNexis) before submission; (b) use of retrieval-grounded LLMs (the modern Harvey, CoCounsel and Lexis+ AI products retrieve from authoritative legal databases rather than generating from parametric memory); (c) human review of all AI-generated work product, with explicit certification of verification before submission; (d) court rules requiring AI-generated content to be disclosed and verified, which many US federal judges now impose through standing orders.
Exercise 25. Acemoglu's macroeconomic estimate (0.93% GDP gain per decade from AI) and the per-task productivity studies (14–55% on specific tasks) are reconciled by recognising several gaps: (a) Only a fraction of tasks within an occupation are AI-amenable; the headline gains apply only to those tasks. (b) Only a fraction of occupations are heavily exposed; many of the largest employment categories (healthcare, education, in-person services, construction, transport) are limited in their AI exposure. (c) Per-task gains are realised only after substantial organisational change (workflow redesign, training, integration), whose costs are incurred over years. (d) Many AI-augmented activities (creating better marketing copy, writing more emails) may not increase real economic output even if they increase task throughput. (e) Macro effects net against displacement-induced unemployment and wage declines in affected sectors. Acemoglu's estimate represents the net macro signal after these dilution and offsetting effects; the productivity studies represent the un-discounted micro signal in narrow contexts. Both can be correct; they answer different questions.
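The dilution argument is, at bottom, a multiplication of fractions. The placeholder values below are invented for illustration, not Acemoglu's figures; the point is only that double-digit per-task gains shrink to well under 1% once each discount is applied:

```python
# Illustrative decomposition of how a large per-task gain becomes a
# small macro number (every value here is an assumption, not data).
per_task_gain   = 0.25   # e.g. the 14-55% range observed on amenable tasks
share_of_tasks  = 0.20   # fraction of tasks within exposed occupations AI can do
share_exposed   = 0.30   # fraction of economy-wide labour in exposed occupations
adoption_decade = 0.50   # fraction of the potential actually realised in ten years

gdp_effect = per_task_gain * share_of_tasks * share_exposed * adoption_decade
# gdp_effect is 0.0075, i.e. about 0.75% over the decade: the same order
# of magnitude as the ~0.93% macro estimate.
```

Disagreements between optimists and sceptics are largely disagreements about which of these four factors is being underestimated.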
Exercise 33. A specification for FDA submission of an AI diabetic retinopathy screening device should cover at minimum:
(a) Intended use: autonomous detection of more-than-mild diabetic retinopathy in patients aged 18 and over with a clinical diagnosis of diabetes mellitus, in primary-care or community-screening settings, on a specified set of fundus camera models.
(b) Target population: adults with type 1 or type 2 diabetes meeting screening guidelines (typically annual screening from diagnosis for type 2, from 5 years post-diagnosis for type 1), excluding patients with media opacities precluding retinal imaging.
(c) Study design: prospective, multi-site, single-arm pivotal study against a reference standard of ophthalmologist-graded reading-centre interpretation of widefield retinal photography (or, where available, optical coherence tomography). Sites must include diverse geography and demographics. Sample size powered for primary endpoint with margin allowing demographic subgroup analysis.
(d) Primary endpoint: sensitivity and specificity for detecting more-than-mild diabetic retinopathy, with pre-specified non-inferiority margins against a clinical reference standard. Subgroup analyses by ethnicity, age, sex, image quality and camera model, with pre-specified equivalence margins. Image-quality failure rate as a secondary endpoint (the rate at which the device declines to provide a result) is critical to characterise.
(e) Post-market surveillance: prospective monitoring of performance at deployed sites with quarterly aggregate reporting; bias auditing across demographic groups at six-monthly intervals; mandatory adverse-event reporting; predetermined change control plan for any model updates with retraining and revalidation procedures specified.
The IDx-DR submission of 2018 broadly followed this template and is the baseline against which new diabetic-retinopathy AI submissions are now evaluated.
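As a back-of-envelope check on the endpoint in (d), the following normal-approximation calculation estimates how many disease-positive participants are needed for the lower confidence bound on sensitivity to clear a performance goal. The figures are illustrative, not the IDx-DR protocol, which also powered for specificity and enrolled far more patients overall because disease prevalence in a screening population is low:

```python
from math import ceil

def cases_needed(expected_sens=0.87, threshold=0.80, z=1.96):
    """Rough sample size: number of disease-positive participants such
    that the half-width of the 95% normal-approximation confidence
    interval around the expected sensitivity equals the margin to the
    performance goal. A full protocol would use an exact binomial test
    and add power (a z_beta term); this is the crude first pass."""
    p = expected_sens
    margin = expected_sens - threshold
    return ceil(z**2 * p * (1 - p) / margin**2)
```

With an expected sensitivity of 87% against an 80% goal, this gives on the order of ninety disease-positive cases, which (at a few percent prevalence) already implies several thousand enrolled patients, illustrating why pivotal screening studies are large and multi-site.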