17.12 Scientific discovery

Artificial intelligence is no longer a tool that scientists use occasionally to speed up a particular calculation. It has become a substrate on which whole fields of research now run. In structural biology, in materials chemistry, in weather forecasting, in pure mathematics, in particle physics and in astronomy, the working pattern of researchers has changed. Hypotheses are generated by neural networks, candidate molecules are screened by graph models, simulated experiments are run by world models, formal proofs are drafted by language models and verified by symbolic engines, and the loop closes with robot-driven laboratories that synthesise the proposals overnight. The boundary between "doing science" and "building AI" has blurred to the point where the most productive scientific groups now look indistinguishable from machine-learning research teams. The cultural transition is genuine and recent: in 2018 it was unusual for a structural biologist to train a neural network; by 2026 it is unusual for one not to.

This section covers the transformation parallel to the clinical one of §17.2, but in fundamental science, where the unit of work is the experiment, the model or the proof rather than the consultation. Clinical AI must satisfy regulators, protect patient safety and integrate into hospital workflow. Scientific AI must produce results that other researchers can replicate, hypotheses that survive experimental validation, and code that can be audited by peer reviewers. Both communities have been forced to develop new norms quickly, and both are still finding the right balance between speed and rigour.

Biology

The most widely cited application of AI to fundamental science is protein-structure prediction. The protein-folding problem (given an amino-acid sequence, predict the three-dimensional structure the protein adopts in solution) was a fifty-year grand challenge in biophysics. The Critical Assessment of Structure Prediction (CASP) competition, run since 1994, had seen incremental progress through the 2010s. CASP14, in late 2020, was the discontinuity. AlphaFold 2 (Jumper and colleagues, DeepMind, Nature 2021) produced predicted structures with a median backbone accuracy of roughly 1 Å across the CASP14 targets, a level of accuracy at the threshold of experimental crystallography itself, which had been the benchmark for a "solved" structure for decades. The model is an attention-based architecture that reasons jointly over a multiple-sequence alignment and a residue–residue pair representation, with a 3D structure module that iterates the prediction in physical space. It was trained on the Protein Data Bank, which contains roughly 200,000 experimentally solved structures, augmented by self-distillation on predicted structures.
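
The control flow is easier to see in code than in prose. The sketch below is a toy illustration of the recycling loop only: a few lines of placeholder arithmetic stand in for the deep attention stacks, and none of the tensor widths, block counts or update rules are DeepMind's.

```python
import numpy as np

rng = np.random.default_rng(0)
L, N_SEQS, D = 128, 64, 32   # residues, MSA depth, feature width (toy values)

def evoformer_block(msa, pair):
    # Placeholder mixing: the real block is axial attention over the MSA
    # plus triangle updates on the pair representation.
    msa = msa + 0.1 * np.tanh(msa @ rng.standard_normal((D, D)))
    pair = pair + 0.1 * np.tanh(msa.mean(axis=0)[:, None, :] + pair)
    return msa, pair

def structure_module(pair, coords):
    # Placeholder: nudge 3D coordinates toward consistency with the pair features.
    return coords + 0.1 * pair.mean(axis=-1)[:, :3]

msa = rng.standard_normal((N_SEQS, L, D))
pair = rng.standard_normal((L, L, D))
coords = np.zeros((L, 3))

for cycle in range(3):          # AlphaFold 2 "recycles" the whole prediction
    for _ in range(4):          # stack of Evoformer blocks (48 in the paper)
        msa, pair = evoformer_block(msa, pair)
    coords = structure_module(pair, coords)   # refined 3D prediction per cycle
```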

DeepMind followed AlphaFold 2 with the AlphaFold Protein Structure Database, a free public resource containing predicted structures for over 200 million proteins covering nearly all of UniProt, released through the European Bioinformatics Institute. The database had been used by over two million researchers by 2024, and the underlying paper became one of the most cited scientific papers of the decade. AlphaFold 3 (Abramson and colleagues, Nature 2024) extended the architecture to predict the structures of protein–ligand complexes, protein–nucleic-acid complexes, antibody–antigen pairs and post-translationally modified proteins. The expansion turned a sequence-to-structure tool into a tool for reasoning about molecular interaction, which is the unit of biological function and the unit of drug action. Isomorphic Labs, a sister company to DeepMind, has built a drug-discovery business on AlphaFold 3 and related models.

A parallel line of work in protein language models, originating with ESM (Evolutionary Scale Modeling, Rives and colleagues, Facebook AI Research, 2019) and continuing through ESM-2 (2022) and ESM-3 (2024), trains transformers on raw amino-acid sequences as if they were text. The hidden representations these models learn capture structural and functional features that emerge purely from the statistics of evolution. ESM-2, through its ESMFold structure head, enabled rapid structure prediction for novel proteins without requiring a multiple-sequence alignment, which is essential for orphan proteins and synthetic-biology design. ESM-3 added explicit conditioning on structure and function tokens, allowing it to be used generatively.
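
Using such a model is now a few lines of code. The example below follows the published fair-esm interface for ESM-2 (the sequence is arbitrary, and the pretrained weights download on first use); the per-residue embeddings it extracts are the representations referred to above.

```python
import torch
import esm

# Load pretrained ESM-2 (650M parameters) and its tokeniser ("alphabet").
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])   # final-layer hidden states

# Drop the BOS/EOS positions: one 1280-dim embedding per residue.
residue_embeddings = out["representations"][33][0, 1:-1]
```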

Generative protein design is the next step. RFDiffusion (Watson and colleagues, Nature 2023) is a diffusion model trained on protein structures that designs proteins from scratch with desired binding properties or symmetry. Its companion ProteinMPNN (Dauparas and colleagues, Science 2022) solves the inverse problem, proposing amino-acid sequences that fold to a specified structure. Together they enable de novo design of binders for therapeutic targets that have resisted small-molecule discovery, and the experimental success rate has been high enough to attract serious commercial investment. Combined with cryo-electron microscopy, AlphaFold prediction and high-throughput synthesis, these tools have compressed structural biology from a multi-year experimental project per target to a multi-week iteration loop.
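
Put together, the tools form a design loop: generate a backbone, design sequences for it, and check in silico that the sequences fold back to the intended structure before anything is synthesised. The sketch below is schematic; every callable is a hypothetical stand-in passed in by the caller, not the published interface of any of the tools named above.

```python
def design_binders(target_epitope, generate_backbone, design_sequences,
                   predict_structure, backbone_rmsd, cutoff=2.0):
    """One round of de novo binder design. All callables are placeholders:
    generate_backbone : epitope -> designed backbone      (RFDiffusion's role)
    design_sequences  : backbone -> candidate sequences   (ProteinMPNN's role)
    predict_structure : sequence -> predicted structure   (e.g. AlphaFold/ESMFold)
    backbone_rmsd     : (structure, structure) -> Angstroms
    """
    backbone = generate_backbone(target_epitope)
    accepted = []
    for seq in design_sequences(backbone):
        predicted = predict_structure(seq)
        # Self-consistency filter: keep sequences whose predicted fold
        # recovers the designed backbone to within the RMSD cutoff.
        if backbone_rmsd(predicted, backbone) < cutoff:
            accepted.append(seq)
    return accepted   # candidates worth taking to the wet lab
```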

Materials science

The materials-science equivalent of AlphaFold arrived with GNoME, Graph Networks for Materials Exploration (Merchant and colleagues, DeepMind, Nature 2023). GNoME is a graph neural network trained on the Materials Project database to predict the formation energy of crystalline solids. By generating candidate compositions and structures and filtering them through GNoME's energy predictions, the team identified 2.2 million new crystal structures, of which approximately 380,000 were predicted to be thermodynamically stable. To put the figure in context, total human knowledge of stable inorganic crystals before GNoME amounted to roughly 48,000 entries in the Inorganic Crystal Structure Database accumulated over a century of crystallography.
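
The decision rule is simple once the energy model exists: a candidate is predicted stable when its formation energy sits on or below the convex hull of known competing phases. A minimal sketch, with the energy model and hull lookup passed in as assumed callables (the real pipeline uses an ensemble of networks plus DFT verification):

```python
def screen_candidates(candidates, formation_energy, hull_energy, tol=0.0):
    """Keep candidates whose predicted energy above the convex hull <= tol.

    formation_energy : crystal -> predicted eV/atom (the GNN's role)
    hull_energy      : crystal -> eV/atom of the hull at that composition
    Both are placeholders for models/databases the real pipeline provides.
    """
    stable = []
    for crystal in candidates:
        e_above_hull = formation_energy(crystal) - hull_energy(crystal)
        if e_above_hull <= tol:          # on or below the hull: predicted stable
            stable.append((crystal, e_above_hull))
    # Examine the most comfortably stable candidates first.
    return sorted(stable, key=lambda pair: pair[1])
```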

GNoME is not, by itself, a replacement for experiment. A predicted-stable structure may still be kinetically inaccessible, expensive to synthesise, or only stable under conditions that disqualify it from practical use. The companion paper from Berkeley's A-Lab (Szymanski and colleagues, Nature 2023) demonstrated the closed-loop version: an autonomous synthesis laboratory that selected GNoME candidates, planned synthesis routes using a separate ML model, ran the syntheses on robotic platforms, characterised the products by X-ray diffraction, and fed the results back. The A-Lab synthesised 41 of 58 attempted novel compounds in 17 days of operation, an order-of-magnitude acceleration over conventional human-led synthesis. The result attracted critical scrutiny (some of the claimed novel compounds proved to be poorly characterised mixtures rather than phase-pure new materials), but the basic principle that AI-driven hypothesis generation can be coupled tightly to experimental validation is now well established.
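
The closed loop itself is worth writing down, because it is the pattern the rest of this section keeps returning to. Every callable below is a hypothetical stand-in for a component the A-Lab integrates (an active-learning selector, an ML route planner, robotic hardware, an XRD pipeline), not its actual software.

```python
def autonomous_campaign(candidates, budget, select_target, plan_synthesis,
                        run_synthesis, characterise):
    """Hypothesis -> synthesis -> characterisation -> updated hypothesis."""
    results = {}
    for _ in range(budget):
        target = select_target(candidates, results)   # pick the next experiment
        recipe = plan_synthesis(target)               # ML-proposed route
        product = run_synthesis(recipe)               # robotic platform
        results[target] = characterise(product)       # e.g. X-ray diffraction
        # The results dict feeds the next selection: failures are data too.
    return results
```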

Force-field models complement structure prediction. MACE (Batatia and colleagues, NeurIPS 2022) and the related Allegro and NequIP families are equivariant graph neural networks that predict atomic forces and energies at near-density-functional-theory accuracy but at orders-of-magnitude lower cost. They have made multi-million-atom molecular-dynamics simulations practical on academic compute budgets, where such work had previously required custom supercomputers. The downstream applications (battery-electrolyte design, catalyst screening, polymer engineering) are at the centre of decarbonisation research. Microsoft's MatterGen is a further generative model in the same space, Meta's OMat24 release paired a large open training dataset with pretrained models, and the MLIP (machine-learning interatomic potentials) community has converged on shared benchmarks and pretrained weights in a way that mirrors the trajectory of NLP a decade earlier.
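
In practice these potentials are consumed through standard simulation toolkits. The sketch below assumes the mace-torch package and its pretrained mace_mp foundation model alongside ASE; the system and run length are arbitrary choices for illustration.

```python
from ase import units
from ase.build import bulk
from ase.md.langevin import Langevin
from mace.calculators import mace_mp   # assumes the mace-torch package is installed

# A 256-atom copper cell; swap in any structure of interest.
atoms = bulk("Cu", "fcc", a=3.6).repeat((4, 4, 4))
atoms.calc = mace_mp(model="medium")   # pretrained MACE interatomic potential

# Langevin thermostat at 300 K; each step costs a GNN forward pass, not a DFT call.
dyn = Langevin(atoms, timestep=1.0 * units.fs, temperature_K=300, friction=0.01)
dyn.run(1000)   # one picosecond of dynamics
```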

Weather and climate

Numerical weather prediction is one of the great computational achievements of the twentieth century. The European Centre for Medium-Range Weather Forecasts and the US National Weather Service have run global forecasting models for decades, each consuming supercomputer resources worth hundreds of millions of pounds. GraphCast (Lam and colleagues, DeepMind, Science 2023) is a graph neural network trained on forty years of ERA5 reanalysis data that produces 10-day global forecasts at 0.25° resolution in roughly one minute on a single TPU v4, compared with hours of supercomputer time for the conventional models. On the headline metrics (root-mean-square error of 500 hPa geopotential height, surface temperature and mean sea-level pressure), GraphCast outperformed the operational ECMWF Integrated Forecasting System on roughly 90% of variables and lead times. Pangu-Weather (Bi and colleagues, Huawei, Nature 2023) reached comparable performance with a different architecture (a hierarchical 3D Earth-specific transformer) at similar speed. ECMWF now runs an AI Forecasting System operationally alongside its traditional model.
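
The inference pattern behind these speeds is a plain autoregressive rollout: the network maps the two most recent atmospheric states to the state six hours ahead, and a 10-day forecast is forty applications of the same function. A toy sketch, with a stand-in model in place of the trained GNN:

```python
import numpy as np

def rollout(model, state_prev, state_curr, n_steps=40):
    """40 six-hour steps = a 10-day forecast, one model call per step."""
    trajectory = []
    for _ in range(n_steps):
        state_next = model(state_prev, state_curr)
        trajectory.append(state_next)
        state_prev, state_curr = state_curr, state_next   # slide the window
    return np.asarray(trajectory)

# Toy stand-in for the trained network: persistence plus small noise.
rng = np.random.default_rng(0)
toy_model = lambda prev, curr: curr + 0.01 * rng.standard_normal(curr.shape)

# Toy 2-degree grid with 5 variables; GraphCast runs at 0.25 deg (721 x 1440).
state0 = np.zeros((91, 180, 5))
forecast = rollout(toy_model, state0, state0)   # shape (40, 91, 180, 5)
```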

The implications go beyond raw forecast skill. Because the models are fast, ensemble forecasting (running many perturbed initialisations to estimate uncertainty) becomes routine. Because they are differentiable, they can be coupled to optimisation: planning energy markets, routing aircraft, scheduling agriculture, sizing flood defences. GenCast (DeepMind, 2024) extended the original work to probabilistic ensembles, and an operational variant of GraphCast was fine-tuned on ECMWF's own analysis data. Climate downscaling (turning coarse-resolution global climate model output into local-scale projections useful for adaptation planning) is being transformed similarly by FourCastNet (NVIDIA), CorrDiff (NVIDIA, applied to Taiwanese typhoon downscaling) and academic equivalents. Wildfire risk prediction, demand forecasting for renewable energy, and short-term flood warning are downstream applications that benefit directly from order-of-magnitude faster atmospheric simulation.
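
Ensembles then fall out almost for free: perturb the initial state, rerun the rollout, and read uncertainty off the spread of the members. A minimal sketch reusing the rollout function from above (the perturbation scheme here is the crudest possible; operational systems perturb more carefully):

```python
def ensemble_forecast(model, state_prev, state_curr, n_members=50,
                      eps=1e-3, seed=0):
    """Perturbed-initial-condition ensemble; cheap once each rollout is fast."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        noise = eps * rng.standard_normal(state_curr.shape)
        members.append(rollout(model, state_prev + noise, state_curr + noise))
    members = np.asarray(members)        # (members, steps, lat, lon, variable)
    return members.mean(axis=0), members.std(axis=0)   # forecast and spread

mean_forecast, spread = ensemble_forecast(toy_model, state0, state0, n_members=8)
```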

The caveats matter. The AI weather models are trained on reanalysis data that is itself produced by traditional models; they are not yet a complete replacement for the physics-based pipeline. They struggle with extreme events that lie outside the training distribution. And the climate-projection problem, forecasting decades ahead under changing forcings, is fundamentally harder than the weather-prediction problem and not directly addressed by current architectures. Even with these caveats, the speedup is genuine and operationally significant.

Mathematics

For a long time mathematics was treated as the field most resistant to AI assistance, on the grounds that mathematical reasoning required something qualitatively different from pattern matching over training data. The argument has weakened considerably. AlphaProof (DeepMind, July 2024), together with AlphaGeometry 2, achieved silver-medal performance at the 2024 International Mathematical Olympiad, solving four of the six problems for 28 of 42 points, one point below the gold-medal threshold. The architecture combined a Gemini-based language model that proposed candidate proof tactics in Lean 4 with AlphaZero-style reinforcement learning over the proof-search tree. Each proof was verified mechanically by Lean, removing the possibility of plausible-but-wrong reasoning that had been the failure mode of earlier proof-generation attempts.
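
The loop is worth seeing schematically, because the division of labour is the whole point: the language model only ranks candidate steps, the symbolic kernel is the sole judge of correctness, and search glues them together. Both callables below are assumed interfaces for illustration, not any system's real API.

```python
import heapq
import itertools

def best_first_proof_search(goal, propose_tactics, verify_step, max_nodes=10_000):
    """Neural proposer + symbolic verifier + search, schematically.

    propose_tactics(state) -> [(tactic, log_prob), ...]   (the language model)
    verify_step(state, tactic) -> list of remaining subgoals ([] if closed),
                                  or None if the kernel rejects the step.
    """
    counter = itertools.count()   # tie-breaker so the heap never compares states
    frontier = [(0.0, next(counter), [goal], [])]   # (cost, tie, goals, script)
    expanded = 0
    while frontier and expanded < max_nodes:
        cost, _, goals, script = heapq.heappop(frontier)
        expanded += 1
        if not goals:
            return script            # every subgoal discharged: a checked proof
        state, rest = goals[0], goals[1:]
        for tactic, log_prob in propose_tactics(state):
            result = verify_step(state, tactic)
            if result is None:       # kernel rejected: prune, never trust the LM
                continue
            heapq.heappush(frontier, (cost - log_prob, next(counter),
                                      result + rest, script + [tactic]))
    return None                      # search budget exhausted
```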

AlphaProof's IMO companion, AlphaGeometry 2, extended the AlphaGeometry system published in Nature in January 2024 to handle the harder geometry problems on the 2024 IMO. The pattern (neural proposer plus symbolic verifier plus search) has become the dominant template for difficult mathematical reasoning. By mid-2025, experimental reasoning models from OpenAI and Google DeepMind were reported to have reached gold-medal scores on the 2025 IMO problems. The cultural reaction in the mathematical community has shifted from scepticism to engagement; conjecture-generation tools, automated proof assistants and AI-driven literature search are now part of working mathematicians' toolkits.

The Lean theorem-prover community has integrated LLM-based tactic suggestion through projects such as Lean Copilot and llmstep. Terence Tao led a crowd-sourced Lean formalisation of the proof of the polynomial Freiman–Ruzsa conjecture (his joint work with Gowers, Green and Manners), which, with assistance from AI coding tools, completed in weeks rather than the months purely manual formalisation would have required. Davies and colleagues' 2021 Nature paper on AI-discovered conjectures in knot theory and representation theory remains the early demonstration that ML pattern-finding can suggest genuinely new mathematical structure, which human mathematicians can then prove formally. The collaborative pattern (machine proposes, human disposes, machine verifies) has so far been more productive than either fully automated proof or fully traditional pencil-and-paper mathematics.
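
From the working mathematician's side the interaction is mundane: state the theorem, accept or reject suggested tactics, and let the kernel arbitrate. A toy Lean 4 example (core library only; the comment marks where a suggestion tool would interject):

```lean
-- The human states the goal; a tactic-suggestion tool proposes closing steps.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b   -- a suggested step is kept only if the kernel accepts it
```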

Particle physics, astronomy

The Large Hadron Collider produces roughly one petabyte of raw collision data per second, of which only a small fraction can be stored. The trigger system that decides which collisions to keep relies on increasingly sophisticated machine-learning models: boosted decision trees, then convolutional networks, now graph neural networks operating on the irregular geometry of the detectors. Jet tagging (classifying the cascade of particles produced when a quark or gluon hadronises) has been transformed by graph and transformer architectures, and the ATLAS and CMS collaborations have published a succession of papers improving Higgs-boson, top-quark and exotic-particle searches by exploiting these methods. Anomaly-detection methods adapted from generative modelling are used to search for new physics in collision events that do not fit any expected Standard-Model template.
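
The generative-modelling version of the idea is compact: train an autoencoder only on Standard-Model-like events and treat reconstruction error as an anomaly score, since events the model has never had to compress reconstruct poorly. A self-contained sketch (the 19-feature event format and the architecture are toy choices, not an ATLAS or CMS convention):

```python
import torch
import torch.nn as nn

class EventAutoencoder(nn.Module):
    """Compress an event to a small latent code and reconstruct it."""
    def __init__(self, n_features=19, latent=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))
    def forward(self, x):
        return self.decoder(self.encoder(x))

model = EventAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sm_events = torch.randn(4096, 19)   # stand-in for simulated Standard-Model events

for _ in range(100):                # train to reconstruct ordinary physics only
    recon = model(sm_events)
    loss = nn.functional.mse_loss(recon, sm_events)
    opt.zero_grad(); loss.backward(); opt.step()

def anomaly_score(events):
    # At analysis time, per-event reconstruction error is the anomaly score.
    with torch.no_grad():
        return ((model(events) - events) ** 2).mean(dim=1)
```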

Astronomy has been similarly remade. The Vera Rubin Observatory, operating from 2025, is producing roughly twenty terabytes of imaging per night and an alert stream of millions of transient events; classification at this scale is only tractable with neural-network pipelines. Galaxy morphology classification, photometric-redshift estimation, gravitational-lens identification and supernova typing are all now standard ML applications. Gravitational-wave detection at LIGO uses matched filtering against template banks, but ML-based pipelines now identify candidate signals faster than the traditional approach and find low-significance events that the templates miss. Pulsar searches in the Square Kilometre Array precursor data use deep learning to triage candidates from interference, where the false-positive rate from radio-frequency interference would otherwise be overwhelming. Across the sky surveys and the experimental programmes, the working assumption is that any new instrument is built with an ML pipeline as a first-class component rather than a downstream add-on.
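
For context, the matched-filtering baseline that the ML pipelines are benchmarked against fits in a dozen lines: cross-correlate (whitened) detector output with each waveform template and look for peaks. A minimal numpy sketch with a toy chirp in place of a physical template:

```python
import numpy as np

def matched_filter_snr(data, template):
    """Circular cross-correlation via FFT; assumes whitened, unit-variance noise."""
    n = len(data)
    corr = np.fft.irfft(np.fft.rfft(data) * np.conj(np.fft.rfft(template, n)), n)
    return corr / np.sqrt(np.sum(template ** 2))   # normalise to SNR units

# Toy chirp template injected weakly into white noise.
rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 4096)
template = np.sin(2 * np.pi * (50 + 200 * t) * t) * np.exp(4 * (t - 1))
data = rng.standard_normal(16384)
data[5000:5000 + template.size] += 0.5 * template   # injection at sample 5000
snr = matched_filter_snr(data, template)
print(int(np.argmax(snr)))   # peaks near 5000, recovering the injection time
```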

What you should take away

  1. Structural biology has been transformed. AlphaFold 2 produced near-experimental accuracy on protein structures; AlphaFold 3 extended the work to molecular interactions; ESM language models and RFDiffusion have made de novo protein design routine.
  2. Materials science is in a similar transition. GNoME proposed 380,000 stable crystal structures, the A-Lab demonstrated closed-loop autonomous synthesis, and equivariant force-field models such as MACE have made large-scale molecular dynamics affordable.
  3. Weather forecasting is faster and competitive. GraphCast and Pangu-Weather match operational physics models on most metrics at a tiny fraction of the compute, enabling ensemble forecasting and downstream optimisation that was previously impractical.
  4. Mathematical reasoning is no longer immune. Neural proposer plus symbolic verifier plus search delivered IMO-silver and IMO-gold performance in 2024–2025; mathematicians are now integrating proof assistants and conjecture-generation tools into routine practice.
  5. The unit of progress is the closed loop, not the model. The most productive groups in every field above pair AI hypothesis generation with rigorous experimental or formal validation. The 1990s rational-design optimism failed because computational predictions were treated as sufficient; the current generation has been more careful, and the gains are correspondingly more durable.
