Protein folding is the physical process by which a linear chain of amino acids, the primary sequence, adopts a specific three-dimensional structure that determines the protein's biochemical function. Predicting that structure from sequence alone, the protein-folding problem, was a grand challenge of biology for half a century.
Levels of structure
- Primary: the amino-acid sequence, a string over the 20-letter alphabet of standard residues.
- Secondary: local structural motifs ($\alpha$-helices, $\beta$-sheets, turns) stabilised by backbone hydrogen bonds.
- Tertiary: the global 3-D fold of a single chain, parameterised by the Cartesian coordinates $\{r_i\}$ of each atom or by backbone torsion angles $\phi, \psi, \omega$.
- Quaternary: the assembly of multiple chains (subunits) into a complex.
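The torsion-angle parameterisation mentioned under tertiary structure can be made concrete with a small sketch. Each of $\phi, \psi, \omega$ is a dihedral angle defined by four consecutive backbone atoms, so one function covers all three. The function name `dihedral_degrees` and the toy coordinates are assumptions for illustration, not taken from any particular library, and the sign convention is simply the one produced by the standard `atan2` formula used here.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four 3-D points.

    For a backbone torsion, the points would be four consecutive
    backbone atoms; which atoms to pass depends on the angle wanted.
    """
    b0 = _sub(p1, p0)          # first bond vector
    b1 = _sub(p2, p1)          # central bond (rotation axis)
    b2 = _sub(p3, p2)          # last bond vector
    n1 = _cross(b0, b1)        # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)        # normal of the plane through p1, p2, p3
    b1_len = math.sqrt(_dot(b1, b1))
    # atan2 avoids the numerical problems of acos near 0 and 180 degrees.
    y = b1_len * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Toy geometry (not real protein coordinates): the last atom is rotated
# a quarter turn about the central bond relative to the first atom.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Describing a chain by torsion angles instead of Cartesian coordinates reduces the number of free parameters, because bond lengths and bond angles vary little and can be treated as fixed.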
The Anfinsen hypothesis, also called the thermodynamic hypothesis (Christian Anfinsen, Nobel Prize in Chemistry 1972), states that under physiological conditions the native structure is the global minimum of the free-energy landscape; in other words, the fold is determined by the sequence alone.
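As a compact way of writing this statement, one can treat the free energy as a function of conformation. The symbols here are introduced only for illustration: $s$ is the amino-acid sequence, $x$ a conformation of the chain, and $G(x \mid s)$ an effective free energy with the solvent averaged out.

$$
x_{\text{native}}(s) = \arg\min_{x} \, G(x \mid s)
$$

Read this way, structure prediction is a search for the minimiser $x_{\text{native}}$ given only $s$, which is why the size of the conformational search space matters so much.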
History
From Linus Pauling's 1951 prediction of the $\alpha$-helix and John Kendrew's 1958 X-ray structure of myoglobin onward, structure determination was driven by experiment: X-ray crystallography, NMR, and increasingly cryo-EM. Computational folding was attempted with molecular dynamics (simulating atomic motion under empirical force fields such as CHARMM and AMBER), fragment assembly (Rosetta, David Baker), and homology modelling (applicable when a related structure was known). Progress was steady but incomplete, and has been measured biennially since 1994 by the CASP (Critical Assessment of Structure Prediction) blind competition.
AlphaFold and the modern era
AlphaFold 2, from DeepMind, was widely described as having "essentially solved" the problem after its performance at CASP14 in late 2020. Its architecture combined:
- Multiple sequence alignments (MSAs): evolutionary co-variation across homologous sequences signals contacts between residues.
- Evoformer: a Transformer-style block that alternates attention along MSA rows (across residues) and along MSA columns (across sequences), and updates a pair representation $z_{ij}$ for every pair of residues.
- Structure module: an SE(3)-equivariant decoder that produces backbone frames and side-chain torsions, refined by recycling.
- End-to-end training on the Protein Data Bank with auxiliary losses (including distogram prediction, a histogram over inter-residue distances) and self-distillation.
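To illustrate why co-variation in an MSA carries contact information, the sketch below computes the mutual information between two alignment columns. This is a classical, much simpler statistic than anything inside AlphaFold 2, which learns from the raw alignment and does not compute mutual information explicitly. The function name `mutual_information` and the four-sequence toy alignment are assumptions made up for this example.

```python
import math
from collections import Counter

def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `msa` is a list of equal-length aligned sequences. High values mean
    that the residue at column i is informative about the residue at
    column j, which is the kind of signal that suggests a contact.
    """
    n = len(msa)
    count_i = Counter(seq[i] for seq in msa)
    count_j = Counter(seq[j] for seq in msa)
    count_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in count_ij.items():
        p_ab = count / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Toy alignment (invented): columns 1 and 3 always change together
# (K with E, R with D), while column 2 varies independently of column 1.
toy_msa = [
    "AKLE",
    "ARLD",
    "AKVE",
    "ARVD",
]

print(mutual_information(toy_msa, 1, 3))  # co-varying pair
print(mutual_information(toy_msa, 1, 2))  # independent pair
```

On real alignments, raw mutual information is confounded by phylogeny and by indirect couplings through a third residue, which is part of the motivation for learning the pair representation end to end.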
AlphaFold 2 often produces predictions with backbone RMSD under 1 Å for small monomers. AlphaFold 3 (2024) extends prediction to complexes that include ligands, nucleic acids and modified residues, using a diffusion-based decoder. ESMFold (Meta) builds on the ESM-2 protein language model and needs no MSA, which makes prediction fast. RFDiffusion (Baker lab) inverts the problem, designing new proteins that match specified geometries.
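The backbone RMSD quoted above is the root-mean-square deviation between matched atoms, $\sqrt{\tfrac{1}{N}\sum_i \lVert r_i - r_i' \rVert^2}$, where $r_i$ and $r_i'$ are the predicted and reference coordinates of atom $i$. The sketch below assumes the two structures have already been optimally superimposed (for example with the Kabsch algorithm, which is not implemented here). The function name `rmsd` and the coordinates are illustrative assumptions only.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) coordinates.

    Assumes the structures are already superimposed; without that step
    the value also reflects an arbitrary rigid-body offset.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    if not coords_a:
        raise ValueError("coordinate lists must not be empty")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Hypothetical backbone coordinates in angstroms (invented values).
predicted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 3.0, 0.0)]

print(rmsd(predicted, reference))
```

Because the deviations are squared before averaging, a single badly placed atom, as in this toy example, can dominate the result.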
The AlphaFold Protein Structure Database now contains predicted structures for over 200 million proteins, nearly every sequence in UniProt, and is transforming biology, drug discovery and biotechnology.
Related terms: AlphaFold, AlphaFold 3, ESM-2, RFDiffusion, Transformer
Discussed in:
- Chapter 14: Generative Models, AI in Biology and Medicine