RFDiffusion, Glossary, Textbook of AI

RFDiffusion, introduced by Watson, Juergens, Bennett et al. (Baker lab, Nature 2023), is a generative diffusion model for de novo protein design that uses a fine-tuned RoseTTAFold network as its denoiser. It generates novel protein backbones conditioned on user-specified design objectives (binding a given target protein, scaffolding a functional motif, or matching a topology) and has produced numerous experimentally validated designs, including high-affinity binders for therapeutic targets identified in a single round of design and screening.

The forward diffusion process treats each residue as a rigid frame and corrupts its two components independently. Gaussian noise is added to the translation vectors, $\mathbf{x}_t = \sqrt{\bar\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\bar\alpha_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, while the residue-frame rotations are perturbed by Brownian motion on $\mathrm{SO}(3)$. The reverse process is parameterised by a RoseTTAFold network whose weights have been fine-tuned for denoising: rather than predicting a structure from sequence, the network at each timestep takes the noised coordinates and predicts the clean coordinates $\hat{\mathbf{x}}_0$, reusing the three-track architecture and SE(3)-equivariant structure updates of the original folding model. Training runs on PDB monomers and oligomers with a denoising objective, written here for the translational component as $\mathcal{L} = \mathbb{E}_{t}\,\|\hat{\mathbf{x}}_0 - \mathbf{x}_0\|^2$, together with an analogous term on the rotations and auxiliary losses on predicted pair features. Unlike the frame-aligned point error (FAPE) used to train RoseTTAFold for structure prediction, this loss is computed without aligning the prediction to the target.
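As a rough illustration of the forward process and loss described above, the NumPy sketch below noises translations with the closed-form expression for $\mathbf{x}_t$ and approximates Brownian motion on $\mathrm{SO}(3)$ by composing many small random rotations. The linear $\beta$ schedule, the function names, and the random-walk discretisation are assumptions made for illustration; they are not the schedule or rotation sampler used in the paper.

```python
import numpy as np

def alpha_bar_schedule(T, beta_min=1e-4, beta_max=2e-2):
    """Cumulative signal fraction alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.cumprod(1.0 - betas)

def noise_translations(x0, alpha_bar_t, rng):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def rotvec_to_matrix(v):
    """Rodrigues' formula: rotation about axis v/|v| by angle |v|."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.eye(3)
    k = v / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def noise_rotations(R0, sigma, rng, n_steps=50):
    """Crude Brownian motion on SO(3): compose n_steps small random rotations.

    R0 has shape (L, 3, 3); sigma sets the overall spread (illustrative scaling).
    """
    R = R0.copy()
    step = sigma / np.sqrt(n_steps)
    for _ in range(n_steps):
        for i in range(R.shape[0]):
            R[i] = R[i] @ rotvec_to_matrix(step * rng.standard_normal(3))
    return R

def x0_loss(x0_hat, x0):
    """Mean squared error between predicted and true clean translations."""
    return float(np.mean(np.sum((x0_hat - x0) ** 2, axis=-1)))

# Toy usage: 8 residues with identity frames.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 3))
R0 = np.tile(np.eye(3), (8, 1, 1))
alpha_bar = alpha_bar_schedule(T=200)
x_t = noise_translations(x0, alpha_bar[100], rng)
R_t = noise_rotations(R0, sigma=0.5, rng=rng)
```

In a real model the denoiser would receive `x_t` and `R_t` and return a prediction of the clean frames; here `x0_loss` only shows the shape of the translational term.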

Conditioning is the team's central engineering contribution. The model supports several modes: (i) unconditional generation of monomeric folds; (ii) symmetric generation by averaging predictions over symmetry copies at each step; (iii) motif scaffolding, where a fixed substructure (e.g. an active site or epitope) is held in place and the rest of the protein is grown around it via inpainting; (iv) binder design, where a new chain is generated in the presence of the target protein's structure, with partial diffusion (noising an existing design part-way and denoising it again) available to explore variations around a candidate. Combined with ProteinMPNN for inverse-folding the generated backbone into a sequence, RFDiffusion forms a complete design pipeline.
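Two of these conditioning modes can be sketched as simple operations applied to the denoiser's output at each step. The helpers below are hypothetical and only illustrate the idea: `hold_motif` overwrites motif residues with their fixed coordinates, and `symmetrize_cn` averages the chains of a cyclic ($C_n$) assembly in a common frame before regenerating exact copies. The names, the choice of the z axis as symmetry axis, and the coordinates-only representation are assumptions, not the paper's implementation.

```python
import numpy as np

def hold_motif(x_pred, motif_coords, motif_mask):
    """Return a copy of x_pred (L, 3) with motif residues reset to fixed coordinates."""
    out = x_pred.copy()
    out[motif_mask] = motif_coords
    return out

def cyclic_rotation(k, n):
    """Rotation by 2*pi*k/n about the z axis (assumed symmetry axis)."""
    ang = 2.0 * np.pi * k / n
    c, s = np.cos(ang), np.sin(ang)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])

def symmetrize_cn(x_chains):
    """Average predictions over symmetry copies for a C_n assembly.

    x_chains has shape (n, L, 3); chain k is expected to be chain 0 rotated by
    cyclic_rotation(k, n). Each chain is mapped back to chain 0's frame, the
    copies are averaged, and exact symmetric copies are regenerated.
    """
    n = x_chains.shape[0]
    rots = [cyclic_rotation(k, n) for k in range(n)]
    # Coordinates are row vectors, so x @ R undoes a rotation applied as x @ R.T.
    asu = np.mean([x_chains[k] @ rots[k] for k in range(n)], axis=0)
    return np.stack([asu @ rots[k].T for k in range(n)])

# Toy usage: keep residues 3..5 fixed, and symmetrise a noisy trimer.
rng = np.random.default_rng(1)
x_pred = rng.standard_normal((10, 3))
mask = np.zeros(10, dtype=bool)
mask[3:6] = True
motif = np.ones((3, 3))
x_scaffold = hold_motif(x_pred, motif, mask)
trimer = symmetrize_cn(rng.standard_normal((3, 10, 3)))
```

In a sampling loop, operations like these would be applied after every denoising step so that the constraint holds throughout the trajectory rather than only at the end.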

Experimental validation is unusually strong for a generative-AI paper. Of binder candidates produced for five therapeutically relevant targets (influenza HA, IL-7Rα, PD-L1, insulin receptor, TrkA), 19% bound the target experimentally on first attempt, roughly two orders of magnitude better than the ~0.1% hit rate of the previous Rosetta-based pipeline. A cryo-EM structure of a designed binder in complex with influenza haemagglutinin is nearly identical to the design model. The team has since released RoseTTAFold All-Atom (RFAA), a structure-prediction network for assemblies of proteins, nucleic acids and small molecules, and RFDiffusion All-Atom, which builds on it for ligand-aware design (generating a binding pocket around a small molecule).
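To make the practical meaning of these hit rates concrete, the short calculation below compares the two figures quoted above (19% and roughly 0.1%). It assumes, purely for illustration, that each tested design is an independent trial with the same success probability; real campaigns will deviate from this, and the function names are hypothetical.

```python
import math

def fold_improvement(new_rate, old_rate):
    """Ratio of two hit rates."""
    return new_rate / old_rate

def designs_needed(rate, confidence=0.95):
    """Smallest n with 1 - (1 - rate)**n >= confidence, assuming independent designs."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))

ratio = fold_improvement(0.19, 0.001)   # about 190-fold
orders = math.log10(ratio)              # a little over two orders of magnitude
n_new = designs_needed(0.19)            # designs to test at a 19% hit rate
n_old = designs_needed(0.001)           # designs to test at a 0.1% hit rate
print(f"{ratio:.0f}x, {orders:.1f} orders of magnitude; {n_new} vs {n_old} designs")
```

Under these assumptions, a 95% chance of at least one binder requires testing about 15 designs at a 19% hit rate, against roughly 3,000 at 0.1%, which is the difference between a small plate-based assay and a large library screen.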

RFDiffusion has reshaped protein engineering in roughly the same way AlphaFold 2 reshaped structure prediction: a problem that required years of expert iteration now yields candidates by the thousand at the prompt of a designer. Its existence, together with ESM-2 for representation and AF3 for verification, makes the closed-loop design–build–test cycle for proteins computationally tractable for the first time.

Related terms: Diffusion Model, AlphaFold, AlphaFold 3, Protein Folding

Discussed in:

Chapter 17: Applications, Protein Design

AI tools used: Claude (research, coding, text), ChatGPT (diagrams, images), Grammarly (editing).