Glossary

ESM-2

ESM-2 (Evolutionary Scale Modeling, version 2), released by Lin et al. (Meta FAIR) in Science in 2023, is a family of Transformer protein language models trained by masked language modelling on UniRef sequences, scaling from 8 million to 15 billion parameters. At release it was the largest protein language model with publicly available weights, and it demonstrated that evolutionary information can be absorbed into the weights of a self-supervised model, enabling structure prediction from a single sequence without the multiple-sequence-alignment (MSA) lookup that AlphaFold 2 requires.

The architecture is a standard pre-norm Transformer encoder with rotary position embeddings, trained on ~65 million unique sequences from UniRef50 using BERT-style 15% token masking. The training objective is the per-residue cross-entropy $\mathcal{L} = -\sum_{i \in \mathcal{M}} \log p_\theta(x_i \mid x_{\setminus \mathcal{M}})$ over the masked positions $\mathcal{M}$, where the vocabulary comprises the 20 standard amino acids plus tokens for rare or ambiguous residues, gaps, and special symbols such as the mask token. As scale grows, the model's internal attention maps come to recover the residue–residue contact patterns that emerge from co-evolution, a phenomenon the team verified by fitting a logistic-regression probe to the attention maps and scoring unsupervised contact prediction against trRosetta benchmarks.
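
To make the objective concrete, here is a minimal PyTorch sketch of BERT-style masking and the masked-position cross-entropy. It illustrates the recipe described above, not ESM-2's actual training code; the `corrupt` helper is hypothetical, and the 80/10/10 replacement split is the standard BERT convention.

```python
import torch
import torch.nn.functional as F

def corrupt(tokens, mask_idx, vocab_size, p=0.15):
    """BERT-style corruption (illustrative helper, not ESM-2's code):
    select ~15% of positions as the masked set M; of those, 80% become
    <mask>, 10% become a random token, 10% stay unchanged but are still
    prediction targets."""
    selected = torch.rand(tokens.shape) < p            # the masked set M
    r = torch.rand(tokens.shape)
    corrupted = tokens.clone()
    corrupted[selected & (r < 0.8)] = mask_idx
    swap = selected & (r >= 0.8) & (r < 0.9)
    corrupted[swap] = torch.randint(0, vocab_size, tokens.shape)[swap]
    return corrupted, selected

def mlm_loss(logits, targets, selected):
    """Per-residue cross-entropy over the masked set M only.
    logits: (batch, length, vocab); targets: (batch, length); selected: bool mask."""
    return F.cross_entropy(logits[selected], targets[selected])
```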

The accompanying ESMFold couples the 3B-parameter ESM-2 to a folding head that ingests its embeddings and produces atomic coordinates via a simplified, MSA-free Evoformer-style trunk and a structure module similar in spirit to AlphaFold 2's. By exchanging MSA search for a single forward pass, ESMFold runs up to ~60× faster than AF2 at inference, taking seconds rather than minutes and enabling predictions for sequences that have no detectable homologues. The team used ESMFold to release the ESM Metagenomic Atlas: 617 million predicted structures of metagenomic proteins from MGnify, the largest single deposit of structural information in biology and roughly three orders of magnitude bigger than the experimentally solved PDB.
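
For reference, predicting a structure with the open-source fair-esm package looks like the sketch below. It assumes `fair-esm` is installed with its ESMFold extras (which pull in OpenFold dependencies) and that sufficient GPU memory is available; the sequence shown is just a placeholder.

```python
import torch
import esm

# Load ESMFold: the 3B-parameter ESM-2 language model plus the folding head.
model = esm.pretrained.esmfold_v1()
model = model.eval()  # add .cuda() if a GPU is available

# Any single sequence works: no MSA search, no templates.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # one forward pass -> PDB text

with open("prediction.pdb", "w") as handle:
    handle.write(pdb_string)
```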

On orphan and metagenomic proteins, ESMFold matches or beats AF2 because there is no MSA for AF2 to exploit; on well-aligned proteins, AF2 retains a meaningful edge, on the order of a few hundredths of TM-score. ESM-2 embeddings have become a default representation across protein machine learning: they outperform one-hot or BLOSUM features for variant-effect prediction, fitness-landscape regression on deep mutational scanning (DMS) data, enzyme classification, signal-peptide detection, and language-model-guided directed evolution. Related models such as ESM-IF (inverse folding) and ESM-3 (a multimodal protein model integrating sequence, structure, and function tokens) extend the same line of work.
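
Extracting those embeddings takes a few lines with the fair-esm package. The sketch below uses the mid-sized 650M-parameter, 33-layer checkpoint and mean-pools the final-layer per-residue representations into a fixed-length feature vector; the pooling choice is a common downstream convention, not part of ESM-2 itself.

```python
import torch
import esm

# Load a mid-sized ESM-2 checkpoint (650M parameters, 33 layers).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("query", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])

# Final-layer per-residue embeddings; positions 1..L (0 is BOS, L+1 is EOS).
reps = out["representations"][33]              # (batch, length, 1280)
per_residue = reps[0, 1 : len(strs[0]) + 1]    # (seq_len, 1280)
feature_vector = per_residue.mean(dim=0)       # input to a downstream classifier
```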

The conceptual contribution is broader than its benchmarks. ESM-2 is the clearest demonstration that self-supervised scaling laws apply to biology: as compute and parameters grew, both perplexity and downstream structural fidelity improved smoothly, mirroring observations on natural-language LLMs. It established the protein language model as a foundational substrate alongside structure-prediction networks.

Related terms: Transformer, Protein Folding, AlphaFold, AlphaFold 3, Foundation Model
