Breakthrough in AI-Driven Protein Analysis
The release of ESMFold2 marks a significant advancement in computational biology. Researchers at the Chan Zuckerberg Biohub unveiled the model on May 27, 2026, alongside a comprehensive atlas containing 1.1 billion predicted protein structures derived from 6.8 billion protein sequences. This open-source resource substantially expands the catalog of known protein forms beyond previous efforts.
Understanding Protein Structure Prediction
Proteins perform essential functions in living organisms, and their three-dimensional shapes determine how they interact with other molecules. Predicting these shapes from amino acid sequences has long challenged scientists. Traditional experimental methods like X-ray crystallography and cryo-electron microscopy provide accurate results but require significant time and resources. AI models now accelerate this process by learning patterns from vast datasets of known structures and sequences.
ESMFold2 builds on protein language models that treat amino acid sequences similarly to words in natural language processing. These models capture evolutionary relationships across billions of sequences, enabling direct inference of atomic-level structures without relying on multiple sequence alignments in many cases.
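To make the language analogy concrete, the sketch below shows one way a protein sequence could be turned into tokens, one per residue, before a language model reads it. The vocabulary layout, the special tokens, and the function names are illustrative assumptions, not the actual ESMC tokenizer. The masking helper mirrors the masked-token objective used to train ESM-2. Whether ESMC uses the same objective is not stated in this article.

```python
# Toy amino acid tokenizer. Illustrative only: the vocabulary order and
# special tokens are assumptions, not the real ESMC vocabulary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>", "<mask>"]  # assumed set

VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence: str) -> list[int]:
    """Map a protein sequence to token ids, one token per residue."""
    ids = [VOCAB["<cls>"]]
    for residue in sequence.upper():
        # Non-standard residue codes fall back to the unknown token.
        ids.append(VOCAB.get(residue, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids


def mask_position(ids: list[int], position: int) -> list[int]:
    """Return a copy with one token hidden, as in masked language modeling."""
    masked = list(ids)
    masked[position] = VOCAB["<mask>"]
    return masked


if __name__ == "__main__":
    ids = tokenize("MKTAYIAK")
    print(ids)
    print(mask_position(ids, 3))
```

During training, a model of this kind is asked to recover the hidden residue from its context, which is how patterns shared across related sequences end up encoded in its internal representations.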
Development of ESM Models at Scale
The foundation for ESMFold2 traces back to earlier work on Evolutionary Scale Modeling. A team at Meta AI developed ESM-2, a protein language model scaled up to 15 billion parameters and trained on hundreds of millions of protein sequences; it was first described in a 2022 preprint and published in 2023. This scaling revealed that structural information emerges naturally in the model's representations, which led to ESMFold for rapid structure prediction.
Biohub recruited key members of the EvolutionaryScale team approximately seven months prior to the May 2026 release. The new system incorporates ESMC, a language model trained on approximately 2.8 billion sequences spanning diverse life forms, including extremophiles and over 20,000 human protein types. ESMFold2 then translates these representations into precise three-dimensional models of proteins and their complexes.
Key Features of the ESM Atlas
The ESM Atlas organizes 6.8 billion sequences and 1.1 billion predicted structures using relationships learned by ESMC. This approach surfaces connections not captured in existing databases, such as evolutionary links between CRISPR-associated defense proteins and a gene-editing protein identified in a soil fungus in 2023, now observed across other eukaryotic species.
Most sequences originate from metagenomic sources in environments like soil and oceans, many previously uncharacterized. The atlas makes this unannotated biology searchable, supporting researchers studying diseases with limited prior molecular understanding.
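One common way to make unannotated sequences searchable is to compare their learned embeddings and return the closest matches. The sketch below illustrates that idea with cosine similarity over a tiny hand-written index. It is an assumption about the general technique, not a description of how the ESM Atlas is implemented, and the protein names and three-dimensional vectors are made up for illustration.

```python
import math

# Hypothetical embeddings. Real language-model embeddings have hundreds or
# thousands of dimensions; these toy vectors only illustrate the mechanics.
TOY_INDEX = {
    "protein_a": [0.9, 0.1, 0.0],
    "protein_b": [0.8, 0.2, 0.1],
    "protein_c": [0.0, 0.1, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query: list[float], index: dict, k: int = 2) -> list:
    """Return the k entries most similar to the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for name, score in nearest_neighbors([1.0, 0.0, 0.0], TOY_INDEX):
        print(f"{name}: {score:.3f}")
```

A brute-force scan like this only works for small collections. At the scale of billions of sequences, a search service would need an approximate nearest-neighbor index or a similar structure.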
Performance Benchmarks and Comparisons
Biohub reports that ESMFold2 achieves state-of-the-art accuracy, particularly in predicting protein-protein interactions and antibody-antigen complexes. In Biohub's own benchmarks, the model compared favorably with AlphaFold 3 from Google DeepMind, Chai-1, and Boltz-1. It is also reported to perform well on challenging targets while remaining computationally efficient enough for large-scale applications.
Unlike some proprietary systems, ESMFold2 is released as open source under the MIT license. Researchers anywhere can inspect, modify, and build on the code without the usage restrictions common to closed models.
Practical Applications in Research and Design
Early uses include designing high-affinity protein binders targeting five disease-related proteins: EGFR and PDGFRβ in cancer pathways, PD-L1 and CTLA-4 as immune checkpoints, and CD45 in immune signaling. Laboratory validation showed a high success rate for these computationally designed molecules.
The atlas supports discovery of novel biology by enabling searches for structural similarities across distant evolutionary branches. Scientists can now explore metagenomic proteins at unprecedented scale, potentially identifying new enzymes or regulatory mechanisms relevant to biotechnology and medicine.
Implications for Academic and Research Communities
University laboratories and independent researchers gain immediate access to tools previously limited by computational barriers or licensing. The open release facilitates integration into existing workflows for structural biology, drug discovery, and synthetic biology programs.
PhD students and postdoctoral researchers in bioinformatics, computational biology, and related fields can train on or extend these models using publicly available code and data. This supports curriculum development in AI applications to life sciences and encourages collaborative projects across institutions.
Technical Accessibility and Infrastructure
Optimized kernels developed in collaboration with NVIDIA enable efficient inference on standard hardware. Researchers with moderate compute resources can process substantial portions of the atlas or generate new predictions without specialized supercomputing facilities.
The full suite (ESMC, ESMFold2, and the ESM Atlas) is hosted on the Biohub Platform and is free to access. A preprint detailing the methods and results accompanies the release, providing transparency for peer review and further development.
Future Directions in Protein Biology Modeling
This release represents progress toward comprehensive world models of protein biology that integrate sequence, structure, and function. Continued scaling and refinement could enable programmable approaches to designing molecular tools for disease prevention and treatment.
Broader adoption may accelerate annotation of the protein universe, revealing functional insights from the vast majority of sequences that remain uncharacterized. Open ecosystems like this one promote reproducibility and innovation across the scientific community.
Broader Context in AI and Life Sciences
Protein structure prediction has evolved rapidly since the introduction of the AlphaFold systems, which mapped nearly 200 million structures. At 1.1 billion predicted structures, the ESM Atlas is roughly five times that size, and it emphasizes openness and metagenomic diversity.
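The scale comparison can be checked with simple arithmetic. The sketch below uses only the figures quoted in this article; "nearly 200 million" is rounded to 200 million, so the ratios are approximate.

```python
# Figures as quoted in the article. The AlphaFold count is rounded from
# "nearly 200 million", so the resulting ratios are approximate.
ATLAS_STRUCTURES = 1_100_000_000
ATLAS_SEQUENCES = 6_800_000_000
ALPHAFOLD_STRUCTURES = 200_000_000

# How many times larger the atlas is than the AlphaFold structure count.
scale_vs_alphafold = ATLAS_STRUCTURES / ALPHAFOLD_STRUCTURES

# Share of atlas sequences that have a predicted structure.
structure_coverage = ATLAS_STRUCTURES / ATLAS_SEQUENCES

if __name__ == "__main__":
    print(f"Atlas vs. AlphaFold structures: {scale_vs_alphafold:.1f}x")
    print(f"Sequences with a predicted structure: {structure_coverage:.1%}")
```

On these rounded figures, the atlas holds about 5.5 times as many structures, and about one in six of its sequences has a predicted structure.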
Such resources complement experimental efforts and support hybrid approaches where AI predictions guide targeted laboratory validation. The emphasis on evolutionary patterns learned from billions of sequences underscores the value of large-scale data in uncovering biological principles.
