Open-Source ESMFold2 Model Maps Shapes of 1.1 Billion Proteins

Biohub Release Expands Protein Universe with State-of-the-Art AI Tools

Breakthrough in AI-Driven Protein Analysis

The release of ESMFold2 marks a significant advance in computational biology. Researchers at the Chan Zuckerberg Biohub unveiled the model on May 27, 2026, alongside an atlas of 1.1 billion predicted protein structures derived from 6.8 billion protein sequences. This open-source resource substantially expands the catalog of predicted protein structures beyond previous efforts.

Understanding Protein Structure Prediction

Proteins perform essential functions in living organisms, and their three-dimensional shapes determine how they interact with other molecules. Predicting these shapes from amino acid sequences has long challenged scientists. Traditional experimental methods like X-ray crystallography and cryo-electron microscopy provide accurate results but require significant time and resources. AI models now accelerate this process by learning patterns from vast datasets of known structures and sequences.

ESMFold2 builds on protein language models, which treat amino acid sequences much as natural language processing treats sentences, with each residue playing the role of a word. These models capture evolutionary relationships across billions of sequences, so in many cases atomic-level structures can be inferred directly, without multiple sequence alignments.
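As a rough illustration of treating residues like words, the sketch below maps each amino acid letter to an integer token, which is the kind of input a language model consumes. The vocabulary ordering, the special tokens, and the function name `tokenize` are assumptions made for this example; they are not taken from ESMC or ESMFold2.

```python
from typing import List

# The 20 standard amino acids in one-letter code; the ordering is arbitrary.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical special tokens; real models define their own vocabulary.
SPECIAL_TOKENS = ["<pad>", "<cls>", "<eos>", "<unk>"]

# Token string -> integer ID, special tokens first.
VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + list(AMINO_ACIDS))}


def tokenize(sequence: str) -> List[int]:
    """Convert a protein sequence into token IDs, one token per residue."""
    ids = [VOCAB["<cls>"]]
    for residue in sequence.upper():
        # Letters outside the 20 standard residues fall back to <unk>.
        ids.append(VOCAB.get(residue, VOCAB["<unk>"]))
    ids.append(VOCAB["<eos>"])
    return ids


if __name__ == "__main__":
    print(tokenize("ACD"))
```

A model trained on such token streams sees proteins the way a text model sees sentences, which is what lets it learn statistical regularities across very large sequence collections.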

Development of ESM Models at Scale

The foundation for ESMFold2 traces back to earlier work on Evolutionary Scale Modeling. In 2023, a team at Meta AI described ESM-2, a language model with up to 15 billion parameters, trained on hundreds of millions of protein sequences. Scaling revealed that structural information emerges naturally in the model's representations, which led to ESMFold for rapid structure prediction.

Biohub recruited key members of the EvolutionaryScale team approximately seven months prior to the May 2026 release. The new system incorporates ESMC, a language model trained on approximately 2.8 billion sequences spanning diverse life forms, including extremophiles and over 20,000 human protein types. ESMFold2 then translates these representations into precise three-dimensional models of proteins and their complexes.

Key Features of the ESM Atlas

The ESM Atlas organizes 6.8 billion sequences and 1.1 billion predicted structures using relationships learned by ESMC. This approach surfaces connections not captured in existing databases, such as evolutionary links between CRISPR-associated defense proteins and a gene-editing protein identified in a soil fungus in 2023, now observed across other eukaryotic species.

Most sequences originate from metagenomic sources in environments like soil and oceans, many previously uncharacterized. The atlas makes this unannotated biology searchable, supporting researchers studying diseases with limited prior molecular understanding.
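One common way to make a collection searchable by learned relationships is to compare embedding vectors and return the nearest neighbors of a query. The sketch below shows that general idea with cosine similarity over a toy index. The vectors are made-up numbers, the protein names are placeholders, and nothing here reflects how the ESM Atlas is actually indexed or queried.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(
    query: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, best match first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; real model embeddings have far more dimensions.
toy_index = {
    "protein_a": [0.9, 0.1, 0.0],
    "protein_b": [0.8, 0.2, 0.1],
    "protein_c": [0.0, 0.1, 0.9],
}

query = [1.0, 0.0, 0.0]
hits = nearest_neighbors(query, toy_index, k=2)

if __name__ == "__main__":
    for name, score in hits:
        print(name, round(score, 3))
```

A brute-force scan like this would not be practical for billions of entries, where approximate nearest-neighbor indexes are the usual choice, but the comparison being made is the same.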

Performance Benchmarks and Comparisons

Biohub reports that ESMFold2 achieves state-of-the-art accuracy, particularly in predicting protein-protein interactions and antibody-antigen complexes. In Biohub's benchmarks, it compared favorably with AlphaFold 3 from Google DeepMind, Chai-1, and Boltz-1. The model performs well on challenging tasks while remaining efficient enough for large-scale applications.

Unlike some proprietary systems, ESMFold2 is released as fully open source under the MIT license. Researchers anywhere can inspect, modify, and build on the code without the restrictions common to closed models.

Practical Applications in Research and Design

Early uses include designing high-affinity protein binders targeting five disease-related proteins: EGFR and PDGFRβ in cancer pathways, PD-L1 and CTLA-4 as immune checkpoints, and CD45 in immune signaling. Laboratory validation showed a high success rate for these computationally designed molecules.

The atlas supports discovery of novel biology by enabling searches for structural similarities across distant evolutionary branches. Scientists can now explore metagenomic proteins at unprecedented scale, potentially identifying new enzymes or regulatory mechanisms relevant to biotechnology and medicine.

Implications for Academic and Research Communities

University laboratories and independent researchers gain immediate access to tools previously limited by computational barriers or licensing. The open release facilitates integration into existing workflows for structural biology, drug discovery, and synthetic biology programs.

PhD students and postdoctoral researchers in bioinformatics, computational biology, and related fields can train on or extend these models using publicly available code and data. This supports curriculum development in AI applications to life sciences and encourages collaborative projects across institutions.

Technical Accessibility and Infrastructure

Optimized kernels developed in collaboration with NVIDIA enable efficient inference on standard hardware. Researchers with moderate compute resources can process substantial portions of the atlas or generate new predictions without specialized supercomputing facilities.

The full suite—ESMC, ESMFold2, and the ESM Atlas—is hosted on the Biohub Platform for free access. A preprint detailing the methods and results accompanies the release, providing transparency for peer review and further development.

Future Directions in Protein Biology Modeling

This release represents progress toward comprehensive world models of protein biology that integrate sequence, structure, and function. Continued scaling and refinement could enable programmable approaches to designing molecular tools for disease prevention and treatment.

Broader adoption may accelerate annotation of the protein universe, revealing functional insights from the vast majority of sequences that remain uncharacterized. Open ecosystems like this one promote reproducibility and innovation across the scientific community.

Broader Context in AI and Life Sciences

Protein structure prediction has evolved rapidly since the introduction of the AlphaFold systems, which mapped nearly 200 million structures. At 1.1 billion predicted structures, the ESM Atlas is more than five times that size, and it emphasizes openness and metagenomic diversity.
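The scale comparison can be checked with quick arithmetic using only the figures quoted in this article. The AlphaFold count is rounded up to 200 million here, so the computed ratio is a lower bound; the variable names are chosen for this example.

```python
# Figures as quoted in the article.
atlas_structures = 1.1e9       # predicted structures in the ESM Atlas
atlas_sequences = 6.8e9        # sequences in the ESM Atlas
alphafold_structures = 200e6   # "nearly 200 million", rounded up

# How many times larger the atlas is than the AlphaFold figure.
scale_ratio = atlas_structures / alphafold_structures

# Fraction of atlas sequences that have a predicted structure.
coverage = atlas_structures / atlas_sequences

if __name__ == "__main__":
    print(f"Atlas vs. AlphaFold structures: {scale_ratio:.1f}x")
    print(f"Sequences with a predicted structure: {coverage:.1%}")
```

By these numbers the atlas holds at least 5.5 times as many predicted structures, and roughly one in six of its sequences has a predicted structure.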

Such resources complement experimental efforts and support hybrid approaches where AI predictions guide targeted laboratory validation. The emphasis on evolutionary patterns learned from billions of sequences underscores the value of large-scale data in uncovering biological principles.

Frequently Asked Questions

What is ESMFold2 and how does it work?

ESMFold2 is an open-source AI model developed by researchers at Chan Zuckerberg Biohub that predicts atomic-resolution protein structures directly from amino acid sequences. It leverages representations from the ESMC protein language model trained on billions of sequences across the tree of life.

How many proteins does the ESM Atlas cover?

The ESM Atlas includes predictions for 1.1 billion protein structures based on 6.8 billion sequences, primarily from metagenomic sources, making it the largest such resource released to date.

Is ESMFold2 better than AlphaFold 3?

Biohub's benchmarks indicate that ESMFold2 compares favorably, particularly on protein-protein interactions and antibody-antigen complexes, while offering full open-source access under the MIT license.

When was the ESMFold2 release announced?

The models and atlas were released on May 27, 2026, by the Chan Zuckerberg Biohub in San Francisco.

Where can researchers access ESMFold2 and the atlas?

All components are freely available on the Biohub Platform under an MIT license, supporting immediate use in academic and industrial research settings.

What are the main applications of the ESM Atlas?

The atlas enables discovery of evolutionary connections, annotation of uncharacterized proteins, and design of novel binders for disease targets such as cancer and immune pathways.

How does open-source access benefit universities?

Open release allows labs worldwide to integrate the tools into teaching, research projects, and student training without licensing barriers, fostering collaboration in structural biology and AI.

What is ESMC in the context of this release?

ESMC is the underlying protein language model trained on approximately 2.8 billion sequences that provides the evolutionary representations used by ESMFold2 for structure prediction.

Does the model require multiple sequence alignments?

ESMFold2 can generate predictions directly from single sequences in many cases, offering speed advantages while maintaining high accuracy for diverse proteins.

What future impact is expected from this release?

The open ecosystem supports accelerated discovery in fundamental biology, therapeutic design, and the development of programmable molecular tools across global research communities.