ipd / proteinmpnn

Model Overview

Description:

ProteinMPNN (Protein Message Passing Neural Network) is a deep learning-based graph neural network designed to predict amino acid sequences for given protein backbones. This network leverages evolutionary, functional, and structural information to generate sequences that are likely to fold into the desired 3D structures.

This model is available for commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case.

License / Terms of Use

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Service Terms of Use. Use of this model is governed by the NVIDIA Community Model License. Additional Information: MIT.

You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.

Deployment Geography

Global

Use Case

ProteinMPNN enables researchers and commercial entities in the Drug Discovery, Life Sciences, and Protein Engineering fields to design amino acid sequences that fold into desired 3D protein structures. It is particularly useful for de novo protein design, enzyme engineering, and therapeutic protein development.

Release Date

build.nvidia.com: August 13, 2025 via build.nvidia.com/ipd/proteinmpnn

NGC: August 13, 2025 via catalog.ngc.nvidia.com

References:

@article{dauparas2022robust,
  title={Robust deep learning--based protein sequence design using ProteinMPNN},
  author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
  journal={Science},
  volume={378},
  number={6615},  
  pages={49--56},
  year={2022},
  publisher={American Association for the Advancement of Science}
}

Model Architecture:

Architecture Type: Protein Amino Acid Sequence Prediction

Network Architecture: ProteinMPNN

Input:

Input Type(s): Protein in Protein Data Bank (PDB) format

Input Format(s): String

Input Parameters: One-Dimensional (1D)

Other Properties Related to Input: Accepts protein backbone structures in PDB format.
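
As a rough illustration of what a backbone structure in PDB format carries, the sketch below pulls the backbone atoms (N, CA, C, O) out of a PDB-format string using the fixed-column layout of ATOM records. The function name `backbone_atoms` and the sample coordinates are hypothetical and chosen only for this example. This is not the model's actual preprocessing.

```python
# Backbone atom names kept by this sketch (an assumption for illustration).
BACKBONE = {"N", "CA", "C", "O"}


def backbone_atoms(pdb_text):
    """Return (chain, residue_number, atom_name, (x, y, z)) for backbone atoms."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, TER, END, and header records
        name = line[12:16].strip()  # columns 13-16: atom name
        if name not in BACKBONE:
            continue  # drop side-chain atoms such as CB
        chain = line[21]  # column 22: chain identifier
        resseq = int(line[22:26])  # columns 23-26: residue sequence number
        xyz = (
            float(line[30:38]),  # columns 31-38: x
            float(line[38:46]),  # columns 39-46: y
            float(line[46:54]),  # columns 47-54: z
        )
        atoms.append((chain, resseq, name, xyz))
    return atoms


# Made-up single-residue input; the coordinates are not from a real structure.
SAMPLE_PDB = """\
ATOM      1  N   MET A   1      11.104  13.207   2.100
ATOM      2  CA  MET A   1      12.560  13.300   2.400
ATOM      3  C   MET A   1      13.100  14.700   2.100
ATOM      4  O   MET A   1      12.400  15.700   2.200
ATOM      5  CB  MET A   1      13.300  12.200   1.600
END
"""

for chain, resseq, name, xyz in backbone_atoms(SAMPLE_PDB):
    print(chain, resseq, name, xyz)
```

The side-chain CB atom is dropped, which reflects the idea that sequence design starts from backbone geometry alone.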

Output:

Output Type(s): Amino Acid Sequence

Output Format: Multi-FASTA (text file)

Output Parameters: One-Dimensional (1D)

Other Properties Related to Output: Generates sequences predicted to fold into the input backbone structure.
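
Because the output is Multi-FASTA text, downstream code typically needs to split it into individual designed sequences. The sketch below is a minimal parser, assuming only the generic FASTA convention that header lines start with `>`. The header names and sequences in `EXAMPLE_OUTPUT` are invented for illustration, and the header content the service actually emits is not described here.

```python
def parse_multi_fasta(text):
    """Return a list of (header, sequence) pairs from Multi-FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # start a new record
        elif header is not None:
            chunks.append(line)  # sequence may wrap over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical output; names and sequences are made up for this example.
EXAMPLE_OUTPUT = """\
>design_1
MKTAYIAKQR
QISFVKSHFS
>design_2
MKSAYLAKQRELSFVRSHFT
"""

for header, sequence in parse_multi_fasta(EXAMPLE_OUTPUT):
    print(header, len(sequence), sequence)
```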

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

Triton

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere, NVIDIA Hopper, NVIDIA Ada Lovelace, NVIDIA Blackwell

Supported Operating System(s):

Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

ProteinMPNN 1.0.0

Training & Evaluation:

Training Dataset:

Data Modality:

Other: 3D Protein Structures

Link: The Protein Data Bank

Data Collection Method by dataset:

Hybrid: Automatic, Human

For the PDB dataset, scientists worldwide submit structural data determined by X-ray crystallography or cryo-electron microscopy (cryo-EM). This includes atomic coordinates, experimental data, and metadata about the biological macromolecules.

Labeling Method by dataset:

Hybrid: Automatic, Human

For the PDB dataset, expert biocurators review the submitted data to ensure accuracy and completeness. This involves checking the plausibility of the data and annotating it with relevant biological and chemical information.

The CATH 4.2 dataset is derived from the PDB dataset. The CATH (Class, Architecture, Topology, Homologous superfamily) database hierarchically classifies protein domain structures obtained from protein structures deposited in the PDB. The data in CATH are sourced specifically from PDB files and include structures determined at a resolution of 4 angstroms or better. The classification process uses both manual and automated methods to ensure accurate domain identification and classification.

For ProteinMPNN, the data underwent quality filtering that removed low-resolution structures and structures with potential errors.

Properties:
The model was trained by the Institute for Protein Design. The training dataset consisted of 23,358 sequences. Dataset: CATH 4.2, PDB. Sensors: X-ray crystallography, cryo-EM.

Evaluation Dataset:

Link: The Protein Data Bank

Data Collection Method by dataset:

Automatic: random splits from the PDB dataset.

Labeling Method by dataset:

Automatic: random splits from the PDB dataset.

Properties:
The training, validation, and test splits were derived from protein assemblies in the PDB, which includes structures determined by X-ray crystallography or cryo-electron microscopy (cryo-EM). The dataset was divided at random into 23,358 sequences for training, 1,464 for validation, and 1,529 for testing.
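
To put the split sizes in proportion, the short sketch below computes the share of each split from the counts stated above. The helper name `split_fractions` is hypothetical; only the three counts come from this card. The result works out to roughly 89% training, 6% validation, and 6% test.

```python
# Split sizes as stated in this model card.
SPLITS = {"train": 23358, "validation": 1464, "test": 1529}


def split_fractions(splits):
    """Return the total count and each split's fraction of that total."""
    total = sum(splits.values())
    return total, {name: count / total for name, count in splits.items()}


total, fractions = split_fractions(SPLITS)
print(f"total: {total}")
for name, fraction in fractions.items():
    print(f"{name}: {SPLITS[name]} ({fraction:.1%})")
```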

Inference:

Acceleration Engine: Triton

Test Hardware:

NVIDIA H100
NVIDIA A100
NVIDIA L40S

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

You are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated, and comply with applicable safety regulations and ethical standards.

Get Help

Enterprise Support

Get access to knowledge base articles and support cases or submit a ticket.