Model Overview
Description:
ProteinMPNN (Protein Message Passing Neural Network) is a deep learning-based
graph neural network designed to predict amino acid sequences for given protein
backbones. This network leverages evolutionary, functional, and structural
information to generate sequences that are likely to fold into the desired 3D
structures.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed
and built to a third-party’s requirements for this application and use case; see
link to Non-NVIDIA Model Card.
References:
@article{dauparas2022robust,
title={Robust deep learning--based protein sequence design using ProteinMPNN},
author={Dauparas, Justas and Anishchenko, Ivan and Bennett, Nathaniel and Bai, Hua and Ragotte, Robert J and Milles, Lukas F and Wicky, Basile IM and Courbet, Alexis and de Haas, Rob J and Bethel, Neville and others},
journal={Science},
volume={378},
number={6615},
pages={49--56},
year={2022},
publisher={American Association for the Advancement of Science}
}
Model Architecture:
Architecture Type: Protein Amino Acid Sequence Prediction
Network Architecture: ProteinMPNN
Input:
Input Type(s): Protein in Protein Data Bank (PDB) format
Input Format(s): String
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input:
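To illustrate what a PDB-format input string carries, the sketch below pulls the backbone atom coordinates (N, CA, C, O) out of the fixed-column ATOM records. It is a minimal illustration and not the model's own preprocessing: the function name `backbone_coords` is hypothetical, and the column offsets follow the standard PDB ATOM record layout.

```python
from typing import Dict, Tuple

Coord = Tuple[float, float, float]
ResidueKey = Tuple[str, int]  # (chain ID, residue sequence number)

# Backbone atoms assumed relevant for backbone-conditioned sequence design.
BACKBONE_ATOMS = ("N", "CA", "C", "O")


def backbone_coords(pdb_text: str) -> Dict[ResidueKey, Dict[str, Coord]]:
    """Collect backbone atom coordinates per residue from PDB-format text.

    Hypothetical helper for illustration only; it ignores alternate
    locations, insertion codes, and HETATM records.
    """
    residues: Dict[ResidueKey, Dict[str, Coord]] = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()      # columns 13-16
        if atom_name not in BACKBONE_ATOMS:
            continue
        chain_id = line[21]                  # column 22
        res_seq = int(line[22:26])           # columns 23-26
        xyz = (
            float(line[30:38]),              # x, columns 31-38
            float(line[38:46]),              # y, columns 39-46
            float(line[46:54]),              # z, columns 47-54
        )
        residues.setdefault((chain_id, res_seq), {})[atom_name] = xyz
    return residues
```

Calling `backbone_coords(open("structure.pdb").read())` (the file name is a placeholder) returns a dictionary keyed by chain and residue number, which makes it easy to check that a structure has a complete backbone before submitting it.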
Output:
Output Type(s): Amino Acid Sequence
Output Format: Multi-FASTA (text file)
Output Parameters: 1D
Other Properties Related to Output:
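Because the output is a Multi-FASTA text, a client typically needs to split it into individual designed sequences. The following is a minimal sketch of such a parser using only the Python standard library; the function name `parse_multi_fasta` and the header strings in the usage note are hypothetical and do not reflect the exact headers the model emits.

```python
from typing import List, Tuple


def parse_multi_fasta(text: str) -> List[Tuple[str, str]]:
    """Split Multi-FASTA text into (header, sequence) pairs.

    Sequence lines belonging to one record are concatenated; blank lines
    and any text before the first '>' header are ignored.
    """
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

For example, `parse_multi_fasta(">design_1\nMKT\nAYI\n>design_2\nGSH\n")` yields two records, with the first sequence joined across its two lines.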
Software Integration:
Runtime Engine(s):
- Triton
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Lovelace
- NVIDIA Turing
Supported Operating System(s):
- Linux
Model Version(s):
ProteinMPNN 1.0.0
Training & Evaluation:
Training Dataset:
Link:
The Protein Data Bank
Data Collection Method by dataset:
- Hybrid: For the PDB dataset, scientists worldwide submit structural data
determined by X-ray crystallography or cryo-electron microscopy (cryo-EM).
This includes atomic coordinates, experimental data, and metadata about the
biological macromolecules.
Labeling Method by dataset:
- Hybrid: For the PDB dataset, expert biocurators review the submitted data to
ensure accuracy and completeness. This involves checking the plausibility of
the data and annotating it with relevant biological and chemical information.
The CATH 4.2 dataset is derived from the Protein Data Bank (PDB) dataset. The CATH
(Class, Architecture, Topology, Homologous superfamily) database
hierarchically classifies protein domain structures that are obtained from
protein structures deposited in the PDB. The data in CATH are specifically
sourced from PDB files and include structures determined at a resolution of 4
angstroms (Å) or better. The classification process involves both manual and
automated methods to ensure accurate domain identification and classification.
For ProteinMPNN, the data underwent quality filtering to ensure high accuracy;
this involved removing structures with low resolution and potential errors.
Properties (Quantity, Dataset Descriptions, Sensor(s)): The model was trained by the Institute for Protein Design. The training dataset consisted of 23,358 sequences. Datasets: CATH 4.2, PDB. Sensors: X-ray crystallography, cryo-EM.
Dataset License(s): CATH: CC BY 4.0, PDB: CC0 1.0.
Evaluation Dataset:
Data Collection Method by dataset:
- Automatic: random splits from PDB dataset.
Labeling Method by dataset:
- Automatic: random splits from PDB dataset.
The training, validation, and test splits were derived from protein assemblies
in the Protein Data Bank (PDB), which includes structures determined by X-ray
crystallography or cryo-electron microscopy (cryo-EM). The dataset was divided
into random splits with 23,358 sequences for training, 1,464 for validation, and
1,529 for testing.
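The split sizes above add up to 26,351 entries in total. As a rough sketch of how a random partition with exactly these sizes could be produced, the snippet below shuffles a list with a fixed seed and slices it. This is an assumption-laden illustration, not the procedure used by the model's authors: the function name `random_split`, the seed, and the plain shuffle are all hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Train / validation / test sizes as stated in the model card.
SPLIT_SIZES = (23_358, 1_464, 1_529)


def random_split(
    items: Sequence[T],
    sizes: Tuple[int, int, int] = SPLIT_SIZES,
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items reproducibly and cut them into three disjoint splits.

    Hypothetical helper: the seed and the simple shuffle are illustrative.
    """
    if sum(sizes) != len(items):
        raise ValueError("split sizes must sum to the number of items")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train, n_val, _ = sizes
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Fixing the seed keeps the partition reproducible across runs, so validation and test entries never leak into training when the split is regenerated.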
Inference:
Engine: Triton
Test Hardware:
- L40
- A100
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have
established policies and practices to enable development for a wide array of AI
applications. When downloaded or used in accordance with our terms of service,
developers should work with their supporting model team to ensure this model
meets requirements for the relevant industry and use case and addresses
unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.