mit / diffdock

Model Overview

Description:

DiffDock is a generative diffusion model for drug discovery in molecular blind docking.

DiffDock consists of two models: the Score and Confidence models. The Score model generates a series of potential poses for protein-ligand binding by running a reverse diffusion process.

DiffDock does not require any information about a binding pocket. During its diffusion process, the molecule's position relative to the protein, its orientation, and the torsion angles are allowed to change. Running the learned reverse diffusion process transforms a distribution of noisy prior molecule poses to the one learned by the model. As a result, it outputs many sampled poses and ranks them via its confidence model.
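To make the quantities that change during diffusion concrete, here is a minimal sketch of a container for them. The class name `LigandPose`, its field names, and the rotation-vector representation are assumptions chosen for illustration; they are not taken from the DiffDock code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LigandPose:
    """Hypothetical container for the quantities the diffusion process updates."""

    # Ligand position relative to the protein
    translation: Tuple[float, float, float]
    # Rigid-body orientation, stored here as a rotation vector (an assumption)
    rotation: Tuple[float, float, float]
    # One torsion angle per rotatable bond
    torsions: List[float] = field(default_factory=list)

    def degrees_of_freedom(self) -> int:
        # 3 translational + 3 rotational + one per torsion angle
        return len(self.translation) + len(self.rotation) + len(self.torsions)


# A ligand with four rotatable bonds has 3 + 3 + 4 = 10 degrees of freedom.
pose = LigandPose((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), [0.5, -1.2, 2.0, 0.1])
print(pose.degrees_of_freedom())  # 10
```

Under this view, each sampled pose is one point in a space whose size grows with the number of rotatable bonds in the ligand.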

DiffDock v2.0 uses the same neural-network architecture as the original DiffDock from MIT and is trained by NVIDIA on PLINDER, a state-of-the-art dataset of well-curated, labeled protein-ligand complexes. Training on PLINDER yields substantially higher accuracy on molecular docking tasks.

This model is ready for commercial and non-commercial use.

License/Terms of Use:

This model is released under the MIT License.

References:

@article {Durairaj2024.07.17.603955,
	author = {Durairaj, Janani and Adeshina, Yusuf and Cao, Zhonglin and Zhang, Xuejin and Oleinikovas, Vladas and Duignan, Thomas and McClure, Zachary and Robin, Xavier and Studer, Gabriel and Kovtun, Daniel and Rossi, Emanuele and Zhou, Guoqing and Veccham, Srimukh and Isert, Clemens and Peng, Yuxing and Sundareson, Prabindh and Akdel, Mehmet and Corso, Gabriele and St{\"a}rk, Hannes and Tauriello, Gerardo and Carpenter, Zachary and Bronstein, Michael and Kucukbenli, Emine and Schwede, Torsten and Naef, Luca},
	title = {PLINDER: The protein-ligand interactions dataset and evaluation resource},
	elocation-id = {2024.07.17.603955},
	year = {2024},
	doi = {10.1101/2024.07.17.603955},
	publisher = {Cold Spring Harbor Laboratory},
	abstract = {Protein-ligand interactions (PLI) are foundational to small molecule drug design. With computational methods striving towards experimental accuracy, there is a critical demand for a well-curated and diverse PLI dataset. Existing datasets are often limited in size and diversity, and commonly used evaluation sets suffer from training information leakage, hindering the realistic assessment of method generalization capabilities. To address these shortcomings, we present PLINDER, the largest and most annotated dataset to date, comprising 449,383 PLI systems, each with over 500 annotations, similarity metrics at protein, pocket, interaction and ligand levels, and paired unbound (apo) and predicted structures. We propose an approach to generate training and evaluation splits that minimizes task-specific leakage and maximizes test set quality, and compare the resulting performance of DiffDock when retrained with different kinds of splits.},
	URL = {https://www.biorxiv.org/content/early/2024/07/19/2024.07.17.603955.1},
	eprint = {https://www.biorxiv.org/content/early/2024/07/19/2024.07.17.603955.1.full.pdf},
	journal = {bioRxiv}
}
@article{corso2023diffdock,
      title={DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking}, 
      author = {Corso, Gabriele and Stärk, Hannes and Jing, Bowen and Barzilay, Regina and Jaakkola, Tommi},
      journal={International Conference on Learning Representations (ICLR)},
      year={2023}
}

Model Architecture:

Architecture Type: Score-Based Diffusion Model (SBDM)

Network Architecture: Graph Convolution Neural Network

The Score model is a 3D-equivariant graph neural network with three components: an embedding layer, an interaction block of 6 graph convolution layers, and an output layer. In total, the Score model has 20M parameters.

Input:

Input Type(s): Text (Ligand, Protein), Number (Poses to Generate, Batch Size, Diffusion Steps, Diffusion Time Divisions), Binary (No Final Step Noise, Save Diffusion Trajectory, Skip Gen Conformer)

Input Format(s): Text: String (SMILES, Structural Data File (SDF), or Tripos molecule structure (Mol2) for Ligand; Protein Data Bank (PDB) for Protein); Number: Integer; Binary: Boolean

Input Parameters: 1D

Other Properties Related to Input: No maximum sequence length
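The input types above can be illustrated with a small validation helper. This is a sketch under stated assumptions: the function `build_docking_request`, its key names, and the "positive integer" constraint are hypothetical and do not describe the service's actual request schema. The example values are placeholders, not recommended settings.

```python
from typing import Any, Dict


def build_docking_request(
    ligand: str,
    protein: str,
    *,
    num_poses: int,
    batch_size: int,
    diffusion_steps: int,
    diffusion_time_divisions: int,
    no_final_step_noise: bool,
    save_trajectory: bool,
    skip_gen_conformer: bool,
) -> Dict[str, Any]:
    """Check each input against its listed type and return a plain dict."""
    texts = {"ligand": ligand, "protein": protein}
    numbers = {
        "num_poses": num_poses,
        "batch_size": batch_size,
        "diffusion_steps": diffusion_steps,
        "diffusion_time_divisions": diffusion_time_divisions,
    }
    flags = {
        "no_final_step_noise": no_final_step_noise,
        "save_trajectory": save_trajectory,
        "skip_gen_conformer": skip_gen_conformer,
    }
    for name, value in texts.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be a non-empty string")
    for name, value in numbers.items():
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer")
    for name, value in flags.items():
        if not isinstance(value, bool):
            raise ValueError(f"{name} must be a boolean")
    return {**texts, **numbers, **flags}


# Illustrative values only: "CCO" is the SMILES string for ethanol, and the
# protein text stands in for the contents of a PDB file.
request = build_docking_request(
    "CCO",
    "HEADER    PLACEHOLDER PDB CONTENT",
    num_poses=5,
    batch_size=5,
    diffusion_steps=10,
    diffusion_time_divisions=10,
    no_final_step_noise=True,
    save_trajectory=False,
    skip_gen_conformer=False,
)
```

Keeping the numeric and Boolean settings keyword-only avoids silently swapping two integers that share a type.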

Output:

Output Type(s): Text (Ligand Molecule 3D Positions, 3D), Text (Ligand Molecule 3D Positions, 3D), Number (List of Confidence Scores, 1D)

Output Format: Text: Structural Data File (SDF); Text: Protein Data Bank (PDB); Number: Array of 32-bit floating-point values

Output Parameters: docked_ligand, visualizations_files, pose_confidence
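As a minimal sketch of how the outputs might be consumed, the helper below pairs each docked pose with its confidence score and sorts the pairs. It assumes that `docked_ligand` and `pose_confidence` are parallel lists of equal length and that a higher confidence score indicates a better pose. Both are assumptions, as is the helper name `rank_poses`; the SDF strings are placeholders.

```python
from typing import Any, Dict, List, Tuple


def rank_poses(response: Dict[str, Any]) -> List[Tuple[float, str]]:
    """Return (confidence, pose) pairs sorted from highest to lowest confidence."""
    poses = response["docked_ligand"]
    scores = response["pose_confidence"]
    if len(poses) != len(scores):
        raise ValueError("docked_ligand and pose_confidence differ in length")
    # Sort on the score only, so ties keep their original order
    return sorted(zip(scores, poses), key=lambda pair: pair[0], reverse=True)


# Hypothetical response with placeholder SDF text for three poses.
example_response = {
    "docked_ligand": ["<sdf pose A>", "<sdf pose B>", "<sdf pose C>"],
    "pose_confidence": [-1.5, 0.25, -0.5],
}
ranked = rank_poses(example_response)
best_score, best_pose = ranked[0]
print(best_pose, best_score)  # <sdf pose B> 0.25
```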

Software Integration:

Runtime Engine(s):

  • Triton

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Ada Lovelace
  • NVIDIA Hopper
  • NVIDIA Grace Hopper

Supported Operating System(s):

  • Linux

Model Version(s):

DiffDock v2.0

Training & Evaluation Dataset:

Training:

Link: PLINDER


Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Hybrid: Human & Automated

Properties (Quantity, Dataset Descriptions, Sensor(s)): Approximately 450,000 protein-ligand complexes, automatically curated from the PDB database. See https://www.biorxiv.org/content/10.1101/2024.07.17.603955v3.

Evaluation:

Link: PoseBusters benchmark (PDB) set


Data Collection Method by dataset:

  • Human

Labeling Method by dataset:

  • Hybrid: Human & Automated

Properties (Quantity, Dataset Descriptions, Sensor(s)): 428 protein-ligand complexes manually curated using the PDB database. See https://arxiv.org/abs/2308.05777v1.

Inference:

Engine: Triton

Test Hardware:

  • A6000, A100, L40, L40S, H100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.

Please report security vulnerabilities or NVIDIA AI Concerns here.