nvidia / genmol

Model Overview

Description:

GenMol v2.0 (NV-GenMol-89M-v2) is a masked diffusion model [1] trained on molecular Sequential Attachment-based Fragment Embedding (SAFE) representations [2] for fragment-based molecule generation. It can serve as a generalist model for various drug discovery tasks, including de novo generation, linker design, motif extension, scaffold decoration/morphing, hit generation, and lead optimization. NV-GenMol-89M-v2 was developed by NVIDIA as part of the BioNeMo NIM family.

This model is ready for commercial use.

License/Terms of Use:

Use of this model is governed by the NVIDIA Open Model License.

You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.

Deployment Geography:

Global

Use Case:

GenMol is intended for computational chemists, drug discovery researchers, and AI/ML scientists performing fragment-based molecular generation tasks including de novo generation, linker design, motif extension, scaffold decoration/morphing, hit generation, and lead optimization.

Release Date:

References:

@misc{sahoo2024simpleeffectivemaskeddiffusion,
      title={Simple and Effective Masked Diffusion Language Models}, 
      author={Subham Sekhar Sahoo and Marianne Arriola and Yair Schiff and Aaron Gokaslan and Edgar Marroquin and Justin T Chiu and Alexander Rush and Volodymyr Kuleshov},
      year={2024},
      eprint={2406.07524},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.07524}, 
}
@misc{noutahi2023gottasafenewframework,
      title={Gotta be SAFE: A New Framework for Molecular Design}, 
      author={Emmanuel Noutahi and Cristian Gabellini and Michael Craig and Jonathan S. C Lim and Prudencio Tossou},
      year={2023},
      eprint={2310.10773},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2310.10773}, 
}

Model Architecture:

Architecture Type: Transformer

Network Architecture: BERT

Number of model parameters: 89M

Input(s):

Input Type(s): Text (Molecular Sequence), Number (Molecules to generate, temperature scaling factor, noise scaling factor), Enumeration (Scoring method), Binary (Showing unique molecules only)

Input Format(s): Text: String (Sequential Attachment-based Fragment Embedding (SAFE)); Number: Integer, FP32; Enumeration: String (QED, LogP); Binary: Boolean

Input Parameters: 1D

Other Properties Related to Input: Maximum input length is 512 tokens.
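The input types above can be sketched as a small typed container. This is an illustrative sketch only: the class name, field names, and default values are assumptions and are not taken from the GenMol API; only the types (string, integer, float, enumeration of QED/LogP, boolean) come from this card.

```python
from dataclasses import dataclass
from enum import Enum


class Scoring(str, Enum):
    """Scoring methods listed under Input Format(s)."""
    QED = "QED"
    LOGP = "LogP"


@dataclass
class GenerationRequest:
    # Hypothetical field names and defaults; only the types follow the card.
    safe: str                       # Text: SAFE string
    num_molecules: int = 10         # Number: Integer
    temperature: float = 1.0        # Number: FP32 (temperature scaling factor)
    noise: float = 1.0              # Number: FP32 (noise scaling factor)
    scoring: Scoring = Scoring.QED  # Enumeration: QED or LogP
    unique: bool = False            # Binary: show unique molecules only

    def __post_init__(self):
        if not isinstance(self.safe, str) or not self.safe:
            raise ValueError("safe must be a non-empty string")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(self.num_molecules, bool) or not isinstance(self.num_molecules, int):
            raise ValueError("num_molecules must be an integer")
        if self.num_molecules < 1:
            raise ValueError("num_molecules must be at least 1")
        self.temperature = float(self.temperature)
        self.noise = float(self.noise)
        # Accept either a Scoring member or its string value;
        # an unknown value raises ValueError.
        self.scoring = Scoring(self.scoring)
        if not isinstance(self.unique, bool):
            raise ValueError("unique must be a boolean")
```

A request could then be built as `GenerationRequest(safe="c1ccccc1", scoring="LogP")`, where the string is a placeholder rather than a verified SAFE input. The 512-token limit is not checked here because it depends on the model's tokenizer.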

Output(s):

Output Type(s): Text (List of molecule sequences), Number (List of scores)

Output Format(s): Text: Array of string (Sequential Attachment-based Fragment Embedding (SAFE)); Number: Array of FP32 (Scores)

Output Parameters: 2D

Other Properties Related to Output: Maximum output length is 512 tokens.
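Because the output is a list of molecule strings with a parallel list of scores, a common post-processing step is to pair and rank them. The helper below is a minimal sketch under assumptions: the function name is hypothetical, and "higher score is better" is assumed, which may not match every scoring method or objective.

```python
def rank_molecules(molecules, scores, unique=True):
    """Pair each molecule with its score and sort by score, highest first.

    Assumes the two lists are parallel (same length, same order).
    With unique=True, duplicate molecules keep only their best score.
    """
    if len(molecules) != len(scores):
        raise ValueError("molecules and scores must have the same length")

    pairs = list(zip(molecules, scores))
    if unique:
        best = {}
        for mol, score in pairs:
            if mol not in best or score > best[mol]:
                best[mol] = score
        pairs = list(best.items())

    return sorted(pairs, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_molecules(["A", "B", "A"], [0.2, 0.9, 0.5])` keeps the better of the two scores for `"A"` and places `"B"` first.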

Software Integration:

Runtime Engine(s):

  • PyTorch >= 2.5.1
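The version requirement above can be checked before loading the model. The following is a minimal sketch using only the standard library; the function name is hypothetical, and pre-release or local suffixes (such as `+cu124`) are simply ignored rather than compared.

```python
import re


def meets_minimum(version: str, minimum=(2, 5, 1)) -> bool:
    """Return True if a version string is at least `minimum`.

    Only the leading numeric major.minor[.patch] part is compared;
    anything after it (e.g. "+cu124", "a0+git...") is ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(g) if g is not None else 0 for g in match.groups())
    return parts >= tuple(minimum)
```

With PyTorch installed, this could be called as `meets_minimum(torch.__version__)`.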

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Lovelace
  • NVIDIA Hopper
  • NVIDIA Grace Hopper
  • NVIDIA Blackwell
  • NVIDIA Grace Blackwell

Supported Operating System(s):

  • Linux

Model Version(s):

GenMol v2.0

Training, Testing, and Evaluation Datasets:

Training Dataset:

Link: SAFE-GPT GitHub, HuggingFace

Data Collection Method by dataset: Automated

Labeling Method by dataset: Automated

Properties: 1.1B SAFE strings covering various molecule types (drug-like compounds, peptides, multi-fragment molecules, polymers, reagents, and non-small molecules).

Testing Dataset:

Link: SAFE-GPT GitHub, HuggingFace

Data Collection Method by dataset: Automated

Labeling Method by dataset: Automated

Properties: 1.1B SAFE strings covering various molecule types (drug-like compounds, peptides, multi-fragment molecules, polymers, reagents, and non-small molecules).

Evaluation Dataset:

Link: SAFE-DRUGS GitHub, HuggingFace

Data Collection Method by dataset: Not Applicable

Labeling Method by dataset: Not Applicable

Properties: SAFE-DRUGS consists of 26 known therapeutic drugs.

Inference:

Acceleration Engine: PyTorch

Test Hardware: NVIDIA RTX 6000 Ada, NVIDIA A10G, NVIDIA A100, NVIDIA L40S, NVIDIA H100, NVIDIA H200, NVIDIA GH200, NVIDIA B200, NVIDIA GB200, NVIDIA B300, NVIDIA GB300, NVIDIA RTX 6000 Blackwell Workstation, NVIDIA DGX Spark

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Users are responsible for ensuring the physical properties of model-generated molecules are appropriately evaluated and comply with applicable safety regulations and ethical standards.

For more detailed information on ethical considerations for this model, see the Model Card++ subcards for Bias, Explainability, Safety & Security, and Privacy.

Please report security vulnerabilities or NVIDIA AI Concerns here.
