nvidia / genmol

Model Overview

Description:

GenMol is a masked diffusion model¹ trained on molecular Sequential Attachment-based Fragment Embedding (SAFE) representations² for fragment-based molecule generation. It can serve as a generalist model for various drug discovery tasks, including de novo generation, linker design, motif extension, scaffold decoration/morphing, hit generation, and lead optimization.

This model is ready for commercial use.

License/Terms of Use:

This NIM is licensed under the NVIDIA AI Foundation Models Community License Agreement. By using this NIM, you accept the terms and conditions of this license. You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.

References:

@misc{sahoo2024simpleeffectivemaskeddiffusion,
      title={Simple and Effective Masked Diffusion Language Models}, 
      author={Subham Sekhar Sahoo and Marianne Arriola and Yair Schiff and Aaron Gokaslan and Edgar Marroquin and Justin T Chiu and Alexander Rush and Volodymyr Kuleshov},
      year={2024},
      eprint={2406.07524},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.07524}, 
}

@misc{noutahi2023gottasafenewframework,
      title={Gotta be SAFE: A New Framework for Molecular Design}, 
      author={Emmanuel Noutahi and Cristian Gabellini and Michael Craig and Jonathan S. C Lim and Prudencio Tossou},
      year={2023},
      eprint={2310.10773},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2310.10773}, 
}

Model Architecture:

Architecture Type: Transformer

Network Architecture: BERT

Input:

Input Type(s): Text (Molecular Sequence), Number (Molecules to generate, SoftMax temperature scaling factor, randomness factor, diffusion step-size), Enumeration (Scoring method), Binary (Showing unique molecules only)

Input Format(s): Text: String (Sequential Attachment-based Fragment Embedding (SAFE)); Number: Integer, FP32; Enumeration: String (QED, LogP); Binary: Boolean

Input Parameters: 1D

Other Properties Related to Input: Maximum input length is 512 tokens.
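
As an illustration of the input types listed above, the following minimal sketch gathers them into a single validated request object. The class name GenMolRequest, every field name, and every default value are hypothetical and are not taken from the NIM API. Only the parameter kinds, the QED/LogP enumeration, and the 512-token limit come from this card. The card lists "Integer, FP32" without saying which numeric parameter uses which type, so the types chosen below are assumptions, and the token count is expected from the caller because the tokenizer is not described here.

```python
from dataclasses import dataclass, asdict
from enum import Enum

MAX_INPUT_TOKENS = 512  # limit stated in this card


class Scoring(str, Enum):
    """Scoring methods named in the card."""
    QED = "QED"
    LOGP = "LogP"


@dataclass
class GenMolRequest:
    # All field names and defaults are hypothetical placeholders.
    safe: str                       # Text: SAFE string
    num_molecules: int = 10         # Number: molecules to generate
    temperature: float = 1.0        # Number: softmax temperature scaling factor
    randomness: float = 1.0         # Number: randomness factor
    step_size: int = 1              # Number: diffusion step-size (type assumed)
    scoring: Scoring = Scoring.QED  # Enumeration: scoring method
    unique: bool = False            # Binary: show unique molecules only

    def validate(self, token_count: int) -> None:
        """Check basic constraints; token_count is supplied by the caller."""
        if not self.safe:
            raise ValueError("SAFE string must not be empty")
        if token_count > MAX_INPUT_TOKENS:
            raise ValueError(f"input exceeds {MAX_INPUT_TOKENS} tokens")
        if self.num_molecules < 1:
            raise ValueError("num_molecules must be at least 1")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")

    def to_payload(self) -> dict:
        """Return a plain dict with the enum flattened to its string value."""
        payload = asdict(self)
        payload["scoring"] = self.scoring.value
        return payload


# Usage with a placeholder input string.
request = GenMolRequest(safe="c1ccccc1", num_molecules=5, scoring=Scoring.LOGP)
request.validate(token_count=8)
payload = request.to_payload()
```

Keeping validation separate from serialization lets the same object be checked against a real tokenizer count once one is available.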

Output:

Output Type(s): Text (List of molecule sequences), Number (List of scores)

Output Format: Text: Array of string (Sequential Attachment-based Fragment Embedding (SAFE)); Number: Array of FP32 (Scores)

Output Parameters: 2D

Other Properties Related to Output: Maximum output length is 512 tokens.
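
The output described above is a list of molecule strings with a parallel list of scores. A minimal sketch of post-processing such a result follows: it pairs each molecule with its score, optionally keeps one entry per distinct string, and sorts by score. The function name rank_molecules is hypothetical, and treating a higher score as better is an assumption that suits QED more naturally than LogP.

```python
from typing import Dict, List, Tuple


def rank_molecules(
    molecules: List[str],
    scores: List[float],
    unique: bool = True,
) -> List[Tuple[str, float]]:
    """Pair molecules with scores and sort from highest to lowest score."""
    if len(molecules) != len(scores):
        raise ValueError("molecules and scores must have the same length")

    pairs = list(zip(molecules, scores))
    if unique:
        # Keep the best score seen for each distinct molecule string.
        best: Dict[str, float] = {}
        for mol, score in pairs:
            if mol not in best or score > best[mol]:
                best[mol] = score
        pairs = list(best.items())

    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Usage with placeholder strings standing in for SAFE outputs.
ranked = rank_molecules(["A", "B", "A"], [0.1, 0.9, 0.5])
```

Deduplicating on the raw string is a simplification; two different strings can encode the same molecule, so a chemistry toolkit would be needed for canonical comparison.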

Software Integration:

Runtime Engine(s):
PyTorch >= 2.5.1

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere

NVIDIA Ada Lovelace

NVIDIA Hopper

NVIDIA Grace Hopper
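
The architectures listed above can be checked against a GPU's CUDA compute capability. The sketch below is a small, dependency-free illustration: the function names are hypothetical, and the mapping covers only the architectures named in this card. With PyTorch installed, the (major, minor) pair can be obtained from torch.cuda.get_device_capability().

```python
from typing import Optional, Tuple


def arch_for_capability(major: int, minor: int) -> Optional[str]:
    """Map a CUDA compute capability to an architecture listed in this card."""
    if major == 8 and minor in (0, 6, 7):
        return "NVIDIA Ampere"        # e.g. A100 (8.0), A6000 (8.6)
    if major == 8 and minor == 9:
        return "NVIDIA Ada Lovelace"  # e.g. L40, L40S (8.9)
    if major == 9 and minor == 0:
        # Grace Hopper systems also report 9.0 for their Hopper GPU.
        return "NVIDIA Hopper"        # e.g. H100 (9.0)
    return None  # not among the architectures listed in this card


def is_listed(capability: Tuple[int, int]) -> bool:
    """Return True if the capability maps to a listed architecture."""
    return arch_for_capability(*capability) is not None


# Usage: an H100 reports compute capability (9, 0).
h100_arch = arch_for_capability(9, 0)
```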

Supported Operating System(s):

Linux

Model Version(s):

GenMol v1.0

Training & Evaluation Dataset:

Training and Testing Dataset:

Link: SAFE-GPT GitHub, HuggingFace

Data Collection Method by dataset: Automated

Labeling Method by dataset: Automated

Properties: 1.1B SAFE strings covering various molecule types (drug-like compounds, peptides, multi-fragment molecules, polymers, reagents, and non-small molecules).

Dataset License(s): CC-BY-4.0

Evaluation Dataset:

Link: SAFE-DRUGS GitHub, HuggingFace

Data Collection Method by dataset: Not Applicable

Labeling Method by dataset: Not Applicable

Properties: SAFE-DRUGS consists of 26 known therapeutic drugs.

Dataset License(s): CC-BY-4.0

Inference:

Engine: PyTorch

Test Hardware: A6000, A100, L40, L40S, H100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure the model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.