nvidia / molmim

Model Overview

Description:

MolMIM:

  • generates a random sample of new molecules in SMILES format by sampling from the latent space around the point corresponding to the given seed molecule.
  • performs optimization with the CMA-ES algorithm in the model’s latent space and samples molecules with improved values of the desired scoring function.

MolMIM is a latent variable model developed by NVIDIA that is trained in an unsupervised manner over a large-scale dataset of molecules in the form of SMILES strings. MolMIM utilizes transformer architecture to learn an informative fixed-size latent space using Mutual Information Machine (MIM) learning. MIM is a learning framework for a latent variable model which promotes informative and clustered latent codes. MolMIM can be used for sampling novel molecules from the model’s latent space.
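The CMA-ES-driven workflow described above can be sketched generically. The snippet below is a simplified stand-in, not MolMIM's actual implementation: the `score` function substitutes a toy quadratic for the real "decode latent, then score the molecule" step, and a basic (mu, lambda) evolution strategy stands in for full CMA-ES (no covariance adaptation). It only illustrates the shape of the loop: sample around a seed point in latent space, rank by score, and recenter.

```python
import random

LATENT_DIM = 8  # toy dimension; MolMIM maps each molecule to a fixed-size latent code

def score(z):
    # Stand-in for "decode latent -> SMILES -> score molecule".
    # Here: a toy quadratic with its optimum at the all-ones vector.
    return -sum((x - 1.0) ** 2 for x in z)

def optimize(seed_latent, iterations=50, population=32, sigma=0.5, elite_frac=0.25):
    """Simplified (mu, lambda) evolution strategy in latent space.

    A stand-in for CMA-ES: sample around the current mean, keep the
    top-scoring fraction, and recenter the mean on those elites.
    """
    mean = list(seed_latent)
    n_elite = max(1, int(population * elite_frac))
    for _ in range(iterations):
        samples = [[m + random.gauss(0.0, sigma) for m in mean]
                   for _ in range(population)]
        samples.sort(key=score, reverse=True)
        elites = samples[:n_elite]
        mean = [sum(col) / n_elite for col in zip(*elites)]
        sigma *= 0.95  # gradually narrow the search
    return mean

random.seed(0)
best = optimize([0.0] * LATENT_DIM)  # seed at the latent-space origin
print(score(best))  # converges close to the optimum score of 0
```

Full CMA-ES additionally adapts the covariance matrix of the sampling distribution each generation, which is what makes it effective in latent spaces where promising directions are correlated across dimensions.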

Reference(s):

Improving Small Molecule Generation using Mutual Information Machine

MIM: Mutual Information Machine

The CMA Evolution Strategy: A Comparing Review

Model Architecture:

Architecture Type: Encoder-Decoder
Network Architecture: Perceiver

MolMIM utilizes a Perceiver encoder architecture which outputs a fixed-size representation, so molecules of various lengths are mapped into a fixed-size latent space. MolMIM’s decoder architecture is a Transformer. Both encoder and decoder contain 6 layers with a hidden size of 512, 8 attention heads, and a feed-forward dimension of 2048. The total number of parameters in MolMIM is 65.2M. The model was trained with A-MIM learning.
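The stated hyperparameters can be summarized in a small config. The rough parameter estimate below counts only the transformer-block weight matrices (attention projections plus feed-forward), so it deliberately undercounts the published 65.2M total, which also includes token embeddings, the Perceiver's learned latent queries, biases, and layer norms.

```python
from dataclasses import dataclass

@dataclass
class MolMIMConfig:
    # Hyperparameters as stated in the model card.
    encoder_layers: int = 6
    decoder_layers: int = 6
    hidden_size: int = 512
    attention_heads: int = 8
    ffn_size: int = 2048
    max_input_tokens: int = 512
    max_output_tokens: int = 128

def rough_block_params(cfg: MolMIMConfig) -> int:
    """Approximate parameter count of the transformer blocks only.

    Per layer: 4 attention projection matrices (Q, K, V, output) plus
    two feed-forward matrices. Embeddings, Perceiver latent queries,
    biases, and layer norms are omitted, so this is a lower bound on
    the published 65.2M total.
    """
    d, ff = cfg.hidden_size, cfg.ffn_size
    per_layer = 4 * d * d + 2 * d * ff
    return (cfg.encoder_layers + cfg.decoder_layers) * per_layer

cfg = MolMIMConfig()
print(rough_block_params(cfg))  # 37748736, i.e. ~37.7M of the 65.2M total
```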

Input:

Input Type(s): Text (Molecular Sequence)

Input Format(s): Comma Separated Values, Simplified Molecular-Input Line Entry System (SMILES)

Input Parameters: 1D

Other Properties Related to Input: Maximum input length is 512 tokens. Pretraining dataset samples were randomly split into train, validation, and test sets (99% / 0.5% / 0.5%).
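The 99% / 0.5% / 0.5% split can be reproduced with a simple shuffled partition. The SMILES list below is a stand-in corpus, not the actual pretraining data.

```python
import random

def split_dataset(items, train=0.99, val=0.005, seed=42):
    """Randomly split items into train/validation/test sets (99 / 0.5 / 0.5)."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# Stand-in for the SMILES pretraining corpus.
smiles = [f"C{'C' * (i % 5)}O" for i in range(1000)]
train_set, val_set, test_set = split_dataset(smiles)
print(len(train_set), len(val_set), len(test_set))  # 990 5 5
```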

Output:

Output Type(s): Text, Numerical

Output Format: SMILES

Output Parameters: 2D

Other Properties Related to Output: Maximum output length is 128 tokens.

Software Integration:

Runtime Engine(s):

  • Triton Inference Server

Supported Hardware Microarchitecture Compatibility:

  • Ampere
  • L40

Preferred/Supported Operating System(s):

  • Linux
  • Windows

Model Version(s):

MolMIM-24.03

Training and Evaluation Dataset:

Link: ZINC-15

Data Collection Method by dataset:

  • Not Applicable

Labeling Method by dataset:

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): 1.54B molecules with molecular weight <= 500 Daltons, LogP <= 5, with reactivity levels rated as “reactive” and purchasability “annotated.” The compounds were filtered to ensure a maximum length of 512 characters.
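The selection criteria above (molecular weight <= 500 Daltons, LogP <= 5, SMILES length <= 512 characters) can be expressed as a simple predicate. The records below carry precomputed property values rather than deriving them from structure; in practice, molecular weight and LogP would be computed with a cheminformatics toolkit such as RDKit.

```python
MAX_WEIGHT = 500.0    # Daltons
MAX_LOGP = 5.0
MAX_SMILES_LEN = 512  # characters

def passes_zinc_filters(record):
    """Apply the ZINC-15 selection criteria described in this model card.

    `record` is a dict with precomputed `mw` and `logp` values; in a real
    pipeline these properties would come from a cheminformatics toolkit.
    """
    return (record["mw"] <= MAX_WEIGHT
            and record["logp"] <= MAX_LOGP
            and len(record["smiles"]) <= MAX_SMILES_LEN)

# Toy records with illustrative (not measured) property values.
candidates = [
    {"smiles": "CCO", "mw": 46.07, "logp": -0.14},                     # kept
    {"smiles": "C" * 600, "mw": 8400.0, "logp": 12.0},                 # dropped: fails all three
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O", "mw": 180.16, "logp": 1.31},   # kept
]
kept = [c for c in candidates if passes_zinc_filters(c)]
print(len(kept))  # 2
```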

Evaluation Dataset:

Link: MoleculeNet - Lipophilicity, FreeSolv, ESOL

Data Collection Method by dataset:

  • Hybrid: Human & Automatic/Sensors

Labeling Method by dataset:

  • Hybrid: Human & Automated

Properties (Quantity, Dataset Descriptions, Sensor(s)):

MoleculeNet Physical Chemistry is an aggregation of public molecular datasets. The physical chemistry portion of MoleculeNet that we used for evaluation is made up of ESOL (1,128 compounds), FreeSolv (642 compounds), and Lipophilicity (4,200 compounds).

Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, MoleculeNet: A Benchmark for Molecular Machine Learning, arXiv preprint, arXiv: 1703.00564, 2017.

From the MoleculeNet documentation:

  • ESOL is made up of water solubility data (log solubility in mols per litre) for common organic small molecules.
  • FreeSolv is made up of experimental and calculated hydration free energies of small molecules in water.
  • Lipophilicity is composed of experimental results of the octanol/water distribution coefficient (logD at pH 7.4).

Inference:

Engine: TensorRT

Test Hardware:

  • Ampere
  • L40

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards [Insert Link to Model Card++ here]. Please report security vulnerabilities or NVIDIA AI Concerns here.