meta / esm2-650m

Model Overview

Description:

ESM2 is a state-of-the-art language model for protein sequences. Given an input protein sequence, it outputs a numerical representation (embedding) suitable for downstream tasks; in particular, its output is used for protein structure prediction in the ESMFold model. ESM2 was developed and trained by Meta and comes in several sizes: 650M, 3B, and 15B parameters. The larger models tend to produce more accurate results on downstream tasks, but they also have longer runtimes.
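As a sketch of how these embeddings can be obtained, the example below uses the Hugging Face transformers API with the publicly released 650M checkpoint; the checkpoint name and client library are assumptions, since this card does not specify a client interface.

```python
# Hedged sketch: extract per-residue ESM2 embeddings via Hugging Face
# transformers. The checkpoint name below is Meta's public release,
# an assumption not stated in this card.
import torch
from transformers import AutoTokenizer, EsmModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-residue embeddings: (batch, tokens incl. special tokens, hidden size)
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # torch.Size([1, 35, 1280]) for the 650M model
```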

License

MIT License

Copyright (c) Meta Platforms, Inc. and affiliates.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

This model is available for commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA Model Card for details.

References:

@article{rives2021esm,
  title    = {Biological structure and function emerge from scaling
              unsupervised learning to 250 million protein sequences},
  author   = {Rives, Alexander and Meier, Joshua and Sercu, Tom and Goyal,
              Siddharth and Lin, Zeming and Liu, Jason and Guo, Demi and Ott,
              Myle and Zitnick, C Lawrence and Ma, Jerry and Fergus, Rob},
  journal  = {Proceedings of the National Academy of Sciences},
  volume   = {118},
  number   = {15},
  month    = apr,
  year     = {2021},
  keywords = {deep learning; generative biology; protein language model;
              representation learning; synthetic biology},
  language = {en},
  doi      = {10.1073/pnas.2016239118}
}


For the self-attention contact prediction:

@article{rao2020transformer,
  title   = {Transformer protein language models are unsupervised
             structure learners},
  author  = {Rao, Roshan M and Meier, Joshua and Sercu, Tom and Ovchinnikov,
             Sergey and Rives, Alexander},
  journal = {bioRxiv},
  year    = {2020},
  doi     = {10.1101/2020.12.15.422761},
  url     = {https://www.biorxiv.org/content/10.1101/2020.12.15.422761v1}
}


Model Architecture:

Architecture Type: BERT-style transformer encoder with rotary position embeddings (RoPE)

Network Architecture: ESM2-650M

Input:

Input Type(s): Protein Sequence

Input Format(s): String

Input Parameters: 1D

Other Properties Related to Input: Protein sequence matching the regular expression ^[ARNDCQEGHILKMFPSTWYVXBOU]*$, up to 1024 characters
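The constraint above can be checked client-side before submission; a minimal sketch (the helper name is hypothetical):

```python
# Minimal input-validation sketch for the constraint stated above;
# the helper name is hypothetical, not part of any published API.
import re

VALID_SEQUENCE = re.compile(r"^[ARNDCQEGHILKMFPSTWYVXBOU]*$")
MAX_LENGTH = 1024

def is_valid_input(sequence: str) -> bool:
    """Check the allowed amino-acid alphabet and the maximum length."""
    return len(sequence) <= MAX_LENGTH and VALID_SEQUENCE.match(sequence) is not None

assert is_valid_input("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
assert not is_valid_input("MKTZ")  # 'Z' is not in the allowed alphabet
```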

Output:

Output Type(s): Embeddings

Output Format: Float16 array

Output Parameters: 1D

Other Properties Related to Output: NA
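If a single fixed-size vector per sequence is needed, the per-residue embeddings can be pooled; mean pooling is a common choice, but it is an assumption here, not a documented requirement of this model.

```python
# Sketch: reduce per-residue embeddings to one float16 vector per sequence.
# Mean pooling over non-padding positions is an assumed convention.
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over non-padding positions, cast to float16."""
    mask = attention_mask.unsqueeze(-1).type_as(token_embeddings)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1)
    return (summed / counts).to(torch.float16)

# Using the outputs from the earlier example:
# embedding = mean_pool(outputs.last_hidden_state, inputs["attention_mask"])
# embedding.shape -> torch.Size([1, 1280]) for the 650M model
```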

Software Integration:

Runtime Engine(s):

  • Not Applicable (N/A)

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • L40

Supported Operating System(s):

  • Linux

Model Version(s):

ESM2-650M

Training & Evaluation:

Training Dataset:

Link:
UniRef50

Data Collection Method by Dataset:

  • Not Applicable

Labeling Method by Dataset:

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): The September 2021 release of UniRef50 was used to train the ESM models. The dataset was partitioned by randomly selecting 0.5% (≈250,000) of its sequences to form a validation set; sequences matching the validation set were then removed from the training set via the procedure described in the ESM2 publication.
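As an illustration of the split described above (the actual ESM training pipeline is not reproduced here), a random fractional hold-out might look like:

```python
# Illustrative sketch of a 0.5% random validation hold-out; this is not
# the actual ESM data pipeline, and the function name is hypothetical.
import random

def split_validation(sequence_ids, fraction=0.005, seed=0):
    """Randomly hold out `fraction` of the sequences as a validation set."""
    rng = random.Random(seed)
    ids = list(sequence_ids)
    rng.shuffle(ids)
    n_valid = int(len(ids) * fraction)
    return ids[n_valid:], ids[:n_valid]  # (train_ids, validation_ids)

# train_ids, valid_ids = split_validation(uniref50_ids)
```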

Evaluation Dataset:

UniRef50

Data Collection Method by Dataset:

  • Not Applicable

Labeling Method by Dataset:

  • Not Applicable

Properties (Quantity, Dataset Descriptions, Sensor(s)): Evaluation uses the held-out validation set of 0.5% (≈250,000) sequences randomly selected from the September 2021 release of UniRef50, as described under Training Dataset above.

Inference:

Engine: Triton

Test Hardware:

  • Other (Not Listed)
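Since inference is served through Triton, a client request might look like the sketch below; the model name and tensor names are hypothetical, as the deployment's I/O signature is not specified in this card.

```python
# Hedged sketch of querying a Triton Inference Server deployment over HTTP.
# The model name ("esm2-650m") and tensor names ("sequences", "embeddings")
# are hypothetical; check the actual deployment's model configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inp = httpclient.InferInput("sequences", [1], "BYTES")
inp.set_data_from_numpy(np.array([sequence.encode()], dtype=object))

result = client.infer(model_name="esm2-650m", inputs=[inp])
embeddings = result.as_numpy("embeddings")
print(embeddings.shape)
```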

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
