colabfold / msa-search

Model Overview

Description:

The MSA Search NIM is powered by GPU MMseqs2, a GPU-accelerated toolkit for protein database search, multiple sequence alignment (MSA), and structural template search. Although MMseqs2 is not a deep learning model, it requires large protein databases for sequence similarity search and structural template discovery.

The container components are ready for commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case. ColabFold was developed by the authors of Mirdita et al. 2022. GPU MMseqs2 was developed by the authors of Kallenborn et al. 2025.

License / Terms of Use

GOVERNING TERMS:

API Catalog:

The trial service is governed by the NVIDIA API Trial Terms of Service.

NIM Container:

The NIM container is governed by the NVIDIA Software License Agreement and Product-Specific Terms for AI Products.

Deployment Geography

Global

Use Case

The MSA Search NIM enables researchers and commercial entities in the Drug Discovery, Life Sciences, and Digital Biology fields to rapidly generate multiple sequence alignments (MSA) and find structural templates from PDB databases. The output MSA and template structures can be used in downstream protein structure prediction and evolutionary analysis applications.

Release Date

1.0.0

Build.nvidia.com March 16, 2025 via build.nvidia.com/colabfold/msa-search

NGC March 16, 2025

References:

@ARTICLE{jumper2021alphafold,
    title    = "Highly accurate protein structure prediction with {AlphaFold}",
    author   = "Jumper, John and Evans, Richard and Pritzel, Alexander and Green,
                Tim and Figurnov, Michael and Ronneberger, Olaf and
                Tunyasuvunakool, Kathryn and Bates, Russ and {\v Z}{\'\i}dek,
                Augustin and Potapenko, Anna and Bridgland, Alex and Meyer,
                Clemens and Kohl, Simon A A and Ballard, Andrew J and Cowie,
                Andrew and Romera-Paredes, Bernardino and Nikolov, Stanislav and
                Jain, Rishub and Adler, Jonas and Back, Trevor and Petersen, Stig
                and Reiman, David and Clancy, Ellen and Zielinski, Michal and
                Steinegger, Martin and Pacholska, Michalina and Berghammer, Tamas
                and Bodenstein, Sebastian and Silver, David and Vinyals, Oriol
                and Senior, Andrew W and Kavukcuoglu, Koray and Kohli, Pushmeet
                and Hassabis, Demis",
    journal  = "Nature",
    volume   =  596,
    number   =  7873,
    pages    = "583--589",
    month    =  aug,
    year     =  2021,
    language = "en",
    doi = {10.1038/s41586-021-03819-2},
}

@ARTICLE{mirdita2022colabfold,
    title   = "ColabFold: making protein folding accessible to all",
    author  = "Mirdita, Milot and Sch{\"u}tze, Konstantin and Moriwaki, Yoshitaka and Heo, Lim and Ovchinnikov, Sergey and Steinegger, Martin",
    journal = "Nature Methods",
    volume  = 19,
    number  = 6,
    pages   = "679--682",
    month   = jun,
    year    = 2022,
    language = "en",
    doi     = {10.1038/s41592-022-01488-1},
}

@ARTICLE{kallenborn2025gpu,
    title   = "GPU-accelerated homology search with MMseqs2",
    author  = "Kallenborn, Felix and Chacon, Alejandro and Hundt, Christian and Sirelkhatim, Hassan and Didi, Kieran and Cha, Sooyoung and Dallago, Christian and Mirdita, Milot and Schmidt, Bertil and Steinegger, Martin",
    journal = "bioRxiv",
    year    = 2025,
    month   = jan,
    day     = 20,
    language = "en",
    doi     = {10.1101/2024.11.13.623350},
}

Model Architecture:

Architecture Type: Not Applicable

Network Architecture: Not Applicable

Input:

Input Type(s): Protein Sequence, Databases, Structural Template Databases

Input Format(s): String (less than or equal to 4096 characters), Constrained List of Strings (one or more valid database names)

Input Parameters: String: 1D; Constrained List of Strings: 1D

Other Properties Related to Input: N/A
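
The input constraints above can be checked on the client side before a request is sent. The following is a minimal sketch, not the NIM's actual API: the payload field names `sequence` and `databases`, the helper `build_request`, and the contents of `EXAMPLE_DATABASES` (borrowed from the NGC model tags listed in this card) are assumptions made for illustration. Only the 4096-character limit and the "one or more valid database names" rule come from this card.

```python
MAX_SEQUENCE_LENGTH = 4096  # limit stated under Input Format(s)

# Hypothetical database names, borrowed from the NGC model tags in this card.
# The names the service actually accepts may differ.
EXAMPLE_DATABASES = frozenset(
    {"uniref30_2302", "pdb70_220313", "colabfold_envdb_202108"}
)


def build_request(sequence, databases, allowed=EXAMPLE_DATABASES):
    """Validate the inputs and return a request payload as a plain dict.

    The field names "sequence" and "databases" are assumptions, not a
    documented schema.
    """
    seq = sequence.strip()
    if not seq:
        raise ValueError("sequence must not be empty")
    if len(seq) > MAX_SEQUENCE_LENGTH:
        raise ValueError(
            f"sequence has {len(seq)} characters; limit is {MAX_SEQUENCE_LENGTH}"
        )
    if not databases:
        raise ValueError("at least one database name is required")
    unknown = [name for name in databases if name not in allowed]
    if unknown:
        raise ValueError(f"unknown database name(s): {unknown}")
    return {"sequence": seq, "databases": list(databases)}
```

Rejecting an over-long sequence or an unrecognized database name locally avoids a round trip to the service for requests that cannot succeed.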

Output:

Output Type(s): Multiple Sequence Alignment in A3M or FASTA format; Structural templates in mmCIF format

Output Format: A3M or FASTA (text file); mmCIF (text file)

Output Parameters: 1D

Other Properties Related to Output: N/A
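
For downstream use, the A3M text can be read with a few lines of code. The sketch below assumes the common A3M convention: FASTA-style `>` header lines, lowercase letters marking insertions relative to the query, and `-` marking gaps. The function names `parse_a3m` and `aligned_columns` are hypothetical helpers, and skipping lines that begin with `#` is an assumption about optional comment lines, not something this card specifies.

```python
def parse_a3m(text):
    """Return a list of (header, sequence) tuples from A3M-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' lines carry no alignment data.
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def aligned_columns(sequence):
    """Drop lowercase insertion states so the row aligns with the query."""
    return "".join(ch for ch in sequence if not ch.islower())
```

After insertions are removed, every row should have the same length as the query row, which is a quick sanity check on a returned alignment.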

Software Integration:

Runtime Engine(s):

Python, C++, CUDA

Supported Hardware Microarchitecture Compatibility:

NVIDIA Ampere, NVIDIA Hopper, NVIDIA Ada Lovelace

Supported Operating System(s):

Linux

Model Version(s):

The MSA Search NIM container downloads the following NGC models:

nim/colabfold/msa-search:uniref30_2302-m18v1

nim/colabfold/msa-search:pdb_20251028_zip-m18v1

nim/colabfold/msa-search:pdb70_220313-m18v1

nim/colabfold/msa-search:pdb100_230517-m18v1

nim/colabfold/msa-search:colabfold_envdb_202108-m18v1
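
Each tag above has the shape `path:database-suffix`. A minimal sketch for splitting a tag into those parts follows. The helper `parse_model_tag` is hypothetical, and reading the text after the last hyphen (`m18v1`) as a build suffix is an assumption based only on the pattern visible in this list.

```python
def parse_model_tag(tag):
    """Split an NGC model tag into (path, database, suffix).

    Assumes the form "<path>:<database>-<suffix>", as seen in the list above.
    """
    path, sep, version = tag.partition(":")
    if not sep or not version:
        raise ValueError(f"expected '<path>:<version>' in {tag!r}")
    database, dash, suffix = version.rpartition("-")
    if not dash:
        raise ValueError(f"expected '<database>-<suffix>' in {version!r}")
    return path, database, suffix
```

Using `rpartition` on the last hyphen keeps database names that contain underscores, such as `pdb_20251028_zip`, in one piece.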

Training & Evaluation:

Not Applicable.

Training Dataset:

Link: Not Applicable.

Data Collection Method by dataset:

Not Applicable.

Labeling Method by dataset:

Not Applicable.

Properties:
Not Applicable.

Evaluation Dataset:

Link: Not Applicable.

Data Collection Method by dataset:

Not Applicable.

Labeling Method by dataset:

Not Applicable.

Properties:
Not Applicable.

Inference:

Engine: Python, C++, CUDA

Test Hardware:

NVIDIA B200
NVIDIA A6000 Ada
NVIDIA A100
NVIDIA L40
NVIDIA H100

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer team to ensure these software components meet requirements for the relevant industry and use case and address unforeseen product misuse.

Please report quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.