
NVCLIP (Commercial Foundation Model)

Model Overview

NVCLIP is an NVIDIA commercial version of the Contrastive Language-Image Pre-Training (CLIP) model that transforms images and text into embedding vectors in a shared space (size 1024 for the ViT-H variant). This model is ready for commercial/non-commercial use.

References:

  • Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.

Model Architecture:

Architecture Type: Transformer-Based

NVCLIP can be used as a backbone for various downstream tasks such as classification, detection, segmentation, and text-based image retrieval.

Input:

Input Type(s): Images, Texts

Input Format(s): List of Red, Green, Blue (RGB) Images or Strings

Other Properties Related to Input:

Channel Ordering of the Input: NCHW, where N = batch size, C = number of channels (3), H = image height (224), W = image width (224)
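
As a concrete illustration, a minimal preprocessing sketch that turns an RGB image into an NCHW input batch is shown below. The resize strategy and the normalization constants (standard OpenAI CLIP values here) are assumptions; the exact NVCLIP preprocessing pipeline is not specified in this card.

```python
# Minimal preprocessing sketch: RGB image -> NCHW float tensor of shape
# (1, 3, 224, 224). The mean/std below are the standard OpenAI CLIP
# constants; NVCLIP's exact pipeline may differ.
import numpy as np
from PIL import Image

MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(path: str) -> np.ndarray:
    img = Image.open(path).convert("RGB").resize((224, 224), Image.BICUBIC)
    x = np.asarray(img, dtype=np.float32) / 255.0   # HWC in [0, 1]
    x = (x - MEAN) / STD                            # per-channel normalization
    x = x.transpose(2, 0, 1)[None, ...]             # HWC -> NCHW with batch dim
    return x                                        # shape (1, 3, 224, 224)
```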

Output:

Output Type(s): Float tensor

Output Format: 3D Tensor

Other Properties Related to Output:

The output of this model is an embedding of the input image or text; the embedding size is 1024 for the ViT-H variant.

Supported Operating System(s):

  • Linux

Model Version(s):

  • nv_clip_224_vit_h_trainable_v1.0 - NVCLIP ViT-H at 224 resolution; a trainable foundation model.

Training & Evaluation:

This model can be used as a backbone and trained using the classification_pyt entrypoint in TAO. The training algorithm performs linear-probe fine-tuning for the classification task.
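
For intuition, the sketch below shows what a linear probe amounts to: the backbone is frozen and only a linear classifier is fit on its embeddings. It uses scikit-learn and random placeholder data for brevity and is not the TAO classification_pyt implementation.

```python
# Linear-probe sketch: freeze the backbone, fit only a linear classifier on
# its embeddings. Illustrative only; TAO's classification_pyt entrypoint
# handles this internally with its own trainer.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 1024)).astype(np.float32)  # placeholder frozen embeddings
train_lbl = rng.integers(0, 10, size=1000)                    # 10 hypothetical classes

probe = LogisticRegression(max_iter=1000)
probe.fit(train_emb, train_lbl)           # only the linear layer is learned
print(probe.score(train_emb, train_lbl))  # training accuracy of the probe
```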

Using this Model

These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU, including NVIDIA Jetson devices. These models can only be used with the Train Adapt Optimize (TAO) Toolkit or TensorRT.

The primary use case for these models is extracting feature embeddings from images. These embeddings can then be used for curation, clustering, and zero-shot or few-shot downstream tasks such as classification. They can also be used for text-based image retrieval.
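
As an illustration of text-based retrieval over precomputed embeddings, the sketch below ranks a gallery of image embeddings against a text embedding by cosine similarity. The embeddings are random placeholders; producing real ones is left to the model runtime.

```python
# Text-based image retrieval sketch: rank image embeddings by cosine
# similarity to a text query embedding. Embeddings are placeholders.
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
image_emb = normalize(rng.normal(size=(10_000, 1024)))  # gallery of image embeddings
text_emb = normalize(rng.normal(size=(1024,)))          # query text embedding

scores = image_emb @ text_emb          # cosine similarity (unit vectors)
top5 = np.argsort(scores)[::-1][:5]    # indices of the 5 best matches
print(top5, scores[top5])
```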

These models are intended for training and fine-tuning with the TAO Toolkit and your datasets for image comparison. High-fidelity models can be trained for new use cases. A Jupyter notebook is available as part of the TAO container and can be used for re-training.

The models are also intended for edge deployment using TensorRT.
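
For deployment, a minimal sketch of running a pre-built NVCLIP TensorRT engine from Python follows. It assumes the TensorRT 8.x Python API, pycuda, a hypothetical engine file name (nvclip_vit_h.engine), and a single image-input/embedding-output binding layout; an actual exported engine may differ.

```python
# Sketch of inference with a serialized TensorRT engine (TensorRT 8.x API
# assumed). Engine file name and binding layout are hypothetical.
import numpy as np
import tensorrt as trt
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

with open("nvclip_vit_h.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
stream = cuda.Stream()

inp = np.random.rand(1, 3, 224, 224).astype(np.float32)  # placeholder input batch
out = np.empty((1, 1024), dtype=np.float32)               # ViT-H embedding size

d_inp = cuda.mem_alloc(inp.nbytes)
d_out = cuda.mem_alloc(out.nbytes)

cuda.memcpy_htod_async(d_inp, inp, stream)
context.execute_async_v2(bindings=[int(d_inp), int(d_out)],
                         stream_handle=stream.handle)
cuda.memcpy_dtoh_async(out, d_out, stream)
stream.synchronize()
print(out.shape)  # (1, 1024) embedding
```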

Training Dataset:

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Automated

Properties:

Dataset             No. of Images
NV Internal Data    700M

Evaluation Dataset:

Link: https://www.image-net.org/

Data Collection Method by dataset:

  • Unknown

Labeling Method by dataset:

  • Unknown

Properties:

50,000 validation images from the ImageNet dataset

Methodology and KPI

Zero-shot top-1 accuracy of NVCLIP on the ImageNet validation dataset:

Model        Top-1 Accuracy
ViT-H-224    0.7786
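
For reference, a zero-shot evaluation along these lines typically embeds a text prompt per class (e.g., "a photo of a {class}") and assigns each image to the class whose text embedding has the highest cosine similarity with the image embedding. A minimal sketch with placeholder embeddings:

```python
# Zero-shot classification sketch: pick the class whose prompt embedding is
# most similar to the image embedding. All embeddings are placeholders.
import numpy as np

rng = np.random.default_rng(0)
class_emb = rng.normal(size=(1000, 1024))      # one prompt embedding per ImageNet class
class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)
image_emb = rng.normal(size=(1024,))
image_emb /= np.linalg.norm(image_emb)

pred = int(np.argmax(class_emb @ image_emb))   # predicted class index
print("predicted class:", pred)
```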

Inference:

Engine: TensorRT-LLM with Triton

Test Hardware:

  • L40

Ethical Considerations:

Bias, Safety & Security, and Privacy

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.

Special Training Data Considerations

The model was trained on publicly available data, which may contain toxic language and societal biases. Therefore, the model may amplify those biases, such as associating certain genders with specific social stereotypes.