
TAO Commercial Pretrained NV-Dinov2 Model

Model Overview

This model card contains pretrained weights of the NV-Dinov2 model, which can be used as a backbone for popular computer vision tasks such as classification, segmentation, and detection. This model is ready for commercial use.

These weights may be used as a starting point for classification, segmentation, detection, and change-detection applications in the Train Adapt Optimize (TAO) Toolkit to facilitate transfer learning.

References

  • Oquab, Maxime, et al. "DINOv2: Learning Robust Visual Features without Supervision." arXiv preprint arXiv:2304.07193 (2023).

Model Architecture

Architecture Type: Transformer-Based

NV-Dinov2 is a visual foundation model trained on an NVIDIA proprietary large-scale dataset. DINOv2 is a self-supervised learning method that combines two SSL techniques: DINO and iBOT. These models simplify the use of images in any system by producing all-purpose visual features, that is, features that work across image distributions and tasks without fine-tuning. Trained on a large curated dataset, our model learns robust, fine-grained representations useful for localization and classification tasks. It can be used as a foundation model for a variety of downstream tasks with few labeled examples. For more details on the method, see DINOv2 (Oquab et al., 2023).

Input:

Input Types: Images

Input Formats: Red, Green, Blue (RGB)

Input Parameters: Three-Dimensional (3D)

Other Properties Related to Input:

Minimum Resolution: 224 x 224

Maximum Resolution: 518 x 518

Alpha Channel: No alpha channel

  • Input image format: RGB image of dimensions 224 x 224 x 3 (H x W x C)

Note: ViT-G was fine-tuned for high-resolution images. It works for any input resolution from 224 x 224 x 3 up to 518 x 518 x 3.
Channel Ordering of the Input: NCHW, where N = batch size, C = number of channels (3), H = image height, W = image width
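
The following sketch shows one way to prepare an image batch in this layout using PyTorch and torchvision. It is an illustration only, not the official TAO preprocessing pipeline, and the ImageNet mean/std normalization values are an assumption that should be checked against your experiment spec.

# Illustrative preprocessing sketch (assumed ImageNet normalization, not the official TAO pipeline).
from PIL import Image
import torchvision.transforms as T

preprocess = T.Compose([
    T.Resize(256),                                # shorter side to 256
    T.CenterCrop(224),                            # 224 x 224, the minimum supported resolution
    T.ToTensor(),                                 # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],       # assumed ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

img = Image.open("sample.jpg").convert("RGB")     # drops any alpha channel
batch = preprocess(img).unsqueeze(0)              # NCHW: (1, 3, 224, 224)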

Output:

Output Types: Embedding - Float tensor

Output Format: 3D Vector

Other Properties Related to Output:

The output of this model is an embedding of the input image: a vector of size 1024 for the ViT-L variant and 1536 for ViT-G.
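
As a quick sanity check, the embedding width of a loaded backbone can be verified as in the sketch below. Here load_nvdinov2_backbone is a hypothetical placeholder for however the checkpoint is actually consumed in your pipeline (for example, through the TAO experiment spec shown later in this card).

import torch

# Hypothetical helper: stands in for however the NV-Dinov2 checkpoint is loaded.
backbone = load_nvdinov2_backbone("/path/to/nvdinov2.pth")
backbone.eval()

with torch.no_grad():
    emb = backbone(torch.randn(1, 3, 224, 224))   # NCHW input

# Expected embedding width: 1024 for ViT-L, 1536 for ViT-G.
assert emb.shape[-1] in (1024, 1536)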

Software Integration:

Runtime Engines:

  • TAO - 5.2

Supported Hardware Architectures:

  • NVIDIA Ampere
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating Systems:

  • Linux
  • Linux 4 Tegra

Model Versions:

  • NV-DinoV2-224-V0 - pre-trained at 224 x 224 resolution on a 130M-image internal dataset.
  • NV-DinoV2-518-V0 - pre-trained at 518 x 518 resolution on a 700M-image internal dataset.

Training & Evaluation

This model was trained using our implementation of DINOv2 on an NVIDIA proprietary commercial dataset.

Using this Model

These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU, including NVIDIA Jetson devices. These models can only be used with the Train Adapt Optimize (TAO) Toolkit or TensorRT.

The primary use case for these models is getting feature embeddings from images. These embeddings can then be used for downstream tasks such as classification, segmentation, and detection by adding relevant heads.
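
As a minimal sketch of what "adding a head" looks like, the snippet below trains a linear classifier on top of frozen embeddings. The 1024-dimensional width assumes the ViT-L variant (1536 for ViT-G), and the embeddings are random placeholders standing in for backbone outputs.

import torch
import torch.nn as nn

EMBED_DIM, NUM_CLASSES = 1024, 10                 # 1024 assumes ViT-L; use 1536 for ViT-G
head = nn.Linear(EMBED_DIM, NUM_CLASSES)          # linear classification head on frozen embeddings

# Placeholder batch of embeddings; in practice these come from the frozen backbone.
embeddings = torch.randn(8, EMBED_DIM)
labels = torch.randint(0, NUM_CLASSES, (8,))

logits = head(embeddings)                         # (8, NUM_CLASSES)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                   # only the head receives gradients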

These models are intended for training and fine-tuning using the TAO Toolkit and your datasets for image comparison. High-fidelity models can be trained for new use cases. A Jupyter notebook is available as part of the TAO container and can be used for re-training.

The models are also intended for edge deployment using TensorRT.

Using the Model with TAO

To use these models as pretrained weights for transfer learning, use the snippet below as a template for the model and train component of the experiment spec file to train a Dinov2 Classification model. For more information on the experiment spec file, see the TAO Toolkit User Guide.

For ViT-L NV-Dinov2:

model:
  init_cfg:
    checkpoint: None
  backbone:
    type: vit_large_patch14_dinov2_swiglu
    pretrained: /path/to/nvdinov2.pth
    freeze: True
  head:
    type: TAOLinearClsHead

Training Dataset

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Automated

Properties:

| Dataset | No. of Images |
|---|---|
| NV Internal Data (NV-DinoV2-224-V0) | 130M |
| NV Internal Data (NV-DinoV2-518-V0) | 700M |

Evaluation Dataset:

Link: https://www.image-net.org/

Data Collection Method by dataset:

  • Unknown

Labeling Method by dataset:

  • Unknown

Properties:

50,000 validation images from the ImageNet dataset

Methodology and KPI

The key performance indicator is top-1 accuracy, following the standard evaluation protocol for image classification. The KPIs for the evaluation data are reported below.

| Model | Evaluation data | Top-1 accuracy |
|---|---|---|
| ViT-L NV-Dinov2 | ImageNet validation | 79.9 |
| ViT-G NV-Dinov2 | ImageNet validation | 80.4 |
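
For reference, top-1 accuracy follows the standard protocol: the prediction is the argmax over the class logits and counts as correct when it matches the ground-truth label, as in this minimal sketch.

import torch

def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    # Fraction of samples whose argmax prediction matches the ground-truth label.
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()

# Toy example: 4 samples, 3 classes; 3 of 4 predictions are correct.
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.1, 0.2, 0.9],
                       [1.2, 0.3, 0.4]])
labels = torch.tensor([0, 1, 2, 1])
print(top1_accuracy(logits, labels))              # 0.75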

Inference:

Engine: TensorRT

Test Hardware:

  • A2
  • A30
  • DGX H100
  • DGX A100
  • L4
  • L40
  • NVIDIA T4
  • AGX Orin 64GB
  • Orin NX 16GB

Inference is run on the provided unpruned model at FP16 precision. Inference performance is measured using trtexec on Jetson AGX Orin, Orin NX, NVIDIA T4, and Ampere GPUs. The Jetson devices run in the Max-N configuration for maximum GPU frequency. The numbers shown here reflect inference-only performance; end-to-end performance with streaming video data may vary depending on other bottlenecks in the hardware and software.

NV-DinoV2 (224 x 224 resolution)

| Platform | BS | FPS |
|---|---|---|
| Orin NX 16GB | 16 | 31.55 |
| AGX Orin 64GB | 16 | 81.41 |
| A2 | 16 | 72.7 |
| T4 | 4 | 110.3 |
| A30 | 16 | 461.0 |
| L4 | 4 | 275.0 |
| L40 | 8 | 579.0 |
| A100 | 32 | 1031.0 |
| H100 | 64 | 2500.6 |

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.