nvidia / nv-grounding-dino

Pretrained Grounding DINO with Commercial License

Description

Open-vocabulary object detection is a computer vision technique that detects one or more objects in a frame based on text input. Object detection recognizes the individual objects in an image and places bounding boxes around them. This model card contains pre-trained weights for a Grounding DINO object detection network pretrained on commercial datasets. Note that the model in this model card can be used for commercial purposes.

References

  • Zhang, H., Li, F., Liu, S., Zhang, L., Su, H., Zhu, J., Ni, L., Shum, H.: DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection.
  • Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., Zhang, L.: Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.
  • Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.
  • Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows.

License

The license to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of this license.

Model Architecture

Architecture Type: Transformer-based Network Architecture

Network Architecture

  • Backbone: Swin-Tiny, a window-attention-based vision Transformer, paired with a BERT-Base text encoder.
  • Encoder: 6 layers of multi-head attention with deformable attention layers.
  • Decoder: 6 layers of multi-head attention with deformable attention layers (a simplified sketch of deformable attention follows this list).
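For readers unfamiliar with deformable attention, the following is a minimal single-head, single-scale sketch in PyTorch of the idea behind it: each query predicts a small set of sampling offsets around a reference point and aggregates the bilinearly sampled features with learned weights. The class name, the offset scaling, and the single-scale simplification are illustrative assumptions, not the network's actual multi-scale, multi-head implementation.

```python
# Minimal single-head, single-scale sketch of deformable attention.
# NOT the actual multi-scale, multi-head implementation used in this network;
# the offset scaling (0.1 * tanh) is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttentionSketch(nn.Module):
    def __init__(self, dim=256, n_points=4):
        super().__init__()
        self.offsets = nn.Linear(dim, n_points * 2)  # a 2D offset per sampling point
        self.weights = nn.Linear(dim, n_points)      # an attention weight per point
        self.out_proj = nn.Linear(dim, dim)
        self.n_points = n_points

    def forward(self, query, feat, ref_points):
        """query: B x Q x C, feat: B x C x H x W, ref_points: B x Q x 2 in [-1, 1]."""
        B, Q, _ = query.shape
        offsets = self.offsets(query).view(B, Q, self.n_points, 2)
        weights = self.weights(query).softmax(dim=-1)              # B x Q x P
        locs = ref_points[:, :, None, :] + 0.1 * offsets.tanh()   # B x Q x P x 2
        sampled = F.grid_sample(feat, locs, align_corners=False)  # B x C x Q x P
        out = (sampled * weights[:, None]).sum(dim=-1)            # B x C x Q
        return self.out_proj(out.transpose(1, 2))                 # B x Q x C
```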

More Details

  • The models in this instance are object detectors that take an RGB image and a list of phrases as input and produce bounding boxes and classes as output. More specifically, this model was trained with a Swin-Tiny backbone that was trained in a supervised manner on NVIDIA proprietary data called NVImageNetv2, which allows commercial usage. In addition, BERT-Base was used as the starting weights for the text tower. Finally, Grounding DINO was trained end-to-end on about 1.8M images collected from publicly available datasets. Note that we ensured that all raw images used during training have commercial licenses to ensure safe commercial usage.

Input

Input Type(s): Image and a list of captions tokenized through HuggingFace

Input Format(s): Red, Green, Blue (RGB) images and tokenized text inputs. Any input resolution is supported, and images do not need additional pre-processing (e.g., alpha channels or bit-depth conversion).

Input Parameters: Multiple dimensions. See below for detailed input shapes

Other Properties Related to Input:

  • inputs: B x 3 x 544 x 960 (Batch Size x Channel x Height x Width)
  • input_ids: B x 256 (Batch Size x Max Token Length)
  • attention_mask: B x 256 (Batch Size x Max Token Length)
  • position_ids: B x 256 (Batch Size x Max Token Length)
  • token_type_ids: B x 256 (Batch Size x Max Token Length)
  • text_token_mask: B x 256 x 256 (Batch Size x Max Token Length x Max Token Length)
  • Because ONNX / TensorRT cannot take strings as input, tokenization is offloaded outside of the model graph. See the TAO Deploy repository for running tokenization through HuggingFace, and the sketch below for the general idea.
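As a rough illustration of producing the text tensors above, here is a minimal sketch using the HuggingFace BERT tokenizer. The phrase list, the prompt format, and the simplified full self-attention text_token_mask are assumptions for illustration only; the TAO Deploy repository is the authoritative reference (Grounding DINO's actual mask, for instance, restricts self-attention within each phrase).

```python
# Minimal tokenization sketch (assumptions: bert-base-uncased tokenizer,
# a " . "-joined phrase prompt, and a simplified text_token_mask in which all
# valid tokens attend to each other; see TAO Deploy for the real pipeline).
import numpy as np
from transformers import AutoTokenizer

MAX_TOKENS = 256
phrases = ["person", "car", "traffic light"]   # hypothetical phrase list
caption = " . ".join(phrases) + " ."

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(caption, padding="max_length", max_length=MAX_TOKENS,
                truncation=True, return_tensors="np")

input_ids = enc["input_ids"]                   # B x 256
attention_mask = enc["attention_mask"]         # B x 256
token_type_ids = enc["token_type_ids"]         # B x 256
position_ids = np.arange(MAX_TOKENS)[None, :]  # B x 256

# Simplified text self-attention mask: every valid token attends to every other.
valid = attention_mask.astype(bool)
text_token_mask = valid[:, :, None] & valid[:, None, :]  # B x 256 x 256
```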

Output

Output Type(s): Bounding Boxes and Confidence Scores for each detected object in the input image.

Output Format(s): One-Dimensional (1D) and Two-Dimensional (2D) vectors

Other Properties Related to Output:

  • pred_logits: B x 900 (Batch Size x Number of Queries)
  • pred_boxes: B x 900 x 4 (Batch Size x Number of Queries x Coordinates in cxcywh format); see the post-processing sketch below.
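As a rough sketch of turning these raw outputs into detections, the snippet below applies a sigmoid to the query logits and converts the boxes from cxcywh to absolute xyxy pixel coordinates. The score threshold and the assumption that pred_boxes are normalized to [0, 1] are illustrative, not the TAO post-processing code.

```python
# Hedged post-processing sketch: the 0.3 threshold and the assumption that
# pred_boxes are normalized cxcywh in [0, 1] are illustrative, not the TAO code.
import numpy as np

def postprocess(pred_logits, pred_boxes, img_w, img_h, score_thresh=0.3):
    """pred_logits: B x 900, pred_boxes: B x 900 x 4 (cxcywh)."""
    scores = 1.0 / (1.0 + np.exp(-pred_logits))  # sigmoid -> confidence per query
    keep = scores > score_thresh

    # cxcywh (normalized) -> xyxy (absolute pixel coordinates)
    cx, cy, w, h = np.moveaxis(pred_boxes, -1, 0)
    boxes = np.stack([(cx - w / 2) * img_w, (cy - h / 2) * img_h,
                      (cx + w / 2) * img_w, (cy + h / 2) * img_h], axis=-1)
    return boxes[keep], scores[keep]
```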

Software Integration

Runtime Engine(s):

  • TAO - 5.5.0

Supported Hardware Architecture(s):

  • NVIDIA Ampere
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Ada Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating System(s):

  • Linux
  • Linux 4 Tegra

Model Versions

  • grounding_dino_swin_tiny_commercial_trainable_v1.0 - Pre-trained Swin-Tiny Grounding DINO model for fine-tuning.
  • grounding_dino_swin_tiny_commercial_deployable_v1.0 - Deployable Swin-Tiny Grounding DINO model.

Training & Evaluation

This model was trained using the grounding_dino entrypoint in TAO. The training algorithm optimizes the network to minimize the localization loss and the contrastive embedding loss between text and visual features.
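Conceptually, the contrastive embedding part of that objective scores each visual query against each text token and pushes queries toward the tokens of their matched phrase. The sketch below is an assumption-laden simplification in PyTorch (plain BCE in place of the focal-style loss and Hungarian matching used in practice), not the TAO training code.

```python
# Conceptual sketch of a query-to-token contrastive alignment loss.
# Assumptions: plain BCE instead of a focal-style loss, and a precomputed
# target_mask standing in for the Hungarian matching used in real training.
import torch
import torch.nn.functional as F

def contrastive_embedding_loss(query_feats, text_feats, target_mask):
    """query_feats: B x Q x D, text_feats: B x T x D,
    target_mask: B x Q x T, 1 where a query should align with a token."""
    logits = torch.einsum("bqd,btd->bqt", query_feats, text_feats)
    return F.binary_cross_entropy_with_logits(logits, target_mask.float())
```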

Using this Model

These models need to be used with NVIDIA hardware and software. For hardware, the models can run on any NVIDIA GPU, including NVIDIA Jetson devices. These models can only be used with the Train Adapt Optimize (TAO) Toolkit or TensorRT.

The intended use for these models is detecting objects in a color (RGB) image. The model can be used to detect objects in photos and videos with appropriate image or video decoding and pre-processing.

These models are intended for training and fine-tuning with the TAO Toolkit and your datasets for object detection. High-fidelity models can be trained for new use cases. A Jupyter notebook is available as part of the TAO container and can be used to re-train.

The models are also intended for easy edge deployment using TensorRT.
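As a quick sanity check before building a TensorRT engine, an exported ONNX model can be exercised with ONNX Runtime using the input and output names listed above. This is a minimal smoke-test sketch: the model path is hypothetical, the dummy tensors are placeholders, and the exact dtypes expected by the exported graph are an assumption.

```python
# Minimal ONNX Runtime smoke test (an alternative to a TensorRT engine for
# quick checks). The model path is hypothetical and the dtypes are assumptions;
# inspect the exported graph for the authoritative ones.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("grounding_dino_swin_tiny.onnx",
                            providers=["CPUExecutionProvider"])
pred_logits, pred_boxes = sess.run(
    ["pred_logits", "pred_boxes"],
    {
        "inputs": np.zeros((1, 3, 544, 960), np.float32),   # preprocessed RGB image
        "input_ids": np.zeros((1, 256), np.int64),          # from the tokenizer
        "attention_mask": np.ones((1, 256), np.int64),
        "position_ids": np.arange(256, dtype=np.int64)[None, :],
        "token_type_ids": np.zeros((1, 256), np.int64),
        "text_token_mask": np.ones((1, 256, 256), dtype=bool),
    },
)
print(pred_logits.shape, pred_boxes.shape)  # (1, 900), (1, 900, 4)
```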

Training Dataset

Training Data

Grounding DINO was pretrained on a wide range of commercial datasets where the annotations were either human-generated or pseudo-labeled. The model was trained on 1,815,237 images and 14,794,974 instances carrying both object detection (OD) and grounding annotations. Please refer to the table below for details of every dataset used to train Grounding DINO.

| Dataset | Data Collection Method by dataset | Labeling Method by dataset | # of Images | # of Annotations |
|---|---|---|---|---|
| Subset of OpenImagesv5 | Unknown | Automated. Pseudo-labeled raw images with Objects365-trained CO-DETR. | 803,826 | 7,345,546 |
| Localized Narrative OpenImages | Unknown | Automated. Pseudo-labeled raw images and global captions with Grounding DINO. | 670,553 | 6,098,908 |
| Subset of LVIS | Unknown | Human-labeled (only contains commercial subset). | 30,740 | 391,840 |
| Subset of Mixed Grounding | Unknown | Human-labeled (only contains commercial subset). | 150,668 | 777,178 |
| Subset of RefCOCO | Unknown | Human-labeled (only contains commercial subset). | 36,459 | 36,459 |
| Subset of RefCOCO+ | Unknown | Human-labeled (only contains commercial subset). | 36,302 | 36,302 |
| Subset of RefCOCOg | Unknown | Human-labeled (only contains commercial subset). | 23,718 | 23,718 |
| Subset of gRefCOCO | Unknown | Human-labeled (only contains commercial subset). | 62,971 | 85,023 |

Evaluation Data

Data Collection Method by dataset:

  • Unknown

Labeling Method by dataset:

  • Human.

Properties:

  • COCO validation dataset with 5,000 images.

Methodology and KPI

The key performance indicator is the mean average precision (mAP), following the standard COCO evaluation protocol for object detection: mAP is averaged over IoU thresholds from 0.5 to 0.95, mAP50 and mAP75 are measured at fixed IoU thresholds of 0.5 and 0.75, and mAPs, mAPm, and mAPl cover small, medium, and large objects, respectively. The KPIs for the evaluation data are:

| model | precision | mAP | mAP50 | mAP75 | mAPs | mAPm | mAPl |
|---|---|---|---|---|---|---|---|
| grounding_dino_swin_tiny | BF16 | 46.1 | 59.9 | 51.0 | 30.5 | 49.3 | 60.8 |

Model Limitations

Grounding DINO was trained on images collected from the web and text data of everyday noun phrases. The model might not perform well on different data distributions, so further fine-tuning on the target domain is recommended to achieve a higher mAP.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.