USD Search

Model Overview

USD Search is an AI-powered search service for OpenUSD data, three-dimensional (3D) models, images, and assets that accepts text or image-based inputs. It leverages NVCLIP, an NVIDIA commercial version of the Contrastive Language-Image Pre-Training (CLIP) model, which transforms images and text into embeddings in a shared space so that the two can be compared directly.
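
A common way to use CLIP-style embeddings for retrieval is to embed the query and every indexed asset, then rank the assets by cosine similarity to the query. The sketch below illustrates that idea only. Whether USD Search ranks results exactly this way is an assumption, and the function names, asset names, and 3-dimensional toy vectors are hypothetical stand-ins for real NVCLIP embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_assets(
    query_embedding: Sequence[float],
    asset_embeddings: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every asset against the query and return the top_k best matches."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in asset_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy index: real embeddings would come from the model.
index = {
    "chair.usd": [0.9, 0.1, 0.0],
    "table.usd": [0.1, 0.9, 0.0],
    "lamp.usd": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for an embedded text or image query
results = rank_assets(query, index, top_k=2)
print([name for name, _ in results])
```

Because text and images share one embedding space, the same ranking step would serve both query types; only the encoder that produces `query` differs.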

References:

Radford, Alec, et al. "Learning Transferable Visual Models From Natural Language Supervision." International Conference on Machine Learning. PMLR, 2021.

Model Architecture:

Architecture Type: Transformer-based architecture

Input:

Input Type(s): Text or Image

Input Format(s): Text, or Red, Green, Blue (RGB) image

Other Properties Related to Input: The model accepts either text or image input, but not both simultaneously
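
The either-or input rule can be enforced on the client side before a request is sent. The following is a minimal sketch under stated assumptions: `resolve_query` and its parameter names are hypothetical and are not part of any USD Search API.

```python
from typing import Optional


def resolve_query(
    text: Optional[str] = None,
    image_bytes: Optional[bytes] = None,
) -> str:
    """Return which input kind was supplied, rejecting none or both."""
    has_text = text is not None and text.strip() != ""
    has_image = image_bytes is not None and len(image_bytes) > 0
    # Exactly one of the two inputs must be present.
    if has_text == has_image:
        raise ValueError("provide exactly one of text or image, not both or neither")
    return "text" if has_text else "image"


print(resolve_query(text="red office chair"))
print(resolve_query(image_bytes=b"\x89PNG placeholder bytes"))
```

Failing early with a clear error keeps an ambiguous request from reaching the service at all.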

Output:

Output Type(s): List

Output Format: Rendered thumbnails, asset metadata

Other Properties Related to Output:

The output of this model is a list of OpenUSD assets sorted by relevance. Each entry contains a rendered thumbnail and associated metadata, including a URL that points to the location of the asset in the backend database.
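
One way a client might model such a result list is shown below. This is an illustrative sketch only: the `SearchResult` class, its field names, the `score` field, and the example URLs are assumptions made for the example and do not describe the actual response schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SearchResult:
    """Hypothetical container for one returned asset."""
    thumbnail: bytes                 # rendered thumbnail image data
    score: float                     # assumed relevance score, higher is better
    metadata: Dict[str, str] = field(default_factory=dict)  # includes the asset URL


def sort_by_relevance(results: List[SearchResult]) -> List[SearchResult]:
    """Return a new list ordered from most to least relevant."""
    return sorted(results, key=lambda result: result.score, reverse=True)


# Placeholder entries; the URLs are invented for illustration.
unsorted_results = [
    SearchResult(b"", 0.41, {"url": "omniverse://example/props/lamp.usd"}),
    SearchResult(b"", 0.87, {"url": "omniverse://example/props/chair.usd"}),
    SearchResult(b"", 0.63, {"url": "omniverse://example/props/table.usd"}),
]
ranked = sort_by_relevance(unsorted_results)
for result in ranked:
    print(result.metadata["url"], result.score)
```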

Software Integration:

Runtime Engine(s):

TensorRT

Supported Hardware Architecture(s):

NVIDIA Ampere
NVIDIA Hopper
NVIDIA Lovelace

Supported Operating System(s):

Linux

Model Version(s):

nv_clip_224_vit_h - NVCLIP ViT-H with 224 resolution.

Training & Evaluation:

No training or evaluation was performed beyond what was done for the underlying NVCLIP model.

Using this Model

This model must be used with NVIDIA hardware and software. For hardware, the model can run on NVIDIA GPUs based on the NVIDIA Ampere architecture or newer.

Training Dataset:

Data Collection Method by dataset:

Automated

Labeling Method by dataset:

Automated

Properties:

Dataset	No. of Images
NV Internal Data	700M

Evaluation Dataset:

Link: https://www.image-net.org/

Data Collection Method by dataset:

Unknown

Labeling Method by dataset:

Unknown

Properties:

50,000 validation images from ImageNet dataset

The performance details of the underlying NVCLIP model are noted below.

Methodology and KPI

Performance is reported as the zero-shot top-1 accuracy of NVCLIP on the ImageNet validation dataset.

Model	Top-1 Accuracy
ViT-H-336	0.7786
ViT-L-336	0.7629
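
Top-1 accuracy is the fraction of images whose highest-scoring class matches the ground-truth label. In zero-shot classification with a CLIP-style model, the per-class scores are typically similarities between the image embedding and a text embedding for each class name. The sketch below shows only the accuracy computation on made-up scores; the exact evaluation protocol used for NVCLIP is not described here, and the toy numbers are not real results.

```python
from typing import Sequence


def top1_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
) -> float:
    """Fraction of rows whose highest-scoring class index equals the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one example is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the best-scoring class for this image.
        predicted = max(range(len(row)), key=lambda i: row[i])
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy similarity scores for 4 images over 3 classes (invented values).
toy_scores = [
    [0.8, 0.1, 0.1],  # predicted class 0
    [0.2, 0.7, 0.1],  # predicted class 1
    [0.3, 0.4, 0.3],  # predicted class 1
    [0.1, 0.2, 0.7],  # predicted class 2
]
toy_labels = [0, 1, 2, 2]
accuracy = top1_accuracy(toy_scores, toy_labels)
print(accuracy)  # 3 of 4 predictions match, so 0.75
```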

Inference:

Engine: TensorRT

Test Hardware:

A100
L40

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.