Retail Object Detection
Description:
RetailObjectDetection detects retail items within an image and classifies each detected object as a retail item or not.
This model is ready for commercial use.
References:
- Tobin, Josh, et al. "Domain randomization for transferring deep neural networks from simulation to the real world." 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2017.
- Morrical, Nathan, et al. "NViSII: A scriptable tool for photorealistic image generation." arXiv preprint arXiv:2105.13962 (2021).
- Get TAO Container
- Get other purpose-built models from the NGC model registry:
- TrafficCamNet
- PeopleNet
- PeopleNet-Transformer
- DashCamNet
- FaceDetectIR
- VehicleMakeNet
- VehicleTypeNet
- PeopleSegNet
- PeopleSemSegNet
- License Plate Detection
- License Plate Recognition
- PoseClassificationNet
- Facial Landmark
- FaceDetect
- 2D Body Pose Estimation
- ActionRecognitionNet
- People ReIdentification
- PointPillarNet
- CitySegFormer
- Retail Object Detection
- Retail Object Embedding
- Optical Inspection
- Optical Character Detection
- Optical Character Recognition
- PCB Classification
- PeopleSemSegFormer
Model Architecture:
Architecture Type: Convolutional Neural Network (CNN)
Network Architecture: EfficientDet, DINO-FAN_base
Input:
Input Type(s): Image
Input Format(s): Red, Green, Blue (RGB)
Input Parameters: 2D
Other Properties Related to Input: RGB Fixed Resolution: 416x416 and 960x544 (W x H); No minimum bit depth, alpha, or gamma.
The logos on retail items were smudged.
Output:
Output Type(s): Label(s), Bounding Box(es), Confidence Score(s)
Output Format: Label: Text String
Other Properties Related to Output: Category Label(s): returns a single category.
The logos on retail items were smudged.
Software Integration:
Runtime Engine(s):
- TAO - 5.2
- DeepStream 6.1 or later
Supported Hardware Architecture(s):
- Ampere
- Jetson
- Hopper
- Lovelace
- Pascal
- Turing
- Volta
Supported Operating System(s):
- Linux
- Linux 4 Tegra
Model Version(s):
- trainable_binary_v1.0
- deployable_binary_v1.0
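These model versions are hosted in the NGC model registry. As an illustration, a version can typically be pulled with the NGC CLI; the org/team/model path below is a placeholder and should be copied from this model's page on NGC:
# Hypothetical download command; replace the registry path with the one listed on NGC
ngc registry model download-version "nvidia/tao/retail_object_detection:trainable_binary_v1.0"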
Training & Evaluation:
Training Dataset:
Data Collection Method by dataset:
- Automatic/Sensors
Labeling Method by dataset:
- Human
Properties:
320,000 proprietary synthetic images of objects found in retail settings, generated by randomizing several simulation domains, including:
- light types, light intensities
- object sizes, orientations, and locations
- camera locations
- background textures
- flying distractors
The background textures are real images sampled from:
- Proprietary real images taken from a retail checkout counter
- HDRI texture maps created by NVIDIA Omniverse
Each synthetic image contains 1 target retail item. This dataset is set up to simulate the diverse environments in the real world and to have the detector learn to extract retail items from noisy backgrounds. The logos on retail items were smudged.
dataset | total #images | train #images | val #images |
---|---|---|---|
Synthetic data | 1,500,000 | 1,425,000 | 75,000 |
Real data - checkout counter 45-degree overhead | 107 | 85 | 23 |
Real data - shelf | 107 | 85 | 22 |
Real data - conveyor belt | 106 | 84 | 22 |
Real data - basket | 106 | 84 | 22 |
Real data - checkout counter barcode scanner view | 125 | 100 | 25 |
Real data - checkout counter overhead | 98 | 80 | 18 |
Fine-tuning Data
This model is fine-tuned on about 600 real proprietary images from 6 different real environments. In each environment, only 1 image per item is collected.
The fine-tuning data are captured at random camera heights and fields of view. All fine-tuning data were collected indoors, with retail items placed on the checkout counter, on shelves, in baskets, and on the conveyor belt. The camera is typically set up at a height of approximately 10 feet, at a 45-degree angle off the vertical axis, with a close field of view. This content was chosen to decrease the simulation-to-reality gap of the model trained on synthetic data and to improve the accuracy and robustness of the model. The logos on retail items were smudged.
Fine-tuning Data Ground-truth Labeling Guidelines
The fine-tuning data are created by human labelers annotating ground-truth bounding boxes and categories. The following guidelines were used while labeling the training data for the NVIDIA Retail Detection models. If you are looking to transfer-learn or fine-tune the models to adapt to your target environment and classes, please follow the guidelines below for better model accuracy.
- All objects that fall under the definition of retail items and are larger than the smallest bounding-box limit for the corresponding class (height >= 10px OR width >= 10px) are labeled with the appropriate class label.
- Occlusion: Partially occluded objects that are at least approximately 60% visible are labeled with a bounding box drawn around the visible part of the object and are marked as partially occluded. Objects under 60% visibility are not annotated.
- Truncation: An object, at the edge of the frame, which is 60% or more visible is marked with the truncation flag.
- Each frame is not required to have an object.
Evaluation Dataset:
Data Collection Method by dataset:
- Automatic/Sensors
Labeling Method by dataset:
- Human
Properties:
15,000 proprietary real-world images of objects found in retail settings.
Methodology and KPI
AP50 is calculated using an intersection-over-union (IoU) criterion of greater than 0.5. The KPIs for the evaluation data are reported in the table below. The model is evaluated on AP50 and AR over IoU thresholds 0.5:0.95. Both AR and AP numbers are based on a maximum of 100 detections per image. Please note that the “unseen items” measurements do not apply to the 100-class detection model.
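For reference, the standard IoU between a predicted box P and a ground-truth box G is

$$\mathrm{IoU}(P, G) = \frac{\operatorname{area}(P \cap G)}{\operatorname{area}(P \cup G)},$$

so a detection contributes to AP50 only when this ratio with a matching ground-truth box exceeds 0.5.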
Binary-class Retail Item Detection Model
scene | seen items result (AP50) | seen items result (AR MaxDets=100) | unseen items result (AP50) | unseen items result (AR MaxDets=100) |
---|---|---|---|---|
checkout counter 45 degree overhead | 0.960 | 0.791 | 0.959 | 0.753 |
shelf | 0.983 | 0.888 | 0.978 | 0.841 |
conveyor belt | 1.000 | 0.921 | 0.995 | 0.887 |
basket | 0.956 | 0.851 | 0.959 | 0.861 |
checkout counter barcode scanner view | 0.858 | 0.789 | 0.744 | 0.655 |
checkout counter overhead | 0.990 | 0.915 | 0.993 | 0.910 |
overall (mean of all scenes) | 0.959 | 0.859 | 0.938 | 0.818 |
Inference:
Engine: TensorRT
Test Hardware:
- Jetson AGX Xavier
- Xavier NX
- Orin
- Orin NX
- NVIDIA T4
- Ampere GPU
- A2
- A30
- L4
- DGX H100
- DGX A100
- L40
- JAO 64GB
- Orin NX 16GB
- Orin Nano 8GB
The inference is run on the provided unpruned model at FP16 precision. The model input resolution is 416x416. The inference performance is measured using trtexec on a Jetson AGX Orin 64GB and an A10; an example trtexec invocation is sketched after the table. The numbers shown here are for inference only. The end-to-end performance with streaming video data might vary slightly depending on other bottlenecks in the hardware and software.
model | device | batch size | latency (ms) | images per second |
---|---|---|---|---|
Retail Item Detection (binary) | Jetson AGX Orin 64GB | 1 | 10.43 | 96 |
Retail Item Detection (binary) | Jetson AGX Orin 64GB | 16 | 131.79 | 121 |
Retail Item Detection (binary) | Jetson AGX Orin 64GB | 32 | 258.44 | 124 |
Retail Item Detection (binary) | Tesla A10 | 1 | 4.27 | 234 |
Retail Item Detection (binary) | Tesla A10 | 16 | 44.94 | 356 |
Retail Item Detection (binary) | Tesla A10 | 64 | 174.46 | 367 |
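To reproduce comparable inference-only measurements, trtexec can benchmark a pre-built TensorRT engine. The command below is a minimal sketch, assuming an FP16 engine has already been generated for the detector (for example by DeepStream on first run or by the TAO conversion tooling); the engine file name is a placeholder.
# Hypothetical benchmark command; point --loadEngine at your generated engine file
trtexec --loadEngine=retailDetector_binary_fp16.engine --iterations=100 --avgRuns=10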
Instructions to use unpruned model with TAO
To use these models as pretrained weights for transfer learning, use the snippet below as a template for the model component of the experiment spec file for training an EfficientDet-TF2 model; a sketch of the corresponding launch command follows the snippet. For more information on the experiment spec file, refer to the RetailDetector notebook and the EfficientDet-TF2 TAO documentation.
# spec file
model:
name: 'efficientdet-d5'
data:
loader:
prefetch_size: 4
shuffle_file: True
num_classes: 101 # switch to 2 for RetailDetector_binary model
image_size: '416x416'
max_instances_per_image: 10
train_tfrecords:
- [train tfrecords]
val_tfrecords:
- [validation tfrecords]
val_json_file: [validation annotation json file path]
train:
num_examples_per_epoch: 10000 # change to train set size
...
evaluate:
num_samples: 500 # change to test set size
label_map: # label map file here
...
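Once the full experiment spec is in place, training is typically launched through the TAO launcher. The command below is a minimal sketch assuming the TAO 5.x CLI layout; the spec file path is a placeholder, and the exact arguments should be verified against the EfficientDet-TF2 section of the TAO documentation.
# Hypothetical launch command; -e points at the experiment spec file sketched above
tao model efficientdet_tf2 train -e /path/to/experiment_spec.yaml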
Instructions to deploy these models with DeepStream
Here is an example of using the Retail Item Embedder together with the Retail Item Detector [TODO: add Retail Item Embedder url here] for an end-to-end video analytics application. To do so, deploy these models with the DeepStream SDK. The DeepStream SDK is a streaming analytics toolkit that accelerates building AI-based video analytics applications. DeepStream supports direct integration of these models into the DeepStream sample apps.
To deploy these models with DeepStream 6.2, please follow the instructions below:
Download and install the DeepStream SDK. The installation instructions for DeepStream are provided in the DeepStream development guide. The config files for the purpose-built models are located under the DeepStream installation directory; /opt/nvidia/deepstream is the default installation directory, and this path will be different if you installed DeepStream elsewhere.
The sample config files are provided in the NVIDIA-AI-IOT deepstream_tao_apps repository (TODO: Update the URL when deepstream_tao_apps are merged with???). Assuming the repo is cloned under $DS_TAO_APPS_HOME, the config files for this model are in $DS_TAO_APPS_HOME/configs/retailDetector_tao:
# Binary-class detector (the primary GIE) inference setting
pgie_retailDetector_binary_config.yml
pgie_retailDetector_binary_config.txt
Key Parameters in pgie_retailDetector_100_tao_config.yml
property:
gpu-id: 0
net-scale-factor: 1
offsets: 0;0;0
model-color-format: 0
tlt-model-key: nvidia_tlt
tlt-encoded-model: ../../models/retailDetector/retailDetector_100.etlt
model-engine-file: ../../models/retailDetector/retailDetector_100.etlt_b1_gpu0_fp16.engine
labelfile-path: ../../models/retailDetector/retailDetector_100_labels.txt
network-input-order: 1
infer-dims: 3;416;416
maintain-aspect-ratio: 1
batch-size: 1
## 0=FP32, 1=INT8, 2=FP16 mode
network-mode: 2
num-detected-classes: 100
interval: 0
cluster-mode: 3
output-blob-names: num_detections;detection_boxes;detection_scores;detection_classes
parse-bbox-func-name: NvDsInferParseCustomEfficientDetTAO
custom-lib-path: ../../post_processor/libnvds_infercustomparser_tao.so
#Use the config params below for NMS clustering mode
class-attrs-all:
pre-cluster-threshold: 0.5
In order to decode the bounding-box information from the EfficientDet output tensor, the custom parser function and library have to be specified. To run inference with the model, run:
cd $DS_TAO_APPS_HOME/configs/retailDetector_tao
$DS_TAO_APPS_HOME/apps/tao_detection/ds-tao-detection -c retailDetector_100_config.txt -i file://$DS_TAO_APPS_HOME/samples/streams/retailDetector_h264.mp4
The "Deploying to DeepStream" chapter of TAO User Guide provides more details.
- Access the latest in Vision AI development workflows with NVIDIA TAO Toolkit 5.0
- Improve accuracy and robustness of vision AI models with vision transformers and NVIDIA TAO
- Train like a ‘pro’ without being an AI expert using TAO AutoML
- Create Custom AI models using NVIDIA TAO Toolkit with Azure Machine Learning
- Developing and Deploying AI-powered Robots with NVIDIA Isaac Sim and NVIDIA TAO
- Learn endless ways to adapt and supercharge your AI workflows with TAO - Whitepaper
- Customize Action Recognition with TAO and deploy with DeepStream
- Read the 2 part blog on training and optimizing 2D body pose estimation model with TAO - Part 1 | Part 2
- Learn how to train real-time License plate detection and recognition app with TAO and DeepStream.
- Model accuracy is extremely important, learn how you can achieve state of the art accuracy for classification and object detection models using TAO
- More information about TAO Toolkit and pre-trained models can be found at the NVIDIA Developer Zone
- TAO documentation
- Read the TAO Getting Started guide and release notes.
- If you have any questions or feedback, please refer to the discussions on TAO Toolkit Developer Forums
- Deploy your models for video analytics application using DeepStream. Learn more about DeepStream SDK
- Deploy your models in Riva for ConvAI use case.
Ethical Considerations:
The NVIDIA Retail Object Detection model detects retail items. However, no additional information, such as people or other distractors in the background, is inferred. The training and evaluation datasets consist mostly of North American content. An ideal training and evaluation dataset would additionally include content from other geographies.
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.