Model Overview
Description:
SAM 2 (Segment Anything Model 2) is a foundation model from Meta that can segment any object given a visual prompt. The model was released by Meta AI Research (FAIR).
It produces segmentation masks of the object of interest, in single images and across video frames.
The capabilities of this model have been enhanced with a Grounding DINO model to support text-prompt-based segmentation. Grounding DINO is an open-vocabulary object detection model that can detect one or more objects in a frame based on a text input.
With Grounding DINO and SAM 2 combined, users can segment any object with a single text prompt, as sketched below.
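As a concrete illustration, below is a minimal sketch of this two-stage pipeline. It assumes the open-source facebookresearch/sam2 package; `detect_boxes` is a hypothetical stand-in for the deployed Grounding DINO detector, and the config/checkpoint names are taken from the public SAM 2 release and may differ here.

```python
# A minimal sketch of the text-prompted pipeline (not the deployed implementation).
# Assumptions: the open-source facebookresearch/sam2 package is installed; the
# config/checkpoint names come from the public SAM 2 release; `detect_boxes` is a
# hypothetical wrapper around whichever Grounding DINO endpoint is deployed.
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

def detect_boxes(image: np.ndarray, text: str) -> np.ndarray:
    """Hypothetical Grounding DINO wrapper: returns (N, 4) boxes in XYXY
    pixel coordinates for objects matching the text prompt."""
    raise NotImplementedError  # replace with the deployed detector call

image = np.array(Image.open("frame.jpg").convert("RGB"))

# Stage 1: open-vocabulary detection turns the text prompt into boxes.
boxes = detect_boxes(image, "a red car")

# Stage 2: SAM 2 turns each box prompt into a segmentation mask.
predictor = SAM2ImagePredictor(build_sam2("sam2_hiera_l.yaml", "sam2_hiera_large.pt"))
predictor.set_image(image)
masks, scores, _ = predictor.predict(box=boxes, multimask_output=False)
```

Passing the detected boxes as prompts keeps the two stages decoupled, so the detector can be swapped without changing the segmentation model.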
Third-Party Community Consideration
The SAM 2 model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case. For the SAM 2 model card, refer to the Model, data and annotation cards section in the SAM 2 paper.
Grounding DINO is an NVIDIA-trained model, pretrained on a wide range of commercial datasets where the annotations were either human-generated or pseudo-labeled. For more information about the model, refer to the Grounding DINO model card.
License/Terms of Use
GOVERNING TERMS: Your use of this API is governed by the NVIDIA API Trial Service Terms of Use;
and the use of this model is governed by the NVIDIA Community Model License.
References:
- SAM 2: Segment Anything in Images and Videos (Ravi et al., 2024), arXiv:2408.00714.
Model Architecture
Architecture Type: Transformer
Network Architecture: SAM 2. For more details, refer to the Model section in the SAM 2 paper.
Input:
- Input Types: Image, Video, Integers (Visual Prompts), Text
- Input Formats: Image: JPEG, PNG; Video: MP4
- Input Parameters: 2D, 3D
- Other Properties Related to Input:
  - The visual prompts are (X, Y) points with include/exclude labels, selected by the user by clicking on the image or video in the UI. They let the user choose the regions of interest to segment in an image or to track across a video (see the sketch after this list).
  - Users can also provide a text description to detect and segment the object of interest.
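To make the point-prompt format concrete, below is a minimal sketch that drives video tracking with include/exclude clicks. It assumes the open-source facebookresearch/sam2 video predictor; the frame directory and config/checkpoint names are illustrative assumptions, not the deployed API.

```python
# A minimal sketch of point prompts for video tracking (illustrative, not the
# deployed API). Assumes the open-source facebookresearch/sam2 package; the
# config/checkpoint names come from the public SAM 2 release.
import numpy as np
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

# The public SAM 2 examples expect a directory of JPEG frames.
state = predictor.init_state(video_path="./video_frames")

# (X, Y) clicks in pixel coordinates; label 1 = include, 0 = exclude.
points = np.array([[210.0, 350.0], [250.0, 220.0]], dtype=np.float32)
labels = np.array([1, 0], dtype=np.int32)

# Attach the clicks to object 1 on frame 0, then track it across the video.
predictor.add_new_points_or_box(
    inference_state=state, frame_idx=0, obj_id=1, points=points, labels=labels
)
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    frame_masks = (mask_logits > 0.0).cpu().numpy()  # one boolean mask per object
```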
Output:
- Output Types: Image, Video, Integers (Segmentation Mask)
- Output Formats: Image: JPEG, PNG; Video: MP4
- Output Parameters: 2D, 3D
- Other Properties Related to Output:
  - For image input, an image with segmentation mask(s) overlaid on the object(s) of interest is returned.
  - For video input, a video with segmentation mask(s) overlaid on the object(s) of interest is returned.
  - The segmentation mask matches the input image/video resolution; the background is represented with the value 0 and each object of interest is represented with its respective object ID (see the sketch below).
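To illustrate this convention, below is a minimal sketch that colorizes such an ID mask for overlay. It assumes `id_mask` is an (H, W) integer array (0 = background, k = object ID) and `image` is the matching (H, W, 3) uint8 RGB frame; the palette and blend factor are illustrative choices.

```python
# A minimal sketch of overlaying an object-ID mask on a frame (illustrative).
# Assumes id_mask is an (H, W) integer array with 0 = background and k = object
# ID, and image is the matching (H, W, 3) uint8 RGB frame.
import numpy as np

def overlay_ids(image: np.ndarray, id_mask: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    rng = np.random.default_rng(0)  # fixed seed: stable color per object ID
    palette = rng.integers(0, 256, size=(int(id_mask.max()) + 1, 3), dtype=np.uint8)
    color = palette[id_mask]        # (H, W, 3) per-pixel object colors
    out = image.copy()
    fg = id_mask > 0                # leave background (ID 0) untouched
    out[fg] = (alpha * color[fg] + (1 - alpha) * image[fg]).astype(np.uint8)
    return out
```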
Model Version(s):
Datasets:
For details about the dataset, refer to the SA-V Dataset and the Model, data and annotation cards section in the SAM 2 paper.
Training and Testing Datasets:
Link: SA-V Dataset Download Link.
Data Collection Method by dataset
- Humans: Videos were collected by crowdworkers with unknown equipment via a contracted third-party vendor.
Labeling Method by dataset
- Hybrid: Human and Automatic. Masks were generated by the Meta Segment Anything Model 2 (SAM 2) and human annotators.
Properties (Quantity, Dataset Descriptions, Sensor(s))
- The SA-V dataset consists of 51K diverse videos and 643K spatio-temporal segmentation masks (i.e., masklets). The videos vary in subject matter; common themes include locations, objects, and scenes. Masks range from large-scale objects such as buildings to fine-grained details such as interior decorations.
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.