Model Overview
Description:
Florence-2 is an advanced vision foundation model using a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks like captioning, object detection and segmentation.
This model is ready for non-commercial use.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see the Florence-2 Model Card.
License/Terms of Use
References:
Model Architecture:
Architecture Type: Transformer-Based
Network Architecture: DaViT; standard encoder-decoder
Input:
Input Type(s): Image, Text.
Input Format(s): Red, Green, Blue (RGB), String
Input Parameters: Two Dimensional (2D)
Other Properties Related to Input: Task prompt.
The model can perform 14 different vision-language and computer vision tasks. The input content field should be formatted as "<TASK_PROMPT><text_prompt (only when needed)><img>". Users need to specify the task type at the beginning. The image can be supplied either as base64 data or as an NvCF asset ID. Some tasks require a text prompt, which goes between the task prompt and the image. The supported tasks are:
- Caption
- Detailed Caption
- More Detailed Caption
- Region to category
- Region to description
- Caption to Phrase Grounding
- Object Detection
- Dense Region Caption
- Region proposal
- Open vocabulary detection
- Referring expression segmentation
- Region to segmentation
- Optical character recognition
- Optical character recognition with region
For <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, and <OPEN_VOCABULARY_DETECTION>, the text prompt is a normal description. For example: '<OPEN_VOCABULARY_DETECTION>dog<img src="data:image/jpeg;asset_id,868f5924-8ef2-4d8d-866e-87bb423126cb" />'.
For <REGION_TO_SEGMENTATION>, <REGION_TO_CATEGORY>, and <REGION_TO_DESCRIPTION>, the text prompt must be formatted as <loc_x1><loc_y1><loc_x2><loc_y2>, where the values are the normalized coordinates of the region-of-interest bounding box, calculated as shown below. For example: '<REGION_TO_SEGMENTATION><loc_2><loc_3><loc_998><loc_997><img src="" />'.
x1=int(top_left_x_coor/width*999)
y1=int(top_left_y_coor/height*999)
x2=int(bottom_right_x_coor/width*999)
y2=int(bottom_right_y_coor/height*999)
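The four formulas above can be wrapped in a small helper. This is a minimal sketch, not part of the model's API: the function name `region_to_loc_prompt` and its argument names are hypothetical, and it assumes the bounding box is given in pixel coordinates of the original image.

```python
def region_to_loc_prompt(x1, y1, x2, y2, width, height):
    """Turn a pixel bounding box into a <loc_x1><loc_y1><loc_x2><loc_y2> prompt.

    (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner,
    both in pixels; width and height are the image dimensions in pixels.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Same arithmetic as the formulas in the text: scale to 0..999, truncate.
    coords = (
        int(x1 / width * 999),
        int(y1 / height * 999),
        int(x2 / width * 999),
        int(y2 / height * 999),
    )
    return "".join(f"<loc_{c}>" for c in coords)


# A box covering the lower-right quadrant of a 640x480 image.
print(region_to_loc_prompt(320, 240, 640, 480, 640, 480))
```

The resulting string is used as the text prompt for the region tasks, placed between the task prompt and the image tag.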
Other tasks don't take a text prompt. For example: '<CAPTION><img src="data:image/png;asset_id,868f5924-8ef2-8g3c-866e-87bb423126cb" />'.
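The input format described above can be assembled programmatically. The sketch below is illustrative only: `image_tag` and `build_content` are hypothetical helper names, only the task prompts spelled out in this section are listed, and the `data:<mime>;base64,<data>` form for inline images is an assumption based on the standard data-URI convention (the section shows only the `asset_id` form explicitly).

```python
import base64

# Task prompts that this section says require a text prompt.
DESCRIPTION_TASKS = {
    "<CAPTION_TO_PHRASE_GROUNDING>",
    "<REFERRING_EXPRESSION_SEGMENTATION>",
    "<OPEN_VOCABULARY_DETECTION>",
}
REGION_TASKS = {
    "<REGION_TO_SEGMENTATION>",
    "<REGION_TO_CATEGORY>",
    "<REGION_TO_DESCRIPTION>",
}


def image_tag(image_bytes=None, asset_id=None, mime="image/jpeg"):
    """Build the <img> tag from raw image bytes or from an NvCF asset ID."""
    if (image_bytes is None) == (asset_id is None):
        raise ValueError("provide exactly one of image_bytes or asset_id")
    if asset_id is not None:
        return f'<img src="data:{mime};asset_id,{asset_id}" />'
    # Assumed inline form: a standard base64 data URI.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime};base64,{encoded}" />'


def build_content(task, img, text_prompt=""):
    """Concatenate task prompt, optional text prompt, and image tag."""
    needs_text = task in DESCRIPTION_TASKS or task in REGION_TASKS
    if needs_text and not text_prompt:
        raise ValueError(f"{task} requires a text prompt")
    if not needs_text and text_prompt:
        raise ValueError(f"{task} does not take a text prompt")
    return f"{task}{text_prompt}{img}"


tag = image_tag(asset_id="868f5924-8ef2-4d8d-866e-87bb423126cb")
print(build_content("<OPEN_VOCABULARY_DETECTION>", tag, "dog"))
```

Rejecting a missing or unexpected text prompt up front is a design choice of this sketch, so that a malformed request fails locally before it is sent.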
Output:
Output Type(s): Text, Bounding Box, Segmentation Mask
Output Format: String or Dictionary (Text), Image (RGB, Black & White)
Output Parameters: One Dimensional (1D)- Text, 2D- Bounding Box, Segmentation Mask
Other Properties Related to Output:
The response data needs to be saved to a zip file and extracted. The archive contains an overlay image (when a bounding box or segmentation is generated) and a <id>.response JSON file.
For caption-related tasks, the output is saved in "content": "<TASK_PROMPT>caption". For example, "content": "<CAPTION>A black and brown dog in a grass field".
For bounding boxes or segmentation masks, the output is saved in "entities": {"bboxes":[], "quad_boxes":[], "labels":[], "polygons": []}. For example, "entities": {"bboxes":[[192.47,68.882,611.081,346.83],[1.529,240.178,611.081,403.394]],"quad_boxes":null,"labels":["A black and brown dog","a grass field"],"bboxes_labels":null,"polygons":null}
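Reading the response archive might look like the following. This is a rough sketch under stated assumptions: the helper names `load_response` and `find_key` are hypothetical, and because this section does not say where "content" and "entities" sit inside the JSON, the code searches for those keys recursively instead of assuming a fixed nesting.

```python
import io
import json
import zipfile


def load_response(zip_bytes):
    """Return the parsed JSON from the first *.response file in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = [n for n in archive.namelist() if n.endswith(".response")]
        if not names:
            raise ValueError("no .response file found in archive")
        return json.loads(archive.read(names[0]))


def find_key(obj, key):
    """Depth-first search for the first occurrence of key in nested JSON."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = obj.values()
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def labeled_boxes(entities):
    """Pair each label with its bounding box; tolerate null fields."""
    boxes = entities.get("bboxes") or []
    labels = entities.get("labels") or []
    return list(zip(labels, boxes))
```

With the example above, `labeled_boxes` would pair "A black and brown dog" with the first box and "a grass field" with the second.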
Software Integration:
Runtime Engine(s):
- PyTorch
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta
Supported Operating System(s):
- Linux
- Windows
Model Version(s):
- Florence-2-base
- Florence-2-large
- Florence-2-base-ft
- Florence-2-large-ft
Training and Testing Datasets:
Training Dataset:
Link
- FLD-5B dataset (Microsoft)
Data Collection Method by dataset
- Hybrid: Human, Automatic/Sensors
Labeling Method by dataset
- Hybrid: Human, Automatic/Sensors
Properties (Quantity, Dataset Descriptions, Sensor(s))
- The dataset consists of images annotated for a diverse set of tasks, including captioning, detection, segmentation, and optical character recognition. It contains 126 million images, 500 million text annotations, 1.3 billion text-region annotations, and 3.6 billion text-phrase-region annotations across different tasks.
Testing Dataset:
Link
Data Collection Method by dataset
- Hybrid: Human, Automatic/Sensors
Labeling Method by dataset
- Hybrid: Human, Automatic/Sensors
Properties (Quantity, Dataset Descriptions, Sensor(s))
- COCO: A large-scale object detection, segmentation, and captioning dataset with 330K images (>200K labeled) and 1.5 million object instances.
- Flickr30k: Contains 31,000 images collected from Flickr, each with five reference sentences provided by human annotators.
Inference:
Engine: PyTorch
Test Hardware:
- NVIDIA L40