microsoft / florence-2

Model Overview

Description:

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks such as captioning, object detection, and segmentation.

This model is ready for non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see the Florence-2 Model Card linked under References.

License/Terms of Use

MIT license.

References:

  • Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (arXiv:2311.06242)
  • Florence-2 Model Card: https://huggingface.co/microsoft/Florence-2-large

Model Architecture:

Architecture Type: Transformer-Based

Network Architecture: DaViT vision encoder; standard multi-modality encoder-decoder transformer

Input:

Input Type(s): Image, Text.

Input Format(s): Red, Green, Blue (RGB), String

Input Parameters: Two Dimensional (2D)

Other Properties Related to Input: Task prompt.

The model can perform 14 different vision-language and computer vision tasks. The input content field should be formatted as "<TASK_PROMPT><text_prompt (only when needed)><img>". Users must specify the task type at the beginning of the content string. The image can be supplied either as a base64-encoded string or as an NVCF asset ID. Some tasks additionally require a text prompt, which is placed between the task prompt and the image (see the examples below). The supported tasks are:

  • Caption
  • Detailed Caption
  • More Detailed Caption
  • Region to Category
  • Region to Description
  • Caption to Phrase Grounding
  • Object Detection
  • Dense Region Caption
  • Region Proposal
  • Open Vocabulary Detection
  • Referring Expression Segmentation
  • Region to Segmentation
  • Optical Character Recognition
  • Optical Character Recognition with Region

For <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, and <OPEN_VOCABULARY_DETECTION>, the text prompt is a free-form natural-language description. For example: '<OPEN_VOCABULARY_DETECTION>dog<img src="data:image/jpeg;asset_id,868f5924-8ef2-4d8d-866e-87bb423126cb" />'.

For <REGION_TO_SEGMENTATION>, <REGION_TO_CATEGORY>, and <REGION_TO_DESCRIPTION>, the text prompt must be formatted as <loc_x1><loc_y1><loc_x2><loc_y2>, where the values are the normalized coordinates of the region-of-interest bounding box, computed as shown below. For example: '<REGION_TO_SEGMENTATION><loc_2><loc_3><loc_998><loc_997><img src="data:image/jpeg;base64,iVBORw0KGgoAAAANSUhEUgAAAAgAAAAICAIAAABLbSncAAAAGUlEQVR4nGK5nHuGARtgwio6aCUAAQAA//+evgIfjH1FEwAAAABJRU5ErkJggg==" />'.

x1 = int(top_left_x_coor / width * 999)
y1 = int(top_left_y_coor / height * 999)
x2 = int(bottom_right_x_coor / width * 999)
y2 = int(bottom_right_y_coor / height * 999)
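
As a minimal sketch, the normalization above can be wrapped in a small Python helper (the function and argument names are illustrative, not part of the API):

def bbox_to_loc_tokens(top_left_x, top_left_y, bottom_right_x, bottom_right_y, width, height):
    # Map a pixel-space bounding box to the <loc_...> tokens expected by
    # region-based task prompts, using the normalization formulas above.
    x1 = int(top_left_x / width * 999)
    y1 = int(top_left_y / height * 999)
    x2 = int(bottom_right_x / width * 999)
    y2 = int(bottom_right_y / height * 999)
    return f"<loc_{x1}><loc_{y1}><loc_{x2}><loc_{y2}>"

# e.g. bbox_to_loc_tokens(2, 3, 999, 998, 1000, 1000) returns '<loc_1><loc_2><loc_998><loc_997>'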

The remaining tasks do not take a text prompt. For example: '<CAPTION><img src="data:image/png;asset_id,868f5924-8ef2-8g3c-866e-87bb423126cb" />'.
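
As a rough sketch of how a full request might be assembled, the Python snippet below builds a content string and posts it with the requests library. The invoke URL, header names, and payload layout are assumptions modeled on typical NVIDIA API conventions and should be verified against the actual API documentation:

import requests

invoke_url = "https://ai.api.nvidia.com/v1/vlm/microsoft/florence-2"  # assumed endpoint
api_key = "nvapi-..."  # your API key

# Task prompt + text prompt (when needed) + image, as described above.
content = (
    '<CAPTION_TO_PHRASE_GROUNDING>A black and brown dog in a grass field'
    '<img src="data:image/jpeg;asset_id,868f5924-8ef2-4d8d-866e-87bb423126cb" />'
)

# Assumed payload shape; large images are typically uploaded as NVCF assets first
# and referenced by asset ID, which may require additional asset-reference headers.
response = requests.post(
    invoke_url,
    headers={"Authorization": f"Bearer {api_key}"},
    json={"messages": [{"role": "user", "content": content}]},
)
response.raise_for_status()

# The response body is a zip archive (see Output below); save it for extraction.
with open("florence2_response.zip", "wb") as f:
    f.write(response.content)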


Output:

Output Type(s): Text, Bounding Box, Segmentation Mask

Output Format: String or Dictionary (Text); Image (RGB, Black & White)
Output Parameters: One-Dimensional (1D) - Text; Two-Dimensional (2D) - Bounding Box, Segmentation Mask

Other Properties Related to Output:

The response is returned as a zip archive, which must be saved and extracted. It contains an overlay image (when a bounding box or segmentation mask is generated) and a <id>.response JSON file.

For caption-related tasks, the output is saved in "content": "<TASK_PROMPT>caption". For example: "content": "<CAPTION>A black and brown dog in a grass field".

For bounding boxes or segmentation masks, the output is saved in "entities": {"bboxes":[], "quad_boxes":[], "labels":[], "polygons":[]}. For example: "entities": {"bboxes":[[192.47,68.882,611.081,346.83],[1.529,240.178,611.081,403.394]],"quad_boxes":null,"labels":["A black and brown dog","a grass field"],"bboxes_labels":null,"polygons":null}.
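
A minimal sketch of unpacking the archive in Python, assuming it was saved as florence2_response.zip as in the request example above; the exact file names inside the archive vary by request, and the placement of the "content"/"entities" fields inside the JSON should be confirmed against an actual response:

import json
import zipfile

with zipfile.ZipFile("florence2_response.zip") as zf:
    zf.extractall("florence2_output")
    names = zf.namelist()

# Locate the <id>.response JSON file produced alongside any overlay image.
response_name = next(n for n in names if n.endswith(".response"))
with open(f"florence2_output/{response_name}") as f:
    result = json.load(f)

print(result.get("content"))   # caption-style output, e.g. "<CAPTION>A black and brown dog ..."
print(result.get("entities"))  # bboxes / quad_boxes / labels / polygons, when generated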

Software Integration:

Runtime Engine(s):

  • PyTorch

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating System(s):

  • Linux
  • Windows

Model Version(s):

  • Florence-2-base
  • Florence-2-large
  • Florence-2-base-ft
  • Florence-2-large-ft

Training and Testing Datasets:

Training Dataset:

Link

  • FLD-5B dataset (Microsoft)

Data Collection Method by dataset

  • Hybrid: Human, Automatic/Sensors

Labeling Method by dataset

  • Hybrid: Human, Automatic/Sensors

Properties (Quantity, Dataset Descriptions, Sensor(s))

  • The dataset consists of images collected for a diverse set of tasks, including captioning, detection, segmentation, and optical character recognition. It contains 126 million images with 500 million text annotations, 1.3 billion text-region annotations, and 3.6 billion text-phrase-region annotations across the different tasks.

Testing Dataset:

Link

Data Collection Method by dataset

  • Hybrid: Human, Automatic/Sensors

Labeling Method by dataset

  • Hybrid: Human, Automatic/Sensors

Properties (Quantity, Dataset Descriptions, Sensor(s))

  • COCO: COCO is a large-scale object detection, segmentation, and captioning dataset containing 330K images (more than 200K labeled) and 1.5 million object instances.
  • Flickr30k: The Flickr30k dataset contains 31,000 images collected from Flickr, each paired with five (5) reference sentences provided by human annotators.

Inference:

Engine: PyTorch

Test Hardware:

  • NVIDIA L40