microsoft / florence-2

Model Overview

Description:

Florence-2 is an advanced vision foundation model using a prompt-based approach to handle a wide range of vision and vision-language tasks. It can interpret simple text prompts to perform tasks like captioning, object detection and segmentation.

This model is ready for non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see link to the Florence-2 Model Card.

License/Terms of Use

MIT license.

Model Architecture:

Architecture Type: Transformer-Based

Network Architecture: DaViT; standard encoder-decoder

Input:

Input Type(s): Image, Text.

Input Format(s): Red, Green, Blue (RGB), String

Input Parameters: Two Dimensional (2D)

Other Properties Related to Input: Task prompt.

The model can perform 14 different vision-language and computer vision tasks. The input content field should be formatted as "<TASK_PROMPT><text_prompt (only when needed)><img>". Users must specify the task type at the beginning. The image can be supplied either as base64 data or as an NVCF asset ID. Some tasks require a text prompt, which goes between the task prompt and the image. The tasks are listed below:

  • Caption
  • Detailed Caption
  • More Detailed Caption
  • Region to category
  • Region to description
  • Caption to Phrase Grounding
  • Object Detection
  • Dense Region Caption
  • Region proposal
  • Open vocabulary detection
  • Referring expression segmentation
  • Region to segmentation
  • Optical character recognition
  • Optical character recognition with region

For <CAPTION_TO_PHRASE_GROUNDING>, <REFERRING_EXPRESSION_SEGMENTATION>, and <OPEN_VOCABULARY_DETECTION>, the text prompt is a plain-text description. For example: '<OPEN_VOCABULARY_DETECTION>dog<img src="data:image/jpeg;asset_id,868f5924-8ef2-4d8d-866e-87bb423126cb" />'.

For <REGION_TO_SEGMENTATION>, <REGION_TO_CATEGORY>, and <REGION_TO_DESCRIPTION>, the text prompt must be formatted as <loc_x1><loc_y1><loc_x2><loc_y2>, where the four values are the normalized coordinates of the region-of-interest bounding box, calculated as shown below. For example: '<REGION_TO_SEGMENTATION><loc_2><loc_3><loc_998><loc_997><img src="" />'.

x1=int(top_left_x_coor/width*999)
y1=int(top_left_y_coor/height*999)
x2=int(bottom_right_x_coor/width*999)
y2=int(bottom_right_y_coor/height*999)
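
The four formulas above can be wrapped in a small helper. This is a minimal sketch, not part of the service API: the function name `region_prompt` and its argument layout are hypothetical, and it only assembles the task prompt and location tokens (the image tag still has to be appended).

```python
def region_prompt(task, box, width, height):
    """Build '<TASK><loc_x1><loc_y1><loc_x2><loc_y2>' from a pixel-space box.

    box is (top_left_x, top_left_y, bottom_right_x, bottom_right_y) in pixels;
    width and height are the image dimensions in pixels.
    """
    left, top, right, bottom = box
    # Same normalization as the formulas above: scale to 0..999 and truncate.
    x1 = int(left / width * 999)
    y1 = int(top / height * 999)
    x2 = int(right / width * 999)
    y2 = int(bottom / height * 999)
    return f"<{task}><loc_{x1}><loc_{y1}><loc_{x2}><loc_{y2}>"


# A box covering the top-left quadrant of a 640x480 image.
print(region_prompt("REGION_TO_CATEGORY", (0, 0, 320, 240), 640, 480))
```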

Other tasks do not take a text prompt. For example: '<CAPTION><img src="data:image/png;asset_id,868f5924-8ef2-8g3c-866e-87bb423126cb" />'.
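
As an illustration of the content format, the sketch below assembles the string for any task. The helper names (`build_content`, `asset_src`, `base64_src`) are hypothetical. The asset-ID form follows the examples in this section; the `data:<mime>;base64,<data>` form is an assumption based on the standard data-URI convention, since this card does not show a base64 example.

```python
import base64

# Tasks that, per the text above, need a text prompt after the task prompt.
TEXT_PROMPT_TASKS = {
    "CAPTION_TO_PHRASE_GROUNDING",
    "REFERRING_EXPRESSION_SEGMENTATION",
    "OPEN_VOCABULARY_DETECTION",
    "REGION_TO_SEGMENTATION",
    "REGION_TO_CATEGORY",
    "REGION_TO_DESCRIPTION",
}


def asset_src(asset_id, mime="image/jpeg"):
    """Image source referencing an uploaded asset, as in the examples above."""
    return f"data:{mime};asset_id,{asset_id}"


def base64_src(image_bytes, mime="image/jpeg"):
    """Inline image source; the exact base64 form is an assumption."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_content(task, src, text_prompt=""):
    """Return '<TASK>text_prompt<img src="..." />' for the content field."""
    needs_text = task in TEXT_PROMPT_TASKS
    if needs_text and not text_prompt:
        raise ValueError(f"{task} requires a text prompt")
    if not needs_text and text_prompt:
        raise ValueError(f"{task} does not take a text prompt")
    return f'<{task}>{text_prompt}<img src="{src}" />'


print(build_content("CAPTION", base64_src(b"abc")))
```

Rejecting a missing or unexpected text prompt locally is a design choice of this sketch; the card does not describe how the service itself handles such input.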


Output:

Output Type(s): Text, Bounding Box, Segmentation Mask

Output Format: String or Dictionary (Text), Image (RGB, Black & White)

Output Parameters: One Dimensional (1D) for Text; Two Dimensional (2D) for Bounding Box and Segmentation Mask

Other Properties Related to Output:

The response data needs to be saved into a zip file and extracted. It contains an overlay image (when bounding box or segmentation is generated) and a <id>.response JSON file.
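
A minimal sketch of the extraction step, assuming the zip payload is already in memory as bytes. The function name `read_response_zip` is hypothetical, and the only layout assumption is the one stated above: one member whose name ends in ".response" and holds JSON, plus an optional overlay image.

```python
import io
import json
import zipfile


def read_response_zip(zip_bytes):
    """Return (parsed .response JSON, names of the other archive members)."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
        response_names = [n for n in names if n.endswith(".response")]
        if not response_names:
            raise ValueError("no .response file found in the archive")
        # json.loads accepts the raw bytes read from the archive member.
        payload = json.loads(archive.read(response_names[0]))
        others = [n for n in names if n not in response_names]
    return payload, others
```

The second return value lists whatever else is in the archive, such as the overlay image, so the caller can decide whether to extract it.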

For caption-related tasks, the output is saved in "content": "<TASK_PROMPT>caption". For example, "content": "<CAPTION>A black and brown dog in a grass field"
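
Separating the task prompt from the caption text could look like the sketch below. The function name `split_content` is hypothetical, and the pattern assumes the task prompt consists of uppercase letters and underscores, as in the task names used throughout this card.

```python
import re


def split_content(content):
    """Split '<TASK_PROMPT>caption' into (task, caption)."""
    match = re.fullmatch(r"<([A-Z_]+)>(.*)", content, flags=re.DOTALL)
    if match is None:
        raise ValueError("content does not start with a task prompt")
    return match.group(1), match.group(2).strip()


task, caption = split_content("<CAPTION>A black and brown dog in a grass field")
print(task)
print(caption)
```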

For bounding boxes or segmentation masks, the output is saved in "entities": {"bboxes":[], "quad_boxes":[], "labels":[], "polygons":[]}. For example, "entities": {"bboxes":[[192.47,68.882,611.081,346.83],[1.529,240.178,611.081,403.394]],"quad_boxes":null,"labels":["A black and brown dog","a grass field"],"bboxes_labels":null,"polygons":null}
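
To show how the detection fields fit together, the sketch below pairs each label with its bounding box. It assumes that `labels` and `bboxes` are index-aligned, which matches the example above but is not stated explicitly in this card; the function name `labeled_boxes` is hypothetical, and where "entities" sits inside the .response JSON is not specified here, so the function takes the dictionary directly.

```python
def labeled_boxes(entities):
    """Pair each label with its box; null (None) fields count as empty."""
    bboxes = entities.get("bboxes") or []
    labels = entities.get("labels") or []
    return list(zip(labels, bboxes))


# The example "entities" value from above, with JSON null written as None.
entities = {
    "bboxes": [[192.47, 68.882, 611.081, 346.83], [1.529, 240.178, 611.081, 403.394]],
    "quad_boxes": None,
    "labels": ["A black and brown dog", "a grass field"],
    "bboxes_labels": None,
    "polygons": None,
}

for label, box in labeled_boxes(entities):
    print(label, box)
```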

Software Integration:

Runtime Engine(s):

  • PyTorch

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Blackwell
  • NVIDIA Jetson
  • NVIDIA Hopper
  • NVIDIA Lovelace
  • NVIDIA Pascal
  • NVIDIA Turing
  • NVIDIA Volta

Supported Operating System(s):

  • Linux
  • Windows

Model Version(s):

  • Florence-2-base
  • Florence-2-large
  • Florence-2-base-ft
  • Florence-2-large-ft

Training and Testing Datasets:

Training Dataset:

Link

  • FLD-5B dataset (Microsoft)

Data Collection Method by dataset

  • Hybrid: Human, Automatic/Sensors

Labeling Method by dataset

  • Hybrid: Human, Automatic/Sensors

Properties (Quantity, Dataset Descriptions, Sensor(s))

  • The dataset consists of images annotated for a diverse range of tasks, including captioning, detection, segmentation, and optical character recognition. It contains 126 million images with 500 million text annotations, 1.3 billion text-region annotations, and 3.6 billion text-phrase-region annotations across the different tasks.

Testing Dataset:

Data Collection Method by dataset

  • Hybrid: Human, Automatic/Sensors

Labeling Method by dataset

  • Hybrid: Human, Automatic/Sensors

Properties (Quantity, Dataset Descriptions, Sensor(s))

  • COCO: A large-scale object detection, segmentation, and captioning dataset with 330K images (>200K labeled) and 1.5 million object instances.
  • Flickr30k: 31,000 images collected from Flickr, each with five reference sentences provided by human annotators.

Inference:

Engine: PyTorch

Test Hardware:

  • NVIDIA L40