microsoft / kosmos-2

Model Overview

Description:

Kosmos-2 is a multimodal large language model (MLLM)
designed to ground text in the visual world,
enabling it to understand and reason about visual elements in images.

Note

This API is used in conjunction with the NVCF large assets API.

Terms of use

By using this model, you agree to the terms and conditions of the license, the acceptable use policy, and the Microsoft Research privacy policy.

Reference(s):

Model Architecture:

Architecture Type: Transformer

Network Architecture: GPT + CLIP

Input:

Input Format: Red, Green, Blue (RGB) Image + Text

Input Parameters: Temperature, TopP

Other Properties Related to Input: None
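
Example (illustrative only): the inputs listed above, an RGB image plus text with Temperature and TopP as sampling parameters, could be bundled into a request body roughly as follows. This card does not specify the request schema, so the field names (`prompt`, `image_asset_id`, `temperature`, `top_p`), the default values, and the accepted ranges are all assumptions made for this sketch; the image is referenced by a placeholder asset ID rather than embedded.

```python
import json


def build_request(prompt, image_asset_id, temperature=0.2, top_p=0.7):
    """Assemble a JSON-serializable request body.

    All field names, defaults, and ranges here are hypothetical; check the
    service's own API reference for the real schema.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the interval (0, 1]")
    return {
        "prompt": prompt,                  # text input
        "image_asset_id": image_asset_id,  # placeholder reference to an uploaded RGB image
        "temperature": temperature,        # sampling temperature
        "top_p": top_p,                    # nucleus sampling threshold
    }


if __name__ == "__main__":
    body = build_request("Describe the image.", "example-asset-id")
    print(json.dumps(body, indent=2))
```

Validating the sampling parameters before sending keeps malformed requests from reaching the endpoint; lower Temperature and TopP values generally make the output more deterministic.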

Output:

Output Format: Text

Output Parameters: Max output tokens, Bounding boxes

Other Properties Related to Output: None
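
Example (illustrative only): bounding boxes returned alongside the text usually need to be mapped onto the original image before they can be drawn. This card does not state how the boxes are encoded, so the sketch below assumes they arrive as normalized `(x1, y1, x2, y2)` tuples in the range 0 to 1; the function name `to_pixel_boxes` is hypothetical.

```python
from typing import List, Tuple

NormBox = Tuple[float, float, float, float]
PixelBox = Tuple[int, int, int, int]


def _clamp01(value: float) -> float:
    """Clamp a coordinate into the range [0, 1]."""
    return min(max(value, 0.0), 1.0)


def to_pixel_boxes(boxes: List[NormBox], width: int, height: int) -> List[PixelBox]:
    """Convert normalized boxes to pixel coordinates for a width x height image.

    Assumes each box is (x1, y1, x2, y2) with values nominally in [0, 1];
    this encoding is an assumption, not something the card specifies.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    result = []
    for x1, y1, x2, y2 in boxes:
        # Clamp out-of-range values and order the corners so left <= right, top <= bottom.
        left, right = sorted((_clamp01(x1), _clamp01(x2)))
        top, bottom = sorted((_clamp01(y1), _clamp01(y2)))
        result.append((
            round(left * width),
            round(top * height),
            round(right * width),
            round(bottom * height),
        ))
    return result


if __name__ == "__main__":
    print(to_pixel_boxes([(0.25, 0.5, 0.75, 1.0)], width=640, height=480))
```

If the service returns boxes in a different form (for example, absolute pixels or discrete location tokens), only the conversion step changes; the clamping and corner ordering remain useful safeguards.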

Supported Operating System(s):

Linux

Inference:

Engine: Triton

Test Hardware: Other