Model Overview
Description:
Kosmos-2 is a multimodal large language model (MLLM)
designed to ground text to the visual world,
enabling it to understand and reason about visual elements in images.
Note
This API is used in conjunction with the NVCF large assets API.
Terms of use
By using this model, you agree to the terms and conditions of the license,
the acceptable use policy, and the Microsoft Research privacy policy.
Reference(s):
Model Architecture:
Architecture Type: Transformer
Network Architecture: GPT + CLIP
Input:
Input Format: Red, Green, Blue (RGB) Image + Text
Input Parameters: Temperature, TopP
Other Properties Related to Input: None
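As a rough illustration of the inputs listed above, the sketch below assembles a request body from an RGB image, a text prompt, and the two sampling parameters. The field names (image, prompt, temperature, top_p), the base64 encoding of the image, the default values, and the value ranges being checked are all assumptions made for illustration; none of them is taken from the API specification.

```python
import base64
import json


def build_request(image_bytes: bytes, prompt: str,
                  temperature: float = 0.2, top_p: float = 0.7) -> str:
    """Assemble a JSON request body. All field names are hypothetical."""
    if not prompt:
        raise ValueError("prompt must not be empty")
    # Assumed ranges; the real service may accept different bounds.
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be in [0, 1]")
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")

    payload = {
        # Image bytes are base64-encoded so they can travel inside JSON.
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "prompt": prompt,
        "temperature": temperature,
        "top_p": top_p,
    }
    return json.dumps(payload)
```

Validating the parameters on the client side keeps malformed requests from reaching the service, which is a design choice of this sketch rather than a requirement of the model.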
Output:
Output Format: Text
Output Parameters: Max output tokens, Bounding boxes
Other Properties Related to Output: None
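The output pairs generated text with bounding boxes. The following is a minimal sketch of how a client might map such boxes onto the original image, assuming each box arrives as normalized (x1, y1, x2, y2) corners in the range 0 to 1. That layout and the to_pixels helper are hypothetical; the actual response format may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]
PixelBox = Tuple[int, int, int, int]


def _clamp(value: float) -> float:
    """Keep a normalized coordinate inside [0, 1]."""
    return min(max(value, 0.0), 1.0)


def to_pixels(boxes: List[Box], width: int, height: int) -> List[PixelBox]:
    """Scale normalized boxes (assumed format) to pixel coordinates."""
    pixel_boxes = []
    for x1, y1, x2, y2 in boxes:
        pixel_boxes.append((
            round(_clamp(x1) * width),
            round(_clamp(y1) * height),
            round(_clamp(x2) * width),
            round(_clamp(y2) * height),
        ))
    return pixel_boxes
```

For example, to_pixels([(0.25, 0.5, 0.75, 1.0)], 640, 480) returns [(160, 240, 480, 480)].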
Supported Operating System(s):
Linux
Inference:
Engine: Triton
Test Hardware: Other