Cosmos Nemotron 34B Model Card
Description
NVIDIA Cosmos Nemotron 34B is a leading vision language model (VLM) that can query and summarize images and videos from the physical or virtual world. Cosmos Nemotron 34B is deployable in the data center, in the cloud, and at the edge, including on Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework. We find that: (1) image-text pairs are not enough; interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; and (3) re-blending text-only instruction data is crucial to boost both VLM and text-only performance.
This model is ready for commercial use. It was trained on commercially licensed images and videos for all three stages of training and supports single-image and video inference. This version does not support interleaved image-text inputs or in-context learning.
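As a minimal sketch of what an image-summarization query might look like, the example below assumes the model is served behind an OpenAI-compatible chat-completions endpoint (as is typical for NIM-style deployments); the endpoint URL, model identifier, and image path are illustrative placeholders, not documented values.

```python
# Minimal sketch: query a deployed Cosmos Nemotron endpoint for an image summary.
# Assumes an OpenAI-compatible chat-completions API; the endpoint URL and
# model name below are illustrative placeholders, not documented values.
import base64
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder
MODEL = "cosmos-nemotron-34b"                           # placeholder

with open("scene.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": MODEL,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize what is happening in this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    "max_tokens": 256,
}

resp = requests.post(ENDPOINT, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```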
License
The license to use this model is covered by the Model EULA. By downloading the unpruned or pruned version of the model, you accept the terms and conditions of this license.
Model Architecture
Architecture Type: Transformer-based Network Architecture
Network Architecture
- Vision Encoder: SigLIP-400M
- Language Encoder: Yi-34B
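For orientation, the sketch below shows how this class of VLM typically wires a SigLIP vision encoder into a decoder-only LLM: vision tokens are mapped through a small projector into the LLM's embedding space and consumed alongside text embeddings. The module structure and hidden sizes (1152 for SigLIP-400M, 7168 for Yi-34B) reflect commonly reported values and are assumptions, not the released implementation.

```python
# Illustrative sketch of a projector-style VLM: vision features are mapped
# into the LLM embedding space and prepended to the text embeddings.
# Dimensions and module choices are assumptions, not the released weights.
import torch
import torch.nn as nn

class VLMSketch(nn.Module):
    def __init__(self, vision_encoder, llm, vision_dim=1152, llm_dim=7168):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a SigLIP-400M backbone
        self.projector = nn.Sequential(        # maps vision tokens -> LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                         # e.g. a Yi-34B decoder that
                                               # accepts precomputed embeddings

    def forward(self, pixel_values, text_embeds):
        vision_tokens = self.vision_encoder(pixel_values)  # (B, N, vision_dim)
        vision_embeds = self.projector(vision_tokens)      # (B, N, llm_dim)
        inputs = torch.cat([vision_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```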
Input
- Input Type(s): Image, Video, Text
- Input Format(s): Image (Red, Green, Blue (RGB)), Video (.mp4), and Text (String)
- Input Parameters: Image (2D), Video (3D), Text (1D)
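The sketch below illustrates one way to shape raw inputs to these formats: an RGB image array, a fixed number of uniformly sampled video frames (32, mirroring the @32f evaluation setting later in this card), and a text prompt. The exact preprocessing used by the released model is not specified here, so treat this as an assumption-laden example.

```python
# Sketch of shaping inputs to the stated formats: RGB image (2D), video (3D)
# sampled to a fixed frame count, and a text prompt (1D string).
# The 32-frame default mirrors the @32f KPI setting; the released model's
# actual preprocessing pipeline is not documented here.
import cv2
import numpy as np

def load_image_rgb(path: str) -> np.ndarray:
    bgr = cv2.imread(path)                       # OpenCV loads as BGR
    return cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)  # (H, W, 3) RGB array

def sample_video_frames(path: str, num_frames: int = 32) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:                          # uniform temporal sampling
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)                      # (T, H, W, 3)

image = load_image_rgb("scene.jpg")
video = sample_video_frames("clip.mp4")
prompt = "Describe the key events in this video."
```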
Output
- Output Type(s): Text
- Output Formats: String
- Output Parameters: 1D
- Other Properties Related to Output: N/A
Software Integration
- Runtime Engine(s): TensorRT-LLM
- Supported Hardware Architecture(s): NVIDIA Hopper
- Supported Operating System(s): Linux
Model Versions
- Cosmos-Nemotron-SigLIP-Yi-34B
Training Dataset
NV-Pretraining and NV-CosmosNemotron-SFT data were used.
Additionally, the commercial subsets of the following datasets were used:
- OASST1
- OASST2
- Localized Narratives
- TextCaps
- TextVQA
- RefCOCO
- VQAv2
- GQA
- SynthDoG-en
- A-OKVQA
- WIT
- CLEVR
- CLEVR-X
- CLEVR-Math
- ScreenQA
- WikiSQL
- WikiTableQuestions
- RenderedText
- FinQA
- TAT-QA
- Dolly
- Websight
- RAVEN
- VizWiz
- Inter-GPS
- YouCook2
- ActivityNet Captions
- Video Localized Narratives
- CLEVRER
- Perception Test
- Next-QA
- Kinetics-400
Data Collection Method by dataset:
- Hybrid: Human, Automatic/Sensors
Labeling Method by dataset:
- Hybrid: Human, Automatic/Sensors
Properties:
- NV-Pretraining data was collected from a 5M-image subsample of the NV-CLIP dataset. Stage 3 NV-SFT data contains 2.8M images, all commercially licensed, with 3.58M annotations. Additionally, 355K commercially licensed videos with 400K annotations were used.
Evaluation Data
Data Collection Method by dataset:
- Hybrid: Human, Automatic/Sensors
Labeling Method by dataset:
- Hybrid: Human, Automatic/Sensors
Properties:
- A collection of benchmarks, including academic VQA benchmarks and recent benchmarks proposed specifically for instruction-following LMMs (large multimodal models).
- VQAv2
- GQA
- ScienceQA Image
- TextVQA
- POPE
- MME
- SEED-Bench
- MMMU
- Video MME
- Egoschema
- Perception Test
Methodology and KPI
All values are accuracy (%) except MME, which reports a cumulative score.

| Benchmark | Score |
|---|---|
| VQAv2 | 81.70 |
| GQA | 62.13 |
| ScienceQA Image | 79.62 |
| TextVQA | 71.14 |
| POPE (Popular) | 85.61 |
| MME | 1649.62 |
| SEED | 70.36 |
| SEED Image | 74.12 |
| MMMU val (beam 5) | 47.33 |
| SEED Video | 58.21 |
| VideoMME w/o Sub @32f | 57.85 |
| VideoMME w/ Sub @32f | 60.67 |
| Egoschema (val) | 63.8 |
| Perception Test | 61.76 |
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Promise and the Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.