Model Overview
Description
The Llama 3.2 NeMo Retriever Multimodal Embedding 1B model is optimized for multimodal question-answering retrieval. The model can embed 'documents' in the form of image, text, or image and text combined. Documents can be retrieved given a user query in text form. The model supports images containing text, tables, charts, and infographics. This model was evaluated on ViDoRe V1 and two internal multimodal retrieval benchmarks.
An embedding model is a crucial component of a retrieval system because it transforms information into dense vector representations. An embedding model is typically a transformer encoder that processes input tokens (text or image, for example a question or a passage) to produce an embedding. The Llama 3.2 NeMo Retriever Multimodal Embedding 1B model combines a language model and a vision model.
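To make the role of the embedding model concrete, the sketch below shows how a retrieval system can rank precomputed document embeddings against a query embedding by cosine similarity. The `embed` function is a hypothetical stand-in for the model, not its actual API, and the random vectors are purely illustrative.

```python
# Minimal sketch of dense retrieval with an embedding model (illustrative only).
# `embed` stands in for any encoder that maps a text or image "document" to a
# fixed-size vector; the real model produces vectors of up to 2048 dimensions.
import numpy as np

def embed(item: str) -> np.ndarray:
    # Placeholder: a real system would call the embedding model here.
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.standard_normal(2048)
    return v / np.linalg.norm(v)

documents = ["page with a revenue table", "chart of quarterly growth", "plain text paragraph"]
doc_vectors = np.stack([embed(d) for d in documents])   # precomputed offline

query_vector = embed("What was the quarterly revenue?")
scores = doc_vectors @ query_vector                      # cosine similarity (vectors are unit-norm)
best = documents[int(np.argmax(scores))]                 # top-1 retrieved document
print(best)
```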
The Llama 3.2 NeMo Retriever Multimodal Embedding 1B model is part of the NVIDIA NeMo Retriever collection of NIM microservices, which provides state-of-the-art, commercially-ready models and microservices optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can readily customize them for domain-specific use cases, such as information technology, human resources help assistants, and research & development assistants.
This model is ready for commercial use.
License/Terms of use
GOVERNING TERMS: Access to this trial is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License.
ADDITIONAL INFORMATION: Llama 3.2 Community License Agreement. Built with Llama.
Deployment Geography:
Global
Use Case:
The Llama 3.2 NeMo Retriever Multimodal Embedding 1B model is most suitable for users who want to build a multimodal question-and-answer application over a large corpus, leveraging the latest dense retrieval technologies.
Release Date:
Build.Nvidia.com: May 20, 2025 via https://build.nvidia.com/nvidia/llama-3_2-nemoretriever-1b-vlm-embed-v1
NGC: May 20, 2025
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Model Architecture
Architecture Type: Transformer
Network Architecture: Fine-tuned MultiModal Llama 3.2 1B Retriever
This NeMo Retriever embedding model is a transformer encoder. It is a fine-tuned version of Llama 3.2 1B combined with SigLip2 400M, with 16 layers and an embedding size of 2048, trained on public datasets. Embedding models for text retrieval are typically trained using a bi-encoder architecture, in which a query and a document are encoded independently by the embedding model. This model is trained with contrastive learning to maximize the similarity between the query and the document that contains the answer, while minimizing the similarity between the query and sampled negative documents that are not useful for answering the question.
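As a rough illustration of this objective, the sketch below implements an in-batch contrastive (InfoNCE-style) loss over independently encoded query and document embeddings. It is a minimal sketch under assumed conventions (L2-normalized embeddings, in-batch negatives, a temperature of 0.05) and does not reproduce the exact training recipe or negative-sampling strategy.

```python
# Sketch of a bi-encoder contrastive (InfoNCE-style) objective, assuming
# query_emb and doc_emb are L2-normalized embeddings produced independently
# for a batch of (query, positive document) pairs. In-batch documents other
# than a query's own positive serve as negatives.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    # Similarity matrix of shape (batch, batch); diagonal entries are the positive pairs.
    logits = query_emb @ doc_emb.T / temperature
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    # Cross-entropy pushes each query toward its own document and away from the rest.
    return F.cross_entropy(logits, targets)

# Toy usage with random, normalized embeddings of dimension 2048.
q = F.normalize(torch.randn(8, 2048), dim=-1)
d = F.normalize(torch.randn(8, 2048), dim=-1)
print(contrastive_loss(q, d).item())
```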
The vision-language model encoder incorporates key innovations from NVIDIA, including Eagle 2 work and nemoretriever-parse, which use a tiling-based VLM architecture. This architecture, available on Hugging Face, significantly enhances multimodal understanding through its dynamic tiling and mixture of vision encoders design. It particularly improves performance on tasks that involve high-resolution images and complex visual content.
Input
Property | Query | Document |
---|---|---|
Input Type | Text | Text or Image |
Input Format | List of strings | List of strings or list of images |
Input Parameter | 1D | 1D |
Other Properties | The model's maximum context length is 8192 tokens. Texts longer than the maximum length must be chunked or truncated. | The model's maximum context length is 8192 tokens. Texts longer than the maximum length must be chunked or truncated. Images may be up to 8192 x 16384 or 16384 x 8192 pixels and must be less than 25 MB; they are resized automatically by the NIM. |
Output
Output Type: Floats
Output Format: List of float arrays
Output: The model outputs an embedding vector with a maximum dimension of 2048 for each input.
Other Properties Related to Output: N/A
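The sketch below illustrates one way a client might request embeddings from the NIM, assuming an OpenAI-compatible `/v1/embeddings` endpoint with an `input_type` field and base64 data URLs for images. The endpoint URL, model name, and request fields shown are assumptions that should be verified against the NIM documentation.

```python
# Hedged sketch of querying the embedding NIM, assuming an OpenAI-compatible
# /v1/embeddings endpoint; the exact field names (e.g. "input_type") and the
# image-encoding convention should be verified against the NIM documentation.
import base64
import requests

NIM_URL = "http://localhost:8000/v1/embeddings"   # assumed local NIM deployment
MODEL = "nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1"

# Embed a text query.
resp = requests.post(NIM_URL, json={
    "model": MODEL,
    "input": ["What was the quarterly revenue?"],
    "input_type": "query",
})
query_vec = resp.json()["data"][0]["embedding"]    # list of floats, up to 2048 dims

# Embed an image document as a base64 data URL (assumed convention).
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
resp = requests.post(NIM_URL, json={
    "model": MODEL,
    "input": [f"data:image/png;base64,{image_b64}"],
    "input_type": "passage",
})
doc_vec = resp.json()["data"][0]["embedding"]
```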
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (such as GPU cores) and software frameworks (such as CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine: NeMo Retriever Embedding NIM
Supported Hardware Microarchitecture Compatibility: NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Hopper, NVIDIA Lovelace
Supported Operating System(s): Linux
Model Version(s)
Llama 3.2 NeMo Retriever Multimodal Embedding 1B
Short Name: llama-3.2-nemoretriever-1b-vlm-embed-v1
Training Dataset & Evaluation
Training Dataset
The development of large-scale, public, open-QA datasets has enabled tremendous progress in powerful embedding models. However, the following issues limit the use of these models in commercial settings.
- One popular dataset, named MS MARCO, restricts commercial licensing.
- Many multimodal datasets use synthetic data generation with proprietary models.
To address these issues, NVIDIA created its own training dataset. NVIDIA's training dataset is based on public QA datasets, and only includes datasets that have a license for commercial applications.
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
Properties: The text component was trained with semi-supervised pre-training on 12M samples from public datasets, followed by fine-tuning on 1.5M samples from public datasets. The VLM component uses only commercially-viable data from the Eagle2 training data.
Evaluation Datasets
We evaluated the NeMo Retriever Multimodal Embedding Model against both published literature and existing open-source and commercial retriever models. Our evaluation used three benchmark datasets for question-answering tasks: the public ViDoRe V1 benchmark and two internal multimodal retrieval benchmarks. For those interested in reproducing our results, one of our internal datasets (DigitalCorpora-767) can be created by following instructions in this notebook from the NeMo Retriever Extraction GitHub repository.
Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
Properties: More details on ViDoRe V1 can be found on their leaderboard. DigitalCorpora-767 is a set of 767 PDFs that have a good mixture of text, tables, and charts.
Evaluation Results
Model | # Params Vision (in M) | # Params LLM-backbone (in M) | Average Recall@5 on DigitalCorpora-767, Earnings, ViDoRe V1 |
---|---|---|---|
llama-3.2-nemoretriever-1b-vlm-embed-v1 | 429 | 1236 | 80.9% |
llamaindex/vdr-2b-multi-v1 | 665 | 1544 | 80.9% |
MrLight/dse-qwen2-2b-mrl-v1 | 665 | 1544 | 80.4% |
Alibaba-NLP/gme-Qwen2-VL-2B-Instruct | 665 | 1544 | 79.9% |
We do not compare to ColBERT-style (late-interaction) embedding models because late-interaction embeddings require a significantly larger embedding store.
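To illustrate why a late-interaction embedding store is significantly larger, the back-of-the-envelope calculation below compares one dense vector per page with one vector per token. The page count, token count, and per-token dimension are assumed values for illustration only, not measurements.

```python
# Back-of-the-envelope comparison of embedding-store size (illustrative
# assumptions, not measured numbers): a single dense vector per page versus
# one vector per token for a late-interaction (ColBERT-style) model.
NUM_PAGES = 1_000_000
BYTES_PER_DIM = 4                       # float32

dense_dim = 2048                        # one vector per page
dense_bytes = NUM_PAGES * dense_dim * BYTES_PER_DIM

tokens_per_page = 1024                  # assumed patch/token count per page
late_dim = 128                          # assumed per-token dim in a late-interaction model
late_bytes = NUM_PAGES * tokens_per_page * late_dim * BYTES_PER_DIM

print(f"dense store:            {dense_bytes / 1e9:.1f} GB")
print(f"late-interaction store: {late_bytes / 1e9:.1f} GB")   # ~64x larger under these assumptions
```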
Detailed Performance Analysis
The model's performance was evaluated across different modalities and compared with other models using various pipelines. The following table contains the detailed results for the DigitalCorpora-767 dataset:
Modality | Queries | Text-based Pipeline (llama-3.2-nv-embedqa-1b-v2) | VLM-based Pipeline (llama-3.2-nemoretriever-1b-vlm-embed-v1) |
---|---|---|---|
Multimodal | 991 | 0.845 | 0.865 |
Table | 235 | 0.753 | 0.838 |
Chart | 268 | 0.881 | 0.881 |
Text | 488 | 0.869 | 0.869 |
Inference
Engine: TensorRT
Test Hardware: H100 PCIe/SXM, A100 PCIe/SXM, L40s, L4, and A10G
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case, and address unforeseen product misuse.
For more detailed information on ethical considerations for this model, see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.