Model Overview
Description
The NVIDIA Retrieval QA Embedding Model is an embedding model optimized for text question-answering retrieval. An embedding model is a crucial component of a text retrieval system because it transforms textual information into dense vector representations. Embedding models are typically transformer encoders that process the tokens of input text (for example, a question or a passage) to produce an embedding.
NVIDIA Retrieval QA Embedding Model is a part of NVIDIA NeMo Retriever, which provides state-of-the-art, commercially-ready models and microservices, optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can also readily customize them for their domain-specific use cases, such as Information Technology, Human Resource help assistants, and Research & Development research assistants.
Terms of use
The use of this model is governed by the NVIDIA NeMo Foundational Models Evaluation License Agreement.
Reference(s)
Model Architecture
Architecture Type: Transformer
Network Architecture: Fine-tuned E5-Large-Unsupervised retriever
The NVIDIA Retrieval QA Embedding Model is a transformer encoder: a fine-tuned version of E5-Large-Unsupervised with 24 layers and an embedding size of 1024, trained on private and public datasets as described in the Training Dataset & Evaluation section. It supports a maximum input length of 512 tokens.
Embedding models for text retrieval are typically trained with a bi-encoder architecture: a pair of sentences (for example, a query and a chunked passage) is encoded independently by the embedding model. Contrastive learning is then used to maximize the similarity between the query and the passage that contains the answer, while minimizing the similarity between the query and sampled negative passages that are not useful for answering the question.
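The contrastive objective described above can be sketched as follows. This is an illustrative NumPy implementation of an InfoNCE-style loss with in-batch negatives, not the model's actual training code; the batch layout (row i of the passages is the positive for row i of the queries) and the temperature value are assumptions for the sketch.

```python
import numpy as np

def info_nce_loss(query_emb, passage_emb, temperature=0.05):
    """Contrastive (InfoNCE) loss with in-batch negatives.

    query_emb, passage_emb: (batch, dim) arrays. Row i of passage_emb is
    the positive passage for row i of query_emb; every other row in the
    batch serves as a sampled negative.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    sim = (q @ p.T) / temperature            # (batch, batch) similarity matrix
    # Softmax cross-entropy with the diagonal entries as targets.
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Minimizing this loss pushes each query toward its answer passage and away from the other passages in the batch.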
Input
Input Type: text
Input Format: list of strings
Output
Output Type: floats
Output Format: list of float arrays, each array containing the embeddings for the corresponding input string.
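The input/output contract above can be illustrated with a toy sketch. The `embed` function below is a hypothetical stand-in that returns random unit vectors of the model's embedding size (1024); a real deployment would call the embedding service instead.

```python
import numpy as np

EMBEDDING_DIM = 1024  # embedding size of the model (see Model Architecture)

def embed(texts):
    """Stand-in for the model: maps a list of strings to a list of
    1024-dimensional float arrays. This toy version returns random
    unit vectors and carries no semantics."""
    out = []
    for text in texts:
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        vec = rng.normal(size=EMBEDDING_DIM).astype(np.float32)
        out.append(vec / np.linalg.norm(vec))
    return out

queries = ["query: how do I reset my password?"]
passages = ["passage: Open Settings to reset your password.",
            "passage: GPUs accelerate matrix multiplication."]
q_vecs, p_vecs = embed(queries), embed(passages)

# Rank passages by cosine similarity (vectors are unit-normalized,
# so the dot product is the cosine similarity).
scores = [float(q_vecs[0] @ p) for p in p_vecs]
best = int(np.argmax(scores))
```

With the real model, the passage semantically closest to the query would score highest; the toy vectors here only demonstrate the shapes involved.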
Model Version(s)
NVIDIA Retrieval QA Embedding Model-4.0
Training Dataset & Evaluation
Training Dataset
The development of large-scale public open-QA datasets has enabled tremendous progress in powerful embedding models. However, one popular dataset, MS MARCO, restricts commercial use, limiting the use of models trained on it in commercial settings. To address this, we created our own internal open-domain QA dataset to train a commercially viable embedding model. For NVIDIA's proprietary data collection, we searched passages from web logs and selected a collection of passages relevant to customer use cases for annotation by the NVIDIA internal data annotation team.
To minimize redundancy in our data collection process, we selected samples that maximized relevancy distance scores and increased the diversity of the data. The pretrained E5-Large-Unsupervised embedding model was fine-tuned with contrastive learning, using the prefix "query:" for questions and "passage:" for context passages. Specifically, a mixture of English-language datasets was used, including our proprietary dataset along with selected samples from commercially viable public datasets. The AdamW optimizer was employed, with 300 warm-up steps and a learning rate of 1e-6 on a linear annealing schedule.
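The prefixing convention and learning-rate schedule described above can be sketched as follows. The document specifies 300 warm-up steps, a 1e-6 learning rate, and linear annealing, but not the exact warm-up curve or total step count, so those details are assumptions.

```python
def format_pair(question, passage):
    """Apply the "query:"/"passage:" prefixes used during fine-tuning."""
    return "query: " + question, "passage: " + passage

def lr_at_step(step, total_steps, peak_lr=1e-6, warmup_steps=300):
    """Assumed schedule: linear warm-up to peak_lr over warmup_steps,
    then linear annealing down to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - frac)
```

In a real training loop, `lr_at_step` would feed the AdamW optimizer's learning rate at every step.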
The training dataset details are as follows:
Use Case: Information retrieval for question and answering over text documents.
Data Sources:
- Public datasets licensed for commercial use.
- Text from public websites.
- Annotations created by NVIDIA’s internal team.
Language: English (US)
Domains: Knowledge, Description, Numeric (unit, time), Entity, Location, Person
Volume: 40k internal proprietary samples, 200k samples from public datasets
High Level Schema:
- query: question text
- doc: full document that contains the answer
- chunk: section of the document that contains the answer
- relevancy label: rating of how relevant the passage is to the question
- span: exact token range in the chunk that contains the answer
Evaluation Results
We evaluated the NVIDIA Retrieval QA Embedding Model against open and commercial retriever models from the literature on academic benchmarks: NQ, HotpotQA, and FiQA (Finance Q&A) from the BeIR benchmark, and the TechQA (Tech Support Q&A) dataset. In this benchmark, the metric used was Normalized Discounted Cumulative Gain at 10 (NDCG@10).
| Open & Commercial Retrieval Models | Average NDCG@10 on NQ, HotpotQA, FiQA, and TechQA |
|---|---|
| NVIDIA Retrieval QA | 57.37% |
| E5-Large-Unsupervised | 45.58% |
| BM25 | 39.97% |
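For reference, NDCG@10, the metric used in the table above, can be computed per query as follows. This is the standard textbook definition, not code from the evaluation pipeline.

```python
import math

def ndcg_at_k(ranked_relevance, k=10):
    """NDCG@k for one query. `ranked_relevance` lists the graded
    relevance of each retrieved passage, in retrieval order."""
    def dcg(rels):
        # Discounted cumulative gain: later positions count less.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(ranked_relevance, reverse=True))
    return dcg(ranked_relevance) / ideal if ideal > 0 else 0.0
```

The benchmark scores above are NDCG@10 averaged over all queries in a dataset.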
We also evaluated our embedding model on real internal customer datasets from the telco, IT, consulting, and energy industries. The metric was Recall@5, emulating a retrieval-augmented generation (RAG) scenario in which the top five most relevant passages are provided as context in the prompt for the LLM that answers the question. We compared our model's information retrieval accuracy to a number of well-known embedding models made available by the AI community, including ones trained on non-commercial datasets (marked with "*").
| Retrieval Model | Average Recall@5 on Internal Customer Datasets |
|---|---|
| NVIDIA Retrieval QA | 74.4% |
| DRAGON* | 72.7% |
| E5-Large* | 71.7% |
| BGE* | 71.1% |
| GTR* | 71.0% |
| Contriever* | 69.0% |
| GTE* | 63.9% |
| E5-Large-Unsupervised | 61.6% |
| BM25 | 55.6% |
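Recall@5, the metric in the table above, measures what fraction of a query's relevant passages appear among the top five retrieved results; a minimal sketch:

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant passages that appear in the top-k
    retrieved results for one query."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)
```

As with NDCG@10, the reported numbers are averages of this per-query score over each dataset.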
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards here. Please report security vulnerabilities or NVIDIA AI Concerns here.
Intended use
The NVIDIA Retrieval QA Embedding model is most suitable for users who want to build a question and answer application over a large text corpus, leveraging the latest dense retrieval technologies.
Ethical use: Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide their business decisions, following the guidelines in the NVIDIA AI Foundation Models Community License Agreement.
Limitations
The model was trained on data that may contain toxic language and societal biases originally crawled from the Internet. Therefore, the model may amplify those biases, for example by associating certain genders with certain social stereotypes.
Usage
The NVIDIA Retrieval QA Embedding model is meant to be deployed using the NeMo Retriever Embedding Microservice. Check out the microservice documentation for more details.
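As a rough illustration of calling an embedding service, the sketch below builds a JSON request body. The field names (including `input_type`), the model identifier, and the endpoint shape are assumptions for illustration only; the authoritative request format is defined in the NeMo Retriever Embedding Microservice documentation.

```python
import json

def build_embedding_request(texts, input_type="query"):
    """Build a hypothetical JSON request body for an embedding endpoint.
    All field names and the model name below are placeholders, not the
    microservice's confirmed API."""
    assert input_type in ("query", "passage")  # retrieval-side hint (assumed field)
    return json.dumps({
        "input": texts,                              # list of strings to embed
        "model": "nvidia-retrieval-qa-embedding",    # placeholder model name
        "input_type": input_type,
    })
```

Distinguishing queries from passages at request time mirrors the "query:"/"passage:" prefixing used during fine-tuning.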