Sarvam-m Overview
Description
Sarvam-m generates human-like text for multilingual conversation and is intended for general-purpose chat and text generation tasks.
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party’s requirements for this application and use case; see the non-NVIDIA model card at sarvamai/sarvam-m.
License/Terms of Use
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. The model is governed by the NVIDIA Community Model License Agreement; ADDITIONAL INFORMATION: Apache License Version 2.0.
You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.
Deployment Geography
Global
Use Case
This model is intended for users seeking a seamless multilingual chatting experience, particularly those interested in Indian languages and cultures. It can be used by developers, researchers, and individuals looking to leverage its advanced reasoning capabilities for coding, math, and general conversation purposes.
Release Date
Hugging Face: 05/23/2025 via sarvamai/sarvam-m.
Build.NVIDIA.com: 07/25/2025 via link
Model Architecture
Architecture Type: Hybrid-reasoning Transformer
Network Architecture: Mistral-Small
This model was developed based on Mistral-Small-3.1-24B-Base-2503; see sarvamai/sarvam-m.
Number of model parameters: 23.6B
Input
Input Type(s): Text
Input Format(s): Strings
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: The Quickstart example sets max_new_tokens=8192, which caps the number of generated tokens rather than the input length. The context length is listed under Output.
Output
Output Type(s): Text
Output Format(s): Strings
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output:
Context Length: 32,768 tokens, with sliding-window attention over 4,096 tokens.
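As a rough illustration of how the stated context length constrains a request, the sketch below checks whether a prompt plus the requested number of new tokens fits in a 32,768-token window. The helper names `fits_context` and `max_new_tokens_for` are hypothetical, and treating the limit as "prompt tokens plus generated tokens" is an assumption made for illustration, not something this card specifies.

```python
CONTEXT_LENGTH = 32_768  # context length stated in this card


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 context_length: int = CONTEXT_LENGTH) -> bool:
    """Return True if prompt plus generated tokens fit in the context window."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= context_length


def max_new_tokens_for(prompt_tokens: int,
                       context_length: int = CONTEXT_LENGTH) -> int:
    """Largest generation budget that still fits after the prompt."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(0, context_length - prompt_tokens)


if __name__ == "__main__":
    # A 1,000-token prompt leaves room for the Quickstart's 8,192 new tokens.
    print(fits_context(1_000, 8_192))
    # A 30,000-token prompt does not.
    print(fits_context(30_000, 8_192), max_new_tokens_for(30_000))
```

In practice the prompt token count would come from the tokenizer (for example, the length of `input_ids` after applying the chat template).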
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Runtime Engine: vLLM
Supported Hardware Microarchitecture Compatibility
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Lovelace
Preferred Operating System(s)
Linux
Model Version(s)
sarvam-m-v1.0
Quickstart
The following code snippet shows how to run sarvam-m with Transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "sarvamai/sarvam-m"

    # load the tokenizer and the model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    # prepare the model input
    prompt = "Who are you and what is your purpose on this planet?"

    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        enable_thinking=True,  # Switches between thinking and non-thinking modes. Default is True.
    )

    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

    # conduct text completion
    generated_ids = model.generate(**model_inputs, max_new_tokens=8192)
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :].tolist()
    output_text = tokenizer.decode(output_ids)

    # In thinking mode the reasoning comes first and is closed by "</think>";
    # "</s>" is the end-of-sequence token. removesuffix (Python 3.9+) drops only
    # that exact token, whereas rstrip("</s>") would also eat trailing "s" characters.
    if "</think>" in output_text:
        reasoning_content = output_text.split("</think>")[0].rstrip("\n")
        content = output_text.split("</think>")[-1].lstrip("\n").removesuffix("</s>")
    else:
        reasoning_content = ""
        content = output_text.removesuffix("</s>")

    print("reasoning content:", reasoning_content)
    print("content:", content)
For thinking mode, we recommend temperature=0.5; for no-think mode, temperature=0.2.
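The recommended temperatures can be wired into generation as in the minimal sketch below. The helper name `generation_kwargs` is hypothetical; `do_sample=True` is included because Transformers only applies `temperature` when sampling is enabled, and the default of 8,192 new tokens simply mirrors the Quickstart.

```python
THINK_TEMPERATURE = 0.5     # recommended above for thinking mode
NO_THINK_TEMPERATURE = 0.2  # recommended above for no-think mode


def generation_kwargs(enable_thinking: bool, max_new_tokens: int = 8192) -> dict:
    """Build keyword arguments for model.generate() for the chosen mode."""
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    temperature = THINK_TEMPERATURE if enable_thinking else NO_THINK_TEMPERATURE
    return {
        "do_sample": True,          # temperature is ignored under greedy decoding
        "temperature": temperature,
        "max_new_tokens": max_new_tokens,
    }


if __name__ == "__main__":
    print(generation_kwargs(enable_thinking=True))
    print(generation_kwargs(enable_thinking=False, max_new_tokens=1024))
```

With the Quickstart objects in scope, this would be used as `model.generate(**model_inputs, **generation_kwargs(True))`, keeping the same `enable_thinking` value that was passed to `apply_chat_template`.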
With Sarvam APIs
    from openai import OpenAI

    base_url = "https://api.sarvam.ai/v1"
    model_name = "sarvam-m"
    api_key = "Your-API-Key"  # get it from https://dashboard.sarvam.ai/

    client = OpenAI(
        base_url=base_url,
        api_key=api_key,
    ).with_options(max_retries=1)

    messages = [
        {"role": "system", "content": "You're a helpful AI assistant"},
        {"role": "user", "content": "Explain quantum computing in simple terms"},
    ]

    response1 = client.chat.completions.create(
        model=model_name,
        messages=messages,
        reasoning_effort="medium",  # Enable thinking mode. `None` for disable.
        max_completion_tokens=4096,
    )
    print("First response:", response1.choices[0].message.content)

    # Building messages for the second turn (using previous response as context)
    messages.extend(
        [
            {
                "role": "assistant",
                "content": response1.choices[0].message.content,
            },
            {"role": "user", "content": "Can you give an analogy for superposition?"},
        ]
    )

    response2 = client.chat.completions.create(
        model=model_name,
        messages=messages,
        reasoning_effort="medium",
        max_completion_tokens=8192,
    )
    print("Follow-up response:", response2.choices[0].message.content)
Refer to the Sarvam Chat Completions API docs for details.
reasoning_effort accepts three values (low, medium, and high) for consistency with the OpenAI API spec. Setting any of the three simply enables sarvam-m's thinking mode.
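Because all three values behave the same way for sarvam-m, the choice effectively reduces to an on/off switch. The sketch below shows one way to express that when building request arguments; the helper name `reasoning_kwargs` is hypothetical, and passing `None` to disable thinking follows the comment in the example above rather than any further documentation.

```python
VALID_EFFORTS = ("low", "medium", "high")


def reasoning_kwargs(thinking: bool, effort: str = "medium") -> dict:
    """Return the reasoning_effort argument for a chat completion request."""
    if not thinking:
        # None disables thinking mode, per the example above.
        return {"reasoning_effort": None}
    if effort not in VALID_EFFORTS:
        raise ValueError(f"effort must be one of {VALID_EFFORTS}, got {effort!r}")
    # Any valid value just switches thinking mode on.
    return {"reasoning_effort": effort}


if __name__ == "__main__":
    print(reasoning_kwargs(thinking=True))
    print(reasoning_kwargs(thinking=False))
```

The result can be unpacked into the request, for example `client.chat.completions.create(model=model_name, messages=messages, **reasoning_kwargs(True))`.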
vLLM Deployment
For easy deployment, use vllm>=0.8.5 and create an OpenAI-compatible API endpoint with vllm serve sarvamai/sarvam-m.
To call the vLLM server from Python, you can do the following.
    from openai import OpenAI

    # Modify OpenAI's API key and API base to use vLLM's API server.
    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    messages = [{"role": "user", "content": "Why is 42 the best number?"}]

    # By default, thinking mode is enabled.
    # If you want to disable thinking, add:
    # extra_body={"chat_template_kwargs": {"enable_thinking": False}}
    response = client.chat.completions.create(model=model, messages=messages)
    output_text = response.choices[0].message.content

    # In thinking mode the reasoning comes first and is closed by "</think>".
    if "</think>" in output_text:
        reasoning_content = output_text.split("</think>")[0].rstrip("\n")
        content = output_text.split("</think>")[-1].lstrip("\n")
    else:
        reasoning_content = ""
        content = output_text

    print("reasoning content:", reasoning_content)
    print("content:", content)

    # For the next round, add the model's response directly as assistant turn.
    messages.append({"role": "assistant", "content": output_text})
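The reasoning/answer split used in the examples can be factored into a small reusable helper, sketched below. The function name `split_reasoning` is hypothetical, and the marker strings are assumptions: `</think>` as the closing tag of the reasoning block and `</s>` as the end-of-sequence token. The sketch uses `str.removesuffix` (Python 3.9+) so that only the exact end-of-sequence token is removed.

```python
THINK_END = "</think>"  # assumed closing tag of the reasoning block
EOS = "</s>"            # assumed end-of-sequence token


def split_reasoning(output_text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning_content, content)."""
    text = output_text.removesuffix(EOS)
    if THINK_END not in text:
        # No reasoning block: the whole output is the answer.
        return "", text
    parts = text.split(THINK_END)
    reasoning = parts[0].rstrip("\n")   # text before the first closing tag
    content = parts[-1].lstrip("\n")    # text after the last closing tag
    return reasoning, content


if __name__ == "__main__":
    raw = "2 + 2 is basic addition.\n</think>\n\nThe answer is 4.</s>"
    reasoning, answer = split_reasoning(raw)
    print("reasoning content:", reasoning)
    print("content:", answer)
```

When continuing a conversation, the examples above append the unsplit `output_text` as the assistant turn, so the helper is only needed for display or logging.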
Training, Testing, and Evaluation Datasets
Training Dataset
- Link: Undisclosed
- Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated
- Labeling Method by dataset: Hybrid: Human, Automated
- Properties: The model was trained on a significant volume of data, with a substantial portion dedicated to Indian languages. Approximately one-third of the training samples were in 11 Indic languages (Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu). Specifically, 30% of coding, math, and reasoning prompts, and 50% of the remaining prompts were translated into these languages. The datasets were designed to improve the model's capabilities in logical reasoning, mathematical problem-solving, and multilingual conversational tasks. No sensors were used for data collection.
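The translation ratios above imply an overall translated share that depends on how the prompts split between the two groups. The sketch below works through that arithmetic. It assumes, purely for illustration, that every prompt falls into exactly one of the two groups and that translated prompts are the only source of Indic-language samples; the function names and the example mix are hypothetical, and the card does not state the actual mix.

```python
REASONING_TRANSLATED = 0.30  # share of coding, math, and reasoning prompts translated
OTHER_TRANSLATED = 0.50      # share of the remaining prompts translated


def translated_share(reasoning_fraction: float) -> float:
    """Overall fraction of prompts translated, given the reasoning-prompt fraction."""
    if not 0.0 <= reasoning_fraction <= 1.0:
        raise ValueError("reasoning_fraction must be between 0 and 1")
    return (REASONING_TRANSLATED * reasoning_fraction
            + OTHER_TRANSLATED * (1.0 - reasoning_fraction))


def reasoning_fraction_for(target_share: float) -> float:
    """Invert translated_share: the mix that yields a given overall share."""
    low, high = REASONING_TRANSLATED, OTHER_TRANSLATED
    if not low <= target_share <= high:
        raise ValueError("target share must lie between 0.30 and 0.50")
    return (high - target_share) / (high - low)


if __name__ == "__main__":
    # Hypothetical mix: half the prompts are coding/math/reasoning.
    print(round(translated_share(0.5), 3))
    # Mix that would give the "approximately one-third" figure under these assumptions.
    print(round(reasoning_fraction_for(1 / 3), 3))
```

Under these simplifying assumptions the overall share always lies between 30% and 50%, and a one-third share corresponds to a prompt set dominated by coding, math, and reasoning prompts.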
Testing Dataset
- Link: Undisclosed
- Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated
- Labeling Method by dataset: Human
- Properties:
The testing datasets comprise a wide range of tasks and quantities:
- IFEval: Over 500 prompts to test instruction-following capabilities.
- GSM8K: 8,500 high-quality, linguistically diverse grade-school math word problems.
- MATH: 12,500 challenging math competition problems.
- Big Bench Hard (BBH): A set of 23 difficult tasks designed to test reasoning.
- MMLU: A comprehensive benchmark with multiple-choice questions covering 57 different subjects.
- HellaSwag: A dataset of 70,000 multiple-choice questions for commonsense natural language inference.
- WinoGrande: A collection of 44,000 problems for commonsense reasoning.
- ARC (AI2 Reasoning Challenge): A set of 7,787 science questions from grade-school examinations.
- TruthfulQA: A benchmark consisting of 817 questions across 38 categories to measure a model's truthfulness.
- Indic Language Benchmarks: Various datasets were used to evaluate performance in 11 Indian languages, including translations of standard benchmarks and Indic-specific evaluations.
Evaluation Dataset
- Link: Undisclosed
- Data Collection Method by dataset: Hybrid: Human, Synthetic, Automated
- Labeling Method by dataset: Hybrid: Human, Automated
- Properties:
- IFEval: Contains over 500 prompts designed to test a model's ability to adhere to complex instructions.
- GSM8K: A dataset of 8,500 high-quality, linguistically diverse grade-school math word problems requiring multi-step reasoning.
- MATH: A challenging dataset of 12,500 problems from mathematics competitions.
- Big Bench Hard (BBH): A collection of 23 difficult tasks that are beyond the capabilities of most current language models.
- MMLU: A massive multitask benchmark consisting of multiple-choice questions across 57 subjects, designed to test a model's general knowledge and problem-solving abilities.
- TruthfulQA: Comprises 817 questions across 38 categories, designed to measure a model's tendency to produce truthful answers.
- Indic Language Benchmarks: The model was also evaluated on its performance in 11 Indian languages, showing significant improvement over baseline models on various language tasks.
Benchmark Results
Results on multilingual benchmarks for 21 European languages with instruction-tuned models:
Model | Avg. | EU21-ARC | EU21-HeSw | EU21-TQA | EU21-MMLU |
---|---|---|---|---|---|
Meta-Llama-3.1-8B-Instruct | 0.563 | 0.563 | 0.579 | 0.532 | 0.576 |
Mistral-7B-Instruct-v0.3 | 0.527 | 0.530 | 0.538 | 0.548 | 0.491 |
Salamandra-7B-Instruct | 0.543 | 0.595 | 0.637 | 0.482 | 0.459 |
Aya-23-8B | 0.485 | 0.475 | 0.535 | 0.476 | 0.455 |
Occiglot-7B-eu5-Instruct | 0.475 | 0.484 | 0.519 | 0.471 | 0.428 |
Pharia-1-LLM-7B-C-A | 0.417 | 0.396 | 0.438 | 0.469 | 0.366 |
Bloomz-7B1 | 0.358 | 0.316 | 0.354 | 0.461 | 0.302 |
Teuken-7B-instruct-commercial-v0.4 | 0.531 | 0.569 | 0.620 | 0.503 | 0.430 |
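The Avg. column appears to be the unweighted mean of the four task columns. The sketch below checks that reading against the rows of the table; the interpretation is an assumption, and the 0.001 tolerance allows for the published scores being rounded to three decimals.

```python
# (Avg., EU21-ARC, EU21-HeSw, EU21-TQA, EU21-MMLU), copied from the table above
RESULTS = {
    "Meta-Llama-3.1-8B-Instruct": (0.563, 0.563, 0.579, 0.532, 0.576),
    "Mistral-7B-Instruct-v0.3": (0.527, 0.530, 0.538, 0.548, 0.491),
    "Salamandra-7B-Instruct": (0.543, 0.595, 0.637, 0.482, 0.459),
    "Aya-23-8B": (0.485, 0.475, 0.535, 0.476, 0.455),
    "Occiglot-7B-eu5-Instruct": (0.475, 0.484, 0.519, 0.471, 0.428),
    "Pharia-1-LLM-7B-C-A": (0.417, 0.396, 0.438, 0.469, 0.366),
    "Bloomz-7B1": (0.358, 0.316, 0.354, 0.461, 0.302),
    "Teuken-7B-instruct-commercial-v0.4": (0.531, 0.569, 0.620, 0.503, 0.430),
}


def task_mean(task_scores) -> float:
    """Unweighted mean of the per-task scores."""
    return sum(task_scores) / len(task_scores)


def max_avg_deviation(results) -> float:
    """Largest gap between a reported Avg. and the recomputed task mean."""
    return max(abs(row[0] - task_mean(row[1:])) for row in results.values())


if __name__ == "__main__":
    for name, row in RESULTS.items():
        print(f"{name}: reported {row[0]:.3f}, recomputed {task_mean(row[1:]):.4f}")
    print("consistent within rounding:", max_avg_deviation(RESULTS) <= 0.001)
```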
Inference
Acceleration Engine: vLLM
Test Hardware:
- L40S x2
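A back-of-the-envelope estimate helps explain why two GPUs are a natural fit for a 23.6B-parameter model. The sketch below computes the memory needed for the weights alone. It assumes 16-bit weights and 48 GB of memory per L40S, neither of which is stated in this card, and it ignores the KV cache, activations, and runtime overhead, so real requirements are higher.

```python
NUM_PARAMS = 23.6e9  # parameter count stated in this card
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}
GPU_MEMORY_GB = 48.0  # assumed memory per L40S; not stated in this card


def weight_memory_gb(num_params: float, dtype: str = "bfloat16") -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(num_gpus: int, num_params: float = NUM_PARAMS,
                dtype: str = "bfloat16") -> float:
    """Memory left for KV cache and activations after loading the weights."""
    return num_gpus * GPU_MEMORY_GB - weight_memory_gb(num_params, dtype)


if __name__ == "__main__":
    print(f"weights: {weight_memory_gb(NUM_PARAMS):.1f} GB")
    print(f"headroom on 1 GPU: {headroom_gb(1):.1f} GB")
    print(f"headroom on 2 GPUs: {headroom_gb(2):.1f} GB")
```

Under these assumptions the 16-bit weights nearly fill a single GPU, while two GPUs leave roughly half the combined memory free. With vLLM, a model is typically split across GPUs using tensor parallelism (for example, the `--tensor-parallel-size` option of `vllm serve`); whether the test setup used it is not stated here.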
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.