Ministral 3 14B Instruct 2512
Description
Ministral 3 14B Instruct 2512 FP8 is the largest model in the Ministral 3 family, offering frontier capabilities and performance comparable to its larger Mistral Small 3.2 24B counterpart. A powerful and efficient language model with vision capabilities, this instruct post-trained version in FP8 precision is tuned to follow instructions, making it ideal for chat and assistant-style use cases.
The FP8 quantization enables deployment with reduced memory requirements while maintaining model quality: the model fits in 24 GB of VRAM, and in less if further quantized.
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration:
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Ministral 3 14B Instruct 2512 FP8 Model Card.
License and Terms of Use:
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License. Additional Information: Apache License Version 2.0.
Deployment Geography:
Global
Use Case:
Designed for private AI deployments where advanced capabilities meet practical hardware constraints. Ideal for private/custom chat and AI assistant deployments in constrained environments, advanced local agentic use cases, fine-tuning and specialization, and bringing advanced AI capabilities to edge environments. The FP8 quantization enables efficient deployment on resource-constrained hardware.
Release Date:
Build.NVIDIA.com: 12/2025 via link
Hugging Face: 12/2025 via link
Reference(s):
- Ministral 3 Family on Hugging Face
- Ministral 3 14B Instruct BF16 Version
- Ministral 3 - More Formats Collection
- vLLM Framework
- Mistral Common Library
- System Prompt Configuration
Ministral 3 Family
| Model Name | Type | Precision | Link |
|---|---|---|---|
| Ministral 3 3B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 3B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 3B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
| Ministral 3 8B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 8B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 8B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
| Ministral 3 14B Base 2512 | Base pre-trained | BF16 | Hugging Face |
| Ministral 3 14B Instruct 2512 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 14B Instruct 2512 FP8 | Instruct post-trained | FP8 | Hugging Face |
| Ministral 3 14B Reasoning 2512 | Reasoning capable | BF16 | Hugging Face |
Other formats and quantizations available in the Ministral 3 - More Formats collection.
Model Architecture:
Architecture Type: Transformer
Network Architecture: Ministral (13.5B Language Model + 0.4B Vision Encoder)
Total Parameters: 14B
Active Parameters: 14B
Vocabulary Size: Undisclosed
Base Model: mistralai/Ministral-3-14B-Base-2512
Input:
Input Types: Image, Text
Input Formats: Red, Green, Blue (RGB), String
Input Parameters: Two Dimensional (2D), One Dimensional (1D)
Other Input Properties: Supports multimodal input with up to 10 images per prompt. Images are processed through a 0.4B vision encoder. Text inputs support multilingual content (English, French, Spanish, German, Italian, Portuguese, Dutch, Chinese, Japanese, Korean, Arabic). Recommended system prompt configuration available in repository.
Input Context Length (ISL): 262,144
Output:
Output Types: Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Output Properties: Supports native function calling and JSON output formatting. Best results achieved with temperature=0.15. Strong system prompt adherence for tailored responses.
Output Context Length (OSL): Undisclosed
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- vLLM: 0.12.0 or higher (recommended)
- Transformers: Latest version with mistral-common >= 1.8.6
Supported Hardware:
- NVIDIA Ampere: A100 (80GB), A40, A30, A10, GeForce RTX 3090
- NVIDIA Blackwell: B200, B100, GB200
- NVIDIA Hopper: H100, H200
- NVIDIA Lovelace: L40, L40S, L4, GeForce RTX 4090, RTX 6000 Ada Generation
Operating Systems: Linux
Additional Testing Statement:
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s)
v1.0 (December 2025)
Training, Testing, and Evaluation Datasets:
Training Dataset
Data Modality: Undisclosed
Training Data Collection: Undisclosed
Training Labeling: Undisclosed
Training Properties: Undisclosed
Testing Dataset
Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed
Evaluation Dataset
Evaluation Benchmark Score: Benchmark results are provided below comparing Ministral 3 14B to similarly sized models across reasoning, instruct, and base model evaluations.
Reasoning Benchmarks
| Model | AIME25 | AIME24 | GPQA Diamond | LiveCodeBench |
|---|---|---|---|---|
| Ministral 3 14B | 0.850 | 0.898 | 0.712 | 0.646 |
| Qwen3-14B (Thinking) | 0.737 | 0.837 | 0.663 | 0.593 |
| Ministral 3 8B | 0.787 | 0.860 | 0.668 | 0.616 |
| Qwen3-VL-8B-Thinking | 0.798 | 0.860 | 0.671 | 0.580 |
| Ministral 3 3B | 0.721 | 0.775 | 0.534 | 0.548 |
| Qwen3-VL-4B-Thinking | 0.697 | 0.729 | 0.601 | 0.513 |
Instruct Benchmarks
| Model | Arena Hard | WildBench | MATH Maj@1 | MM MTBench |
|---|---|---|---|---|
| Ministral 3 14B | 0.551 | 68.5 | 0.904 | 8.49 |
| Qwen3 14B (Non-Thinking) | 0.427 | 65.1 | 0.870 | NOT MULTIMODAL |
| Gemma3-12B-Instruct | 0.436 | 63.2 | 0.854 | 6.70 |
| Ministral 3 8B | 0.509 | 66.8 | 0.876 | 8.08 |
| Qwen3-VL-8B-Instruct | 0.528 | 66.3 | 0.946 | 8.00 |
| Ministral 3 3B | 0.305 | 56.8 | 0.830 | 7.83 |
| Qwen3-VL-4B-Instruct | 0.438 | 56.8 | 0.900 | 8.01 |
| Qwen3-VL-2B-Instruct | 0.163 | 42.2 | 0.786 | 6.36 |
| Gemma3-4B-Instruct | 0.318 | 49.1 | 0.759 | 5.23 |
Base Model Benchmarks
| Model | Multilingual MMLU | MATH CoT 2-Shot | AGIEval 5-shot | MMLU Redux 5-shot | MMLU 5-shot | TriviaQA 5-shot |
|---|---|---|---|---|---|---|
| Ministral 3 14B | 0.742 | 0.676 | 0.648 | 0.820 | 0.794 | 0.749 |
| Qwen3 14B Base | 0.754 | 0.620 | 0.661 | 0.837 | 0.804 | 0.703 |
| Gemma 3 12B Base | 0.690 | 0.487 | 0.587 | 0.766 | 0.745 | 0.788 |
| Ministral 3 8B | 0.706 | 0.626 | 0.591 | 0.793 | 0.761 | 0.681 |
| Qwen 3 8B Base | 0.700 | 0.576 | 0.596 | 0.794 | 0.760 | 0.639 |
| Ministral 3 3B | 0.652 | 0.601 | 0.511 | 0.735 | 0.707 | 0.592 |
| Qwen 3 4B Base | 0.677 | 0.405 | 0.570 | 0.759 | 0.713 | 0.530 |
| Gemma 3 4B Base | 0.516 | 0.294 | 0.430 | 0.626 | 0.589 | 0.640 |
Evaluation Data Collection: Automated
Evaluation Labeling: Automated
Evaluation Properties: Standard industry benchmarks for reasoning (AIME25, AIME24, GPQA Diamond, LiveCodeBench), instruct (Arena Hard, WildBench, MATH Maj@1, MM MTBench), and base model performance (Multilingual MMLU, MATH CoT 2-Shot, AGIEval, MMLU Redux, MMLU, TriviaQA). Complete benchmark results available in the source model card linked above.
Inference
Acceleration Engine: Other (vLLM with mistral-common tokenizer)
Test Hardware: Due to its size and the FP8 format of its weights, Ministral 3 14B Instruct 2512 can run on a single H200 GPU. The model requires approximately 24 GB of VRAM in FP8 precision, and less if further quantized, making it suitable for edge devices and local deployments.
Additional Details
Recommended Deployment Settings
- Temperature: 0.15 (recommended for best results)
- Max Tokens: Up to 262,144 (256K context window)
- System Prompt: Use the provided SYSTEM_PROMPT.txt for general assistant use
- Image Limit: Up to 10 images per prompt
Usage
The model can be used with the following frameworks:
- vLLM (recommended): see the vLLM section below
- Transformers: see the Transformers section below
Note 1: We recommend using a relatively low temperature, such as temperature=0.15.
Note 2: Make sure to add a system prompt to the model to best tailor it to your needs. If you want to use the model as a general assistant, we recommend using the one provided in the SYSTEM_PROMPT.txt file.
vLLM (recommended)
We recommend using this model with vLLM.
Installation
Make sure to install vLLM >= 0.12.0:
pip install vllm --upgrade
Doing so should automatically install mistral_common >= 1.8.6.
To check:
python -c "import mistral_common; print(mistral_common.__version__)"
You can also make use of a ready-to-go Docker image available on Docker Hub.
Serve
We recommend that you use Ministral 3 14B in a server/client setting.
- Spin up a server:
vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
--enable-auto-tool-choice --tool-call-parser mistral
Note: Due to its size and the FP8 format of its weights, Ministral 3 14B Instruct 2512 can run on a single H200 GPU.
Additional flags:
- You can set `--max-model-len` to preserve memory. By default it is set to `262144`, which is quite large but not necessary for most scenarios.
- You can set `--max-num-batched-tokens` to balance throughput and latency; higher means higher throughput but also higher latency.
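For example, a serve command with both flags set (the values below are illustrative; tune them for your hardware and workload):
vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
    --enable-auto-tool-choice --tool-call-parser mistral \
    --max-model-len 65536 --max-num-batched-tokens 8192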
- To query the server, you can use a simple Python snippet. See the following examples.
Vision reasoning
Leverage the vision capabilities of Ministral 3 14B Instruct 2512 to make the best choice given a scenario.
Python snippet
from datetime import datetime, timedelta
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
TEMP = 0.15
MAX_TOK = 262144
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
file_path = hf_hub_download(repo_id=repo_id, filename=filename)
with open(file_path, "r") as file:
system_prompt = file.read()
today = datetime.today().strftime("%Y-%m-%d")
yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
model_name = repo_id.split("/")[-1]
return system_prompt.format(name=model_name, today=today, yesterday=yesterday)
model_id = "mistralai/Ministral-3-14B-Instruct-2512"
SYSTEM_PROMPT = load_system_prompt(model_id, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": [
{
"type": "text",
"text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
},
{"type": "image_url", "image_url": {"url": image_url}},
],
},
]
response = client.chat.completions.create(
model=model,
messages=messages,
temperature=TEMP,
max_tokens=MAX_TOK,
)
print(response.choices[0].message.content)
Function calling
Ministral 3 14B Instruct 2512 is excellent at function / tool calling tasks via vLLM.
Python snippet
import json
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
TEMP = 0.15
MAX_TOK = 262144
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
file_path = hf_hub_download(repo_id=repo_id, filename=filename)
with open(file_path, "r") as file:
system_prompt = file.read()
return system_prompt
model_id = "mistralai/Ministral-3-14B-Instruct-2512"
SYSTEM_PROMPT = load_system_prompt(model_id, "SYSTEM_PROMPT.txt")
image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"
def my_calculator(expression: str) -> str:
    # Toy calculator tool for this demo; eval() executes arbitrary Python, so never pass it untrusted input.
    return str(eval(expression))
tools = [
{
"type": "function",
"function": {
"name": "my_calculator",
"description": "A calculator that can evaluate a mathematical expression.",
"parameters": {
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "The mathematical expression to evaluate.",
},
},
"required": ["expression"],
},
},
},
{
"type": "function",
"function": {
"name": "rewrite",
"description": "Rewrite a given text for improved clarity",
"parameters": {
"type": "object",
"properties": {
"text": {
"type": "string",
"description": "The input text to rewrite",
}
},
},
},
},
]
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": [
{
"type": "text",
"text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
},
{
"type": "image_url",
"image_url": {
"url": image_url,
},
},
],
},
]
response = client.chat.completions.create(
model=model,
messages=messages,
temperature=TEMP,
max_tokens=MAX_TOK,
tools=tools,
tool_choice="auto",
)
tool_calls = response.choices[0].message.tool_calls
results = []
for tool_call in tool_calls:
function_name = tool_call.function.name
function_args = tool_call.function.arguments
if function_name == "my_calculator":
result = my_calculator(**json.loads(function_args))
results.append(result)
# Record the assistant's tool calls in the conversation, then append one tool message per result.
messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
messages.append(
{
"role": "tool",
"tool_call_id": tool_call.id,
"name": tool_call.function.name,
"content": result,
}
)
response = client.chat.completions.create(
model=model,
messages=messages,
temperature=TEMP,
max_tokens=MAX_TOK,
)
print(response.choices[0].message.content)
Instruction following
Ministral 3 14B Instruct 2512 will follow your instructions precisely.
Python snippet
from openai import OpenAI
from huggingface_hub import hf_hub_download
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
TEMP = 0.15
MAX_TOK = 262144
client = OpenAI(
api_key=openai_api_key,
base_url=openai_api_base,
)
models = client.models.list()
model = models.data[0].id
def load_system_prompt(repo_id: str, filename: str) -> str:
file_path = hf_hub_download(repo_id=repo_id, filename=filename)
with open(file_path, "r") as file:
system_prompt = file.read()
return system_prompt
model_id = "mistralai/Ministral-3-14B-Instruct-2512"
SYSTEM_PROMPT = load_system_prompt(model_id, "SYSTEM_PROMPT.txt")
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{
"role": "user",
"content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
},
]
response = client.chat.completions.create(
model=model,
messages=messages,
temperature=TEMP,
max_tokens=MAX_TOK,
)
assistant_message = response.choices[0].message.content
print(assistant_message)
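JSON output
Ministral 3 14B Instruct 2512 supports JSON output formatting (see the Output properties above). The snippet below is a minimal sketch that reuses the local vLLM server from the previous examples and requests JSON via the OpenAI-compatible response_format parameter; whether this mode is available depends on your vLLM version's structured-output support, so consult the vLLM documentation if the request is rejected.
Python snippet
from openai import OpenAI
# Point the OpenAI client at the local vLLM server, as in the examples above.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
messages = [
    {
        "role": "user",
        "content": "Extract the departure city and country from this sentence as JSON with keys 'city' and 'country': 'I flew from Paris, France to Tokyo last week.'",
    },
]
# response_format={"type": "json_object"} asks the server to constrain the output to valid JSON;
# availability depends on the vLLM version and its structured-output backend.
response = client.chat.completions.create(
    model=model,
    messages=messages,
    temperature=0.15,
    max_tokens=512,
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)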
Transformers
You can also use Ministral 3 14B Instruct 2512 FP8 with Transformers.
Transformers very recently added preliminary support for FP8, so please make sure to install it from the main branch:
uv pip install git+https://github.com/huggingface/transformers
To make the best use of our model with Transformers, make sure you have mistral-common >= 1.8.6 installed to use our tokenizer.
pip install mistral-common --upgrade
Try it out by running the following snippet.
> [!TIP]
> By default, Transformers will load the checkpoint in FP8 and dequantize it to BF16 on the fly,
> which means the model currently does not make use of accelerated FP8 kernels.
> Compatibility with accelerated FP8 kernels is currently being worked on and will be available in a couple of weeks.
> Stay tuned!
Python snippet
import torch
from transformers import Mistral3ForConditionalGeneration, MistralCommonBackend
model_id = "mistralai/Ministral-3-14B-Instruct-2512"
tokenizer = MistralCommonBackend.from_pretrained(model_id)
model = Mistral3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"
messages = [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
},
{"type": "image_url", "image_url": {"url": image_url}},
],
},
]
tokenized = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
# Move the tokenized inputs to the GPU; the image tensor is cast to bfloat16.
tokenized["input_ids"] = tokenized["input_ids"].to(device="cuda")
tokenized["pixel_values"] = tokenized["pixel_values"].to(dtype=torch.bfloat16, device="cuda")
image_sizes = [tokenized["pixel_values"].shape[-2:]]
output = model.generate(
**tokenized,
image_sizes=image_sizes,
max_new_tokens=512,
)[0]
decoded_output = tokenizer.decode(output[len(tokenized["input_ids"][0]):])
print(decoded_output)
Note:
Transformers allows you to automatically dequantize the FP8 checkpoint to BF16. To do so, simply load the model as follows:
from transformers import Mistral3ForConditionalGeneration, FineGrainedFP8Config
model_id = "mistralai/Ministral-3-14B-Instruct-2512"
model = Mistral3ForConditionalGeneration.from_pretrained(
model_id,
device_map="auto",
quantization_config=FineGrainedFP8Config(dequantize=True)
)
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please make sure you have the proper rights and permissions for all input image content; if an image includes people, personal health information, or intellectual property, the generated output will not blur or maintain proportions of the image subjects included.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
