Phi-4-mini-flash-reasoning Overview
Description
Phi-4-mini-flash-reasoning is a lightweight, 3.8 B‑parameter open model built upon synthetic, reasoning‑dense data and fine‑tuned specifically for advanced mathematical reasoning. It belongs to the Phi‑4 family and supports 64 K‑token context length.
This model is ready for commercial/non-commercial use.
Third‑Party Community Consideration
This model is not owned or developed by NVIDIA. It has been produced to a third-party’s requirements for this application and use-case. See the external card: Phi-4-mini-flash-reasoning Model Card.
License / Terms of Use
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License. Additional Information: MIT.
Deployment Geography
Global
Intended Uses
Primary Use Cases
- Multi‑step, logic‑intensive mathematical problem solving under memory/compute‑constrained and latency‑bound scenarios.
- Formal proof generation, symbolic computation, advanced word‑problem solutions, and other math‑reasoning tasks that require deep analytical thinking.
Use‑Case Considerations
This model is designed and evaluated for math reasoning only. Developers should carefully evaluate accuracy, safety, fairness, multilingual performance, and legal compliance before use—especially in high‑risk domains.
Release Date
Build.NVIDIA.com: 07/18/2025
Hugging Face: 07/09/2025
References
Model Architecture
- Architecture Type: Transformer‑based “Hybrid SambaY” with Differential Attention.
- Parameters: 3.8 B.
- Vocabulary: 200,064 tokens.
- Context Length: 64K tokens.
- Key Components: State‑space modules, grouped‑query attention, gated memory sharing, shared KV cache with one global‑attention layer, shared input/output embeddings.
- Base Model: Phi-4-mini-flash
Inputs
Input Type(s): Text
Input Formats: String
Input Parameters: One-Dimensional (1D)
Other Properties Related to Input: Up to 64K tokens context length. It is best suited for prompts using the chat format.
Output
Output Type(s): Text
Output Formats: String
Output Parameters: One-Dimensional (1D)
Other Properties Related to Output: Generates step-by-step math reasoning up to 32K tokens
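The input and output limits above share one context window, so a long prompt leaves less room for generation. The following is a minimal sketch of that arithmetic. It assumes "64K" means 65,536 tokens and "32K" means 32,768 tokens (the card does not state exact values), and `generation_budget` is a hypothetical helper name, not part of any library.

```python
# Assumed exact values; the card only says "64K" and "32K".
CONTEXT_WINDOW = 64 * 1024   # total tokens shared by prompt and output
MAX_GENERATION = 32 * 1024   # longest reasoning trace mentioned in the card


def generation_budget(prompt_tokens: int) -> int:
    """Return how many new tokens can be requested for a prompt of this size."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    remaining = CONTEXT_WINDOW - prompt_tokens
    # Never exceed the generation cap, and never return a negative budget.
    return max(0, min(MAX_GENERATION, remaining))


print(generation_budget(2_048))   # short prompt: limited by MAX_GENERATION
print(generation_budget(40_000))  # long prompt: limited by remaining context
```

The returned value could be passed as `max_new_tokens` when generating, so that the prompt and the output together stay within the assumed window.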
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration
Supported Hardware
Note that by default, the Phi-4-mini-flash-reasoning model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
- NVIDIA A100 (Ampere)
- NVIDIA H100 (Hopper)
Supported Operating System(s): Linux
Model Version:
phi-4-mini-flash-reasoning v1.0
Training, Testing & Evaluation Datasets
Training Dataset:
Data Collection Method by dataset: Hybrid: Synthetic, Automated, Human
Labeling Method by dataset: Hybrid: Synthetic, Automated
Properties:
- Quantity: ≈ 1,000,000 math problems, ≈ 30B text tokens
- Description: Three components:
  - Curated publicly-available math questions + base SFT data
  - DeepSeek-R1 synthetic math corpus spanning middle-school → Ph.D. difficulty
  - Balanced correct/incorrect answer pairs for preference learning
Additional Training Details:
- GPUs: Pre-training: 1024 A100-80G; Reasoning training: 128 H100-80G
- Training time: Pre-training: 14 days; Reasoning training: 2 days
- Training data: Pre-training: 5T tokens; Reasoning training: 150B tokens
- Dates: Trained in May 2025
- Status: This is a static model trained on offline datasets with the cutoff date of February 2025 for publicly available data.
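For a rough sense of scale, the GPU counts and training times listed above can be turned into GPU-hours. The sketch below assumes every GPU was busy for the full wall-clock duration, which the card does not state, so the results are upper bounds. The helper name `gpu_hours` is illustrative.

```python
def gpu_hours(num_gpus: int, days: float) -> float:
    """Upper-bound GPU-hours: every GPU busy for the whole wall-clock time."""
    return num_gpus * days * 24


pretraining = gpu_hours(1024, 14)  # A100-80G figures from the list above
reasoning = gpu_hours(128, 2)      # H100-80G figures from the list above

print(f"pre-training: {pretraining:,.0f} GPU-hours")
print(f"reasoning training: {reasoning:,.0f} GPU-hours")
```

The two numbers are not directly comparable, because the stages ran on different GPU generations.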
Testing Dataset:
Data Collection Method by dataset: Undisclosed
Labeling Method by dataset: Undisclosed
Properties: Undisclosed
Evaluation Dataset:
Benchmark Score: Pass@1 – AIME 2024 52.29 % / AIME 2025 33.59 % / Math-500 92.45 % / GPQA-Diamond 45.08 %
Data Collection Method by dataset: Public benchmark datasets
Labeling Method by dataset: Benchmark gold answers
Properties: Math-500, AIME 2024, AIME 2025, GPQA-Diamond
Evaluation Benchmarks
To understand its capabilities, the 3.8B-parameter Phi-4-mini-flash-reasoning model was compared with a set of models over a variety of reasoning benchmarks.
We use a more accurate evaluation where Pass@1 accuracy is averaged over 64 samples for AIME24/25 and 8 samples for Math500 and GPQA Diamond. A high-level overview of the model quality is as follows:
Model | AIME 2024 | AIME 2025 | Math‑500 | GPQA Diamond |
---|---|---|---|---|
DeepSeek‑R1‑Distill‑Qwen‑1.5B | 29.58 | 20.78 | 84.50 | 37.69 |
DeepSeek‑R1‑Distill‑Qwen‑7B | 53.70 | 35.94 | 93.03 | 47.85 |
DeepSeek‑R1‑Distill‑Llama‑8B | 43.96 | 27.34 | 87.48 | 45.83 |
Bespoke‑Stratos‑7B | 21.51 | 18.28 | 80.73 | 38.51 |
OpenThinker‑7B | 29.69 | 24.32 | 87.25 | 41.60 |
Phi‑4‑mini‑Reasoning (3.8 B) | 48.13 | 31.77 | 91.20 | 44.51 |
Phi‑4‑mini‑Flash‑Reasoning (3.8 B) | 52.29 | 33.59 | 92.45 | 45.08 |
Overall, with only 3.8B parameters, the model achieves a level of math and science reasoning ability similar to that of much larger models.
However, it is still fundamentally limited by its size for certain tasks. The model does not have the capacity to store much factual knowledge, so users may encounter factual errors. It may be possible to mitigate this weakness by augmenting Phi-4-mini-flash-reasoning with a search engine, particularly when using the model in RAG settings.
Benchmark Methodology
We include a brief word on methodology here, in particular on how we think about optimizing prompts. In an ideal world, we would never change any prompts in our benchmarks, so that comparisons between models are always apples-to-apples. Indeed, this is our default approach, and it is the case for the vast majority of models we have run to date. For all benchmarks, we use the same generation configuration, such as the maximum sequence length (32768) and the temperature, to keep the comparison fair.
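The averaged Pass@1 metric described earlier (64 samples per problem for AIME, 8 for Math-500 and GPQA Diamond) can be sketched as follows. For each problem, the fraction of correct samples is computed, and the benchmark score is the mean of those fractions. The function name and the toy correctness flags are illustrative only. This is not the authors' evaluation harness.

```python
from statistics import mean
from typing import Sequence


def averaged_pass_at_1(results: Sequence[Sequence[bool]]) -> float:
    """Mean over problems of the per-problem fraction of correct samples.

    results[i][j] is True if sample j for problem i was graded correct.
    The score is returned as a percentage.
    """
    if not results or any(len(samples) == 0 for samples in results):
        raise ValueError("every problem needs at least one sample")
    per_problem = [sum(samples) / len(samples) for samples in results]
    return 100.0 * mean(per_problem)


# Toy data (made up): 2 problems, 4 samples each.
toy = [
    [True, True, False, True],    # 75% of samples correct
    [False, False, True, False],  # 25% of samples correct
]
print(averaged_pass_at_1(toy))  # 50.0
```

Averaging over many samples reduces the variance that a single sampled answer per problem would introduce, which matters on small benchmarks.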
Benchmark datasets
We evaluate the model on three of the most popular reasoning benchmarks, on which the strongest reasoning models compete. Specifically:
- Math-500: This benchmark consists of 500 challenging math problems designed to test the model's ability to perform complex mathematical reasoning and problem-solving.
- AIME 2024/AIME 2025: The American Invitational Mathematics Examination (AIME) is a highly regarded math competition that features a series of difficult problems aimed at assessing advanced mathematical skills and logical reasoning. We evaluate the models on the problems from both the 2024 and 2025 examinations.
- GPQA Diamond: The Graduate-Level Google-Proof Q&A (GPQA) Diamond benchmark is a set of difficult multiple-choice questions in biology, physics, and chemistry written by domain experts. It tests graduate-level scientific reasoning rather than mathematics.
Inference:
Acceleration Engine: vLLM
Test Hardware: NVIDIA Ada Lovelace
Additional Details:
This compact 3.8 B model achieves near-linear latency growth with token count (2–3× lower latency and up to 10× higher throughput than Phi-4-mini-reasoning on a single A100-80 GB).
Model Efficiency
The two figures below compare the latency and throughput performance of the Phi-4-mini-reasoning and Phi-4-mini-flash-reasoning models under the vLLM inference framework. All evaluations were performed on a single NVIDIA A100-80GB GPU with tensor parallelism disabled (TP = 1). The Phi-4-mini-flash-reasoning model, which uses a decoder-hybrid-decoder architecture combining attention with a state space model (SSM), is significantly more computationally efficient: it achieves a 2–3× reduction in average latency and up to a 10× improvement in throughput when processing user requests with a 2K prompt length and a 32K generation length. Furthermore, the latency of Phi-4-mini-flash-reasoning grows near-linearly with the number of tokens generated (up to 32K), in contrast to the quadratic growth observed in Phi-4-mini-reasoning. These findings indicate that Phi-4-mini-flash-reasoning is more scalable and better suited for long-sequence generation tasks.
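To illustrate why the two growth curves differ, the sketch below counts abstract units of work under two simplifying assumptions that are not taken from the source: a full-attention layer reads every cached token at each decoding step, while a state space layer does a fixed amount of work per step. It is a toy cost model, not a benchmark, and it ignores the attention layers that the flash model still contains.

```python
def attention_work(prompt_len: int, gen_len: int) -> int:
    """Total cache reads when each new token attends to all earlier tokens."""
    # Decoding step t (0-based) attends to prompt_len + t earlier tokens.
    return sum(prompt_len + t for t in range(gen_len))


def ssm_work(gen_len: int, cost_per_step: int = 1) -> int:
    """Total work when every decoding step costs the same fixed amount."""
    return gen_len * cost_per_step


# Hypothetical lengths, chosen only to show the trend.
for n in (1_000, 2_000, 4_000):
    print(n, attention_work(2_000, n), ssm_work(n))
```

In this model, doubling the generation length doubles the state space work but more than doubles the attention work, which is the qualitative difference the paragraph above describes.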
Usage
Tokenizer
Phi-4-mini-flash-reasoning supports a vocabulary size of up to 200064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.
Input Formats
Given the nature of the training data, the Phi-4-mini-flash-reasoning model is best suited for prompts using the following chat format.
Chat format
This format is used for general conversation and instructions:
<|system|>Your name is Phi, an AI math expert developed by Microsoft.<|end|><|user|>How to solve 3*x^2+4*x+5=1?<|end|><|assistant|>
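The prompt string above can be assembled by hand as in the following minimal sketch. `build_chat_prompt` is a hypothetical helper written for illustration. It only reproduces the single-turn layout shown above, and the behavior for other layouts is not covered by the source. In practice, the tokenizer's `apply_chat_template` method, used in the Example section, is the safer route.

```python
from typing import Optional


def build_chat_prompt(user: str, system: Optional[str] = None) -> str:
    """Assemble a single-turn prompt using the special tokens shown above."""
    parts = []
    if system is not None:
        parts.append(f"<|system|>{system}<|end|>")
    parts.append(f"<|user|>{user}<|end|>")
    # The trailing assistant tag cues the model to begin its answer.
    parts.append("<|assistant|>")
    return "".join(parts)


prompt = build_chat_prompt(
    "How to solve 3*x^2+4*x+5=1?",
    system="Your name is Phi, an AI math expert developed by Microsoft.",
)
print(prompt)
```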
Inference with transformers
Phi-4-mini-flash-reasoning has been integrated in the 4.51.3 version of transformers. The current transformers version can be verified with: pip list | grep transformers.
Python 3.8 and 3.10 will work best.
List of required packages:
flash_attn==2.7.4.post1
torch==2.6.0
mamba-ssm==2.2.4
causal-conv1d==1.5.0.post8
transformers==4.46.1
accelerate==1.4.0
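As an alternative to inspecting `pip list` by eye, the pins above can be checked from Python. This is a minimal sketch using the standard-library `importlib.metadata` module. `check_pins` and `REQUIRED` are illustrative names, and the sketch assumes the distribution names match the package names in the list.

```python
from importlib import metadata

# Pins copied from the list above.
REQUIRED = {
    "flash_attn": "2.7.4.post1",
    "torch": "2.6.0",
    "mamba-ssm": "2.2.4",
    "causal-conv1d": "1.5.0.post8",
    "transformers": "4.46.1",
    "accelerate": "1.4.0",
}


def check_pins(required: dict) -> dict:
    """Map each package to 'ok', 'missing', or 'found <installed version>'."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == wanted else f"found {installed}"
    return report


for package, status in check_pins(REQUIRED).items():
    print(f"{package}: {status}")
```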
Phi-4-mini-flash-reasoning is also available in Azure AI Studio.
Example
After obtaining the Phi-4-mini-flash-reasoning model checkpoints, users can use this sample code for inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fix the random seed so that sampled outputs are reproducible.
torch.random.manual_seed(0)

model_id = "microsoft/Phi-4-mini-flash-reasoning"
# trust_remote_code=True allows transformers to run model code from the checkpoint repository.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": "How to solve 3*x^2+4*x+5=1?"
}]
# Render the messages in the chat format and tokenize them.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

outputs = model.generate(
    **inputs.to(model.device),
    max_new_tokens=32768,
    temperature=0.6,
    top_p=0.95,
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt.
outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
Safety Evaluation and Red-Teaming
The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed to do the safety alignment is a combination of SFT, DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories.
Phi-4-Mini-Flash-Reasoning was developed in accordance with Microsoft's responsible AI principles. Potential safety risks in the model’s responses were assessed using the Azure AI Foundry’s Risk and Safety Evaluation framework, focusing on harmful content, direct jailbreak, and model groundedness. The Phi-4-Mini-Flash-Reasoning Model Card contains additional information about our approach to safety and responsible AI considerations that developers should be aware of when using this model.
Responsible AI Considerations
Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
- Quality of Service: The Phi models are trained primarily on English text and some additional multilingual text. Languages other than English will experience worse performance, as well as performance disparities across non-English languages. English language varieties with less representation in the training data might experience worse performance than standard American English.
- Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards.
- Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
- Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the case.
- Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
- Election Information Reliability: The model has an elevated defect rate when responding to election-critical queries, which may result in incorrect or unauthoritative election-critical information being presented. We are working to improve the model's performance in this area. Users should verify information related to elections with the election authority in their region.
- Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses.
- Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift.
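One of the mitigations mentioned in the list above, limiting conversation turns, can be sketched as a simple history-trimming step applied before each request. `limit_history` is a hypothetical helper. The message layout (dicts with `role` and `content` keys) follows the chat messages used elsewhere in this card, and the cut-off policy is only one possible choice.

```python
from typing import Dict, List

Message = Dict[str, str]


def limit_history(messages: List[Message], max_messages: int) -> List[Message]:
    """Keep system messages plus at most the last `max_messages` other messages."""
    if max_messages < 1:
        raise ValueError("max_messages must be at least 1")
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    kept = dialogue[-max_messages:]
    # Drop leading assistant replies so the kept history starts with a user turn.
    while kept and kept[0]["role"] != "user":
        kept = kept[1:]
    return system + kept


history = [
    {"role": "system", "content": "You are a math assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
    {"role": "user", "content": "And 3 + 3?"},
    {"role": "assistant", "content": "6"},
    {"role": "user", "content": "And 4 + 4?"},
]
trimmed = limit_history(history, 3)
print([m["content"] for m in trimmed])
```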
Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and their cultural and linguistic context. The Phi 4 family of models are general-purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:
- Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
- High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
- Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
- Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
- Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
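The Retrieval Augmented Generation (RAG) idea mentioned in the list above can be illustrated with a deliberately small sketch: pick the passages most related to the question and place them in the prompt ahead of it. The word-overlap scoring, the function names, and the instruction wording are all assumptions made for illustration. A real system would typically use a proper search index or embedding-based retriever.

```python
from typing import Dict, List


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Rank passages by how many query words they share (toy retriever)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def grounded_messages(query: str, passages: List[str], k: int = 2) -> List[Dict[str, str]]:
    """Build chat messages that put the retrieved context before the question."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    content = f"Use only the context below.\n{context}\n\nQuestion: {query}"
    return [{"role": "user", "content": content}]


# Toy knowledge base for demonstration.
docs = [
    "A triangle has three sides.",
    "Paris is the capital of France.",
]
messages = grounded_messages("What is the capital of France?", docs, k=1)
print(messages[0]["content"])
```

The resulting `messages` list has the same shape as the one passed to the chat template in the Example section, so it could be used in its place.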
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.