
MiniMax-M2.1

Description

MiniMax-M2.1 is a large language model optimized for agentic capabilities, including coding, tool use, instruction following, and long-horizon planning. It is intended to show that high-performance agentic models need not remain closed-source, enabling developers to build autonomous applications for multilingual software development and complex multi-step workflows.

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration:

This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA MiniMax-M2.1 Model Card

License and Terms of Use:

GOVERNING TERMS: Your use of the service is governed by the NVIDIA API Catalog Terms of Service. Your use of the model is governed by the NVIDIA Open Model License Agreement. ADDITIONAL INFORMATION: Modified MIT License.

Deployment Geography:

Global

Use Case:

Developers and enterprises building autonomous AI agents for software engineering tasks, multilingual code development, automated workflows, tool calling, and long-horizon planning applications.

Release Date:

Build.NVIDIA.com: 01/2026 via link
Hugging Face: 12/20/2025 via link

Reference(s):

Model Architecture:

Architecture Type: Transformer
Network Architecture: Mixture-of-Experts Transformer
Total Parameters: 230B

Input:

Input Types: Text
Input Formats: String
Input Parameters: One Dimensional (1D)
Other Input Properties: Input text is tokenized using the model's native tokenizer. Recommended inference parameters: temperature=1.0, top_p=0.95, top_k=40.
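
As an illustration, the sampling parameters above can be passed through an OpenAI-compatible endpoint as sketched below; the base_url, api_key, and model identifier are assumptions for a self-hosted deployment.

```python
# Hypothetical request against an OpenAI-compatible server (e.g., SGLang or
# vLLM); base_url, api_key, and the model id are deployment-specific assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.1",
    messages=[{"role": "user", "content": "Summarize the SOLID principles."}],
    temperature=1.0,           # recommended values from this card
    top_p=0.95,
    extra_body={"top_k": 40},  # top_k is passed via extra_body in the OpenAI client
)
print(response.choices[0].message.content)
```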

Output:

Output Types: Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Output Properties: Generated text responses with support for tool calling and structured outputs.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engines:

  • SGLang: Recommended for serving MiniMax-M2.1
  • vLLM: Recommended for serving MiniMax-M2.1 (see the offline-inference sketch after this list)
  • Transformers: Supported for local deployment
  • Other: KTransformers
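
As a minimal sketch of the vLLM path, the offline-inference snippet below loads the model and applies the recommended sampling parameters; the repository id, tensor-parallel degree, and memory assumptions are illustrative (sized to match the H100x4 test hardware listed under Inference).

```python
# Hypothetical vLLM offline-inference sketch; the repo id and tensor-parallel
# degree are assumptions -- adjust to your checkpoint and hardware.
from vllm import LLM, SamplingParams

llm = LLM(
    model="MiniMaxAI/MiniMax-M2.1",  # assumed Hugging Face repository id
    tensor_parallel_size=4,          # mirrors the H100x4 test hardware below
    trust_remote_code=True,
)
params = SamplingParams(temperature=1.0, top_p=0.95, top_k=40, max_tokens=512)

outputs = llm.generate(["Write a Python function that reverses a string."], params)
print(outputs[0].outputs[0].text)
```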

Supported Hardware:

  • NVIDIA Ampere: A100, A6000, A40
  • NVIDIA Blackwell: B200, B100, GB200
  • NVIDIA Hopper: H100, H200
  • NVIDIA Lovelace: L40S, L40, RTX 6000 Ada Generation

Preferred/Supported Operating Systems: Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s)

MiniMax-M2.1 v2.1

Training, Testing, and Evaluation Datasets:

Training Dataset

Data Modality: Text
Training Data Collection: Undisclosed
Training Labeling: Undisclosed
Training Properties: Undisclosed

Testing Dataset

Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed

Evaluation Dataset

Evaluation Benchmark Score: MiniMax-M2.1 achieves 74.0% on SWE-bench Verified, 49.4% on Multi-SWE-bench, 72.5% on SWE-bench Multilingual, and 47.9% on Terminal-bench 2.0. The model demonstrates strong performance across coding, tool use, and full-stack development benchmarks.

Detailed Benchmark Comparison Tables

("x" indicates no reported result.)

| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (thinking) | DeepSeek V3.2 |
|---|---|---|---|---|---|---|---|
| SWE-bench Verified | 74.0 | 69.4 | 77.2 | 80.9 | 78.0 | 80.0 | 73.1 |
| Multi-SWE-bench | 49.4 | 36.2 | 44.3 | 50.0 | 42.7 | x | 37.4 |
| SWE-bench Multilingual | 72.5 | 56.5 | 68 | 77.5 | 65.0 | 72.0 | 70.2 |
| Terminal-bench 2.0 | 47.9 | 30.0 | 50.0 | 57.8 | 54.2 | 54.0 | 46.4 |

| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (thinking) | DeepSeek V3.2 |
|---|---|---|---|---|---|---|---|
| SWE-bench Verified (Droid) | 71.3 | 68.1 | 72.3 | 75.2 | x | x | 67.0 |
| SWE-bench Verified (mini-swe-agent) | 67.0 | 61.0 | 70.6 | 74.4 | 71.8 | 74.2 | 60.0 |
| SWT-bench | 69.3 | 32.8 | 69.5 | 80.2 | 79.7 | 80.7 | 62.0 |
| SWE-Perf | 3.1 | 1.4 | 3.0 | 4.7 | 6.5 | 3.6 | 0.9 |
| SWE-Review | 8.9 | 3.4 | 10.5 | 16.2 | x | x | 6.4 |
| OctoCodingbench | 26.1 | 13.3 | 22.8 | 36.2 | 22.9 | x | 26.0 |

| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 Pro |
|---|---|---|---|---|---|
| VIBE (Average) | 88.6 | 67.5 | 85.2 | 90.7 | 82.4 |
| VIBE-Web | 91.5 | 80.4 | 87.3 | 89.1 | 89.5 |
| VIBE-Simulation | 87.1 | 77.0 | 79.1 | 84.0 | 89.2 |
| VIBE-Android | 89.7 | 69.2 | 87.5 | 92.2 | 78.7 |
| VIBE-iOS | 88.0 | 39.5 | 81.2 | 90.0 | 75.8 |
| VIBE-Backend | 86.7 | 67.8 | 90.8 | 98.0 | 78.7 |

| Benchmark | MiniMax-M2.1 | MiniMax-M2 | Claude Sonnet 4.5 | Claude Opus 4.5 | Gemini 3 Pro | GPT-5.2 (thinking) | DeepSeek V3.2 |
|---|---|---|---|---|---|---|---|
| Toolathlon | 43.5 | 16.7 | 38.9 | 43.5 | 36.4 | 41.7 | 35.2 |
| BrowseComp | 47.4 | 44.0 | 19.6 | 37.0 | 37.8 | 65.8 | 51.4 |
| BrowseComp (context management) | 62.0 | 56.9 | 26.1 | 57.8 | 59.2 | 70.0 | 67.6 |
| AIME25 | 83.0 | 78.0 | 88.0 | 91.0 | 96.0 | 98.0 | 92.0 |
| MMLU-Pro | 88.0 | 82.0 | 88.0 | 90.0 | 90.0 | 87.0 | 86.0 |
| GPQA-D | 83.0 | 78.0 | 83.0 | 87.0 | 91.0 | 90.0 | 84.0 |
| HLE w/o tools | 22.2 | 12.5 | 17.3 | 28.4 | 37.2 | 31.4 | 22.2 |
| LCB | 81.0 | 83.0 | 71.0 | 87.0 | 92.0 | 89.0 | 86.0 |
| SciCode | 41.0 | 36.0 | 45.0 | 50.0 | 56.0 | 52.0 | 39.0 |
| IFBench | 70.0 | 72.0 | 57.0 | 58.0 | 70.0 | 75.0 | 61.0 |
| AA-LCR | 62.0 | 61.0 | 66.0 | 74.0 | 71.0 | 73.0 | 65.0 |
| τ²-Bench Telecom | 87.0 | 87.0 | 78.0 | 90.0 | 87.0 | 85.0 | 91.0 |

Evaluation Methodology Notes:

  • SWE-bench Verified: Tested on internal infrastructure using Claude Code, Droid, or mini-swe-agent as scaffolding. Default system prompt was overridden. Results represent the average of 4 runs.
  • Multi-SWE-Bench & SWE-bench Multilingual & SWT-bench & SWE-Perf: Tested using Claude Code as scaffolding, with default system prompt overridden. Results represent the average of 4 runs.
  • Terminal-bench 2.0: Tested using Claude Code. Full dataset verified and environmental issues fixed. Timeout limits removed, other configurations consistent with official settings. Average of 4 runs.
  • SWE-Review: Internal benchmark for code defect review covering diverse languages and scenarios. Evaluates both defect recall and hallucination rates. "Correct" only if model accurately identifies target defect with no hallucinations. Average of 4 runs.
  • OctoCodingbench: Internal benchmark for long-horizon instruction following in complex development scenarios. Uses "single-violation-failure" scoring mechanism. Average of 4 runs.
  • VIBE: Uses Claude Code as scaffolding to automatically verify interactive logic and visual effects. Unified pipeline with containerized deployment and dynamic interaction environments. Average of 3 runs.
  • Toolathlon: Evaluation protocol consistent with original paper.
  • BrowseComp: Same agent framework as WebExplorer with minor tool description fine-tuning. Uses 103-sample GAIA text-only validation subset.
  • BrowseComp (context management): When token usage exceeds 30% of the max context window, the harness retains the first AI response, the last five AI responses, and tool outputs (sketched after this list).
  • AIME25 through τ²-Bench Telecom: Based on evaluation datasets and methodology from the Artificial Analysis Intelligence Index.
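
The context-management rule above can be sketched as a simple truncation policy. The helper below is a hypothetical illustration, not the benchmark harness: the message schema and the count_tokens helper are assumptions.

```python
# Hypothetical sketch of the BrowseComp context-management rule; the message
# schema ({"role", "content"}) and count_tokens helper are assumptions.
def truncate_history(messages, count_tokens, max_context, threshold=0.30, keep_last=5):
    """Prune middle assistant turns once usage exceeds `threshold` of the window."""
    if sum(count_tokens(m) for m in messages) <= threshold * max_context:
        return messages  # under budget: keep the full history

    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    keep = set(assistant_idx[:1])            # the first AI response
    keep.update(assistant_idx[-keep_last:])  # the last five AI responses
    # Tool outputs (and other non-assistant messages) are retained as-is.
    return [m for i, m in enumerate(messages) if m["role"] != "assistant" or i in keep]
```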

Evaluation Data Collection: Hybrid: Automated, Human
Evaluation Labeling: Hybrid: Automated, Human
Evaluation Properties: See Evaluation Methodology Notes above for detailed testing conditions per benchmark.

Inference

Acceleration Engine: SGLang
Test Hardware: H100x4

Additional Details

Recommended Inference Parameters

  • Temperature: 1.0
  • Top-p: 0.95
  • Top-k: 40

Default System Prompt

You are a helpful assistant. Your name is MiniMax-M2.1 and you are built by MiniMax.
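
For local deployment with Transformers, the default system prompt and recommended sampling parameters can be combined as in the sketch below; the repository id is an assumption, and a 230B-parameter model realistically requires a multi-GPU node.

```python
# Hypothetical local-inference sketch with Hugging Face Transformers; the repo
# id is an assumption, and device_map="auto" presumes a multi-GPU node.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MiniMaxAI/MiniMax-M2.1"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant. Your name is MiniMax-M2.1 and you are built by MiniMax."},
    {"role": "user", "content": "Outline the steps to add a CLI flag to an existing tool."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, do_sample=True, temperature=1.0, top_p=0.95,
                     top_k=40, max_new_tokens=512)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```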

Tool Calling

MiniMax-M2.1 supports tool calling capabilities. Refer to the Tool Calling Guide for implementation details.
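
A minimal tool-calling sketch through an OpenAI-compatible server follows; the get_weather tool, endpoint, and model id are hypothetical, and exact tool-call parsing behavior depends on the serving engine's configuration.

```python
# Hypothetical tool-calling request via an OpenAI-compatible endpoint; the
# get_weather tool, base_url, and model id are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="MiniMaxAI/MiniMax-M2.1",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# If the model chose to call the tool, the parsed call(s) appear here.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```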

Known Capabilities

  • Multilingual software development
  • Complex multi-step office workflows
  • Long-horizon planning
  • Tool use and function calling
  • Code generation and review
  • Test case generation
  • Code performance optimization

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
