Qwen3-Next-80B-A3B-Thinking
Description
Qwen3-Next-80B-A3B-Thinking is part of the Qwen3-Next series, which features the following key enhancements:
- Hybrid Attention: Replaces standard attention with the combination of Gated DeltaNet and Gated Attention, enabling efficient context modeling for ultra-long context length.
- High-Sparsity Mixture-of-Experts (MoE): Achieves an extremely low activation ratio in MoE layers, drastically reducing FLOPs per token while preserving model capacity (see the routing sketch after this list).
- Stability Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, and other stabilizing enhancements for robust pre-training and post-training.
- Multi-Token Prediction (MTP): Boosts pretraining model performance and accelerates inference.
For more details, please refer to the Qwen3-Next blog post.
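As a rough illustration of the high-sparsity routing idea, the sketch below shows a minimal top-k router in Python. It is not the released implementation; the expert count and top-k follow the figures listed under Additional Details.

```python
import torch

def sparse_moe_route(hidden: torch.Tensor, router_weight: torch.Tensor, top_k: int = 10):
    """Pick the top-k experts per token and renormalize their routing weights.

    hidden:        [num_tokens, hidden_dim]
    router_weight: [hidden_dim, num_experts]  (e.g., num_experts = 512)
    """
    logits = hidden @ router_weight                      # [num_tokens, num_experts]
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize over selected experts
    return topk_idx, topk_probs                          # experts to run, and their mixing weights

# Example: 4 tokens routed over 512 experts, 10 active per token.
idx, weights = sparse_moe_route(torch.randn(4, 2048), torch.randn(2048, 512))
```

With only 10 of 512 routed experts (plus one shared expert) active per token, a small fraction of the MoE parameters participates in each forward pass, which is what keeps the activated parameter count near 3.9B.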
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration:
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA model card here: Qwen3-Next-80B-A3B-Thinking.
License and Terms of Use:
GOVERNING TERMS: The trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License. ADDITIONAL INFORMATION: Apache 2.0 License.
Deployment Geography:
Global
Use Case:
Qwen3-Next-80B-A3B-Thinking excels at tool calling and highly complex reasoning tasks.
Release Date:
build.nvidia.com: September 11, 2025 via Qwen3-Next-80B-A3B-Thinking
Hugging Face: September 11, 2025 via Qwen3-Next-80B-A3B-Thinking
Reference(s):
- Qwen3-Next-80B-A3B-Thinking
- Qwen3-Next-80B-A3B-Instruct
- Qwen3-Next: Towards Ultimate Training & Inference Efficiency
- Gated Delta Networks: Improving Mamba2 with Delta Rule
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
- DeepSeek-V3 Technical Report
- Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
- Efficient Streaming Language Models with Attention Sinks
- Massive Activations in Large Language Models
- Gemma: Open Models Based on Gemini Research and Technology
- Approximating Two-Layer Feedforward Networks for Efficient Transformers
- Better & Faster Large Language Models via Multi-token Prediction
Model Architecture:
Architecture Type: Other (Hybrid Transformer-Mamba)
Network Architecture: Qwen3-Next
Total Parameters: 80B
Active Parameters: 3.9B
Vocabulary Size: 151,936
Input:
Input Types: Text
Input Formats: String
Input Parameters: One Dimensional (1D)
Other Input Properties: Qwen3-Next natively supports context lengths of up to 262,144 tokens
Qwen3-Next-80B-A3B-Thinking supports only thinking mode.
To enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.
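A minimal generation sketch with Hugging Face Transformers illustrates this template behavior. It assumes the published Hugging Face model ID and enough GPU memory to load the weights; it is an example, not the only supported usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Next-80B-A3B-Thinking"  # published Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain why the birthday paradox feels surprising."}]
# The chat template already appends an opening <think> tag to the prompt, so the
# completion contains the reasoning, a closing </think>, and then the final answer.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=4096)
completion = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(completion)
```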
Output:
Output Types: Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Output Properties: Qwen3-Next natively supports context lengths of up to 262,144 tokens
Qwen3-Next-80B-A3B-Thinking may generate thinking content longer than its predecessor.
Alibaba strongly recommends its use in highly complex reasoning tasks.
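Because the reasoning portion can be long, downstream code typically separates it from the final answer. A minimal string-based splitter, assuming the output has already been decoded to text, could look like this sketch:

```python
def split_thinking(decoded_output: str) -> tuple[str, str]:
    """Split generated text into (reasoning, final_answer).

    The default chat template opens the <think> block for the model, so the
    generated text usually contains only the closing </think> tag.
    """
    if "</think>" in decoded_output:
        reasoning, answer = decoded_output.split("</think>", 1)
        return reasoning.strip(), answer.strip()
    # No closing tag, e.g., generation stopped mid-reasoning at the token limit.
    return decoded_output.strip(), ""
```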
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- Inference: SGLang, vLLM
Supported Hardware:
- NVIDIA Hopper: H20, H100
Operating Systems: Linux
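Both SGLang and vLLM can serve the model behind an OpenAI-compatible API. The sketch below shows a minimal client call; the endpoint URL, port, and served model name are deployment-specific assumptions.

```python
from openai import OpenAI

# Assumes an SGLang or vLLM server is already running locally and exposing the
# OpenAI-compatible API; adjust base_url and model to match your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Thinking",
    messages=[{"role": "user", "content": "Outline a plan to debug a flaky integration test."}],
    max_tokens=8192,
)
print(response.choices[0].message.content)
```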
Model Version(s)
Qwen3-Next-80B-A3B-Thinking v1.0 (September 11, 2025)
Training, Testing, and Evaluation Datasets:
Training Dataset
Training Data Collection: Undisclosed
Training Labeling: Undisclosed
Training Properties: Undisclosed
Testing Dataset
Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed
Evaluation Dataset
Evaluation Data Collection: Undisclosed
Evaluation Labeling: Undisclosed
Evaluation Properties: Undisclosed
Evaluation Benchmarks:
| Benchmark | Qwen3-30B-A3B-Thinking-2507 | Qwen3-32B Thinking | Qwen3-235B-A22B-Thinking-2507 | Gemini-2.5-Flash Thinking | Qwen3-Next-80B-A3B-Thinking |
|--- | --- | --- | --- | --- | --- |
| Knowledge | | | | | |
| MMLU-Pro | 80.9 | 79.1 | 84.4 | 81.9 | 82.7 |
| MMLU-Redux | 91.4 | 90.9 | 93.8 | 92.1 | 92.5 |
| GPQA | 73.4 | 68.4 | 81.1 | 82.8 | 77.2 |
| SuperGPQA | 56.8 | 54.1 | 64.9 | 57.8 | 60.8 |
| Reasoning | | | | | |
| AIME25 | 85.0 | 72.9 | 92.3 | 72.0 | 87.8 |
| HMMT25 | 71.4 | 51.5 | 83.9 | 64.2 | 73.9 |
| LiveBench 241125 | 76.8 | 74.9 | 78.4 | 74.3 | 76.6 |
| Coding | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 66.0 | 60.6 | 74.1 | 61.2 | 68.7 |
| CFEval | 2044 | 1986 | 2134 | 1995 | 2071 |
| OJBench | 25.1 | 24.1 | 32.5 | 23.5 | 29.7 |
| Alignment | | | | | |
| IFEval | 88.9 | 85.0 | 87.8 | 89.8 | 88.9 |
| Arena-Hard v2* | 56.0 | 48.4 | 79.7 | 56.7 | 62.3 |
| WritingBench | 85.0 | 79.0 | 88.3 | 83.9 | 84.6 |
| Agent | | | | | |
| BFCL-v3 | 72.4 | 70.3 | 71.9 | 68.6 | 72.0 |
| TAU1-Retail | 67.8 | 52.8 | 67.8 | 65.2 | 69.6 |
| TAU1-Airline | 48.0 | 29.0 | 46.0 | 54.0 | 49.0 |
| TAU2-Retail | 58.8 | 49.7 | 71.9 | 66.7 | 67.8 |
| TAU2-Airline | 58.0 | 45.5 | 58.0 | 52.0 | 60.5 |
| TAU2-Telecom | 26.3 | 27.2 | 45.6 | 31.6 | 43.9 |
| Multilingualism | | | | | |
| MultiIF | 76.4 | 73.0 | 80.6 | 74.4 | 77.8 |
| MMLU-ProX | 76.4 | 74.6 | 81.0 | 80.2 | 78.7 |
| INCLUDE | 74.4 | 73.7 | 81.0 | 83.9 | 78.9 |
| PolyMATH | 52.6 | 47.4 | 60.1 | 49.8 | 56.3 |
*For reproducibility, Alibaba reports the win rates evaluated by GPT-4.1.
Inference
Acceleration Engine: SGLang
Test Hardware: NVIDIA H100
Additional Details
Qwen3-Next-80B-A3B-Thinking has the following features:
- Type: Causal Language Models
- Training Stage: Pre-training (15T tokens) & Post-training
- Number of Parameters: 80B in total and 3.9B activated
- Number of Parameters (Non-Embedding): 79B
- Number of Layers: 48
- Hidden Dimension: 2048
- Hybrid Layout: 12 * (3 * (Gated DeltaNet -> MoE) -> (Gated Attention -> MoE))
- Gated Attention:
- Number of Attention Heads: 16 for Q and 2 for KV
- Head Dimension: 256
- Rotary Position Embedding Dimension: 64
- Gated DeltaNet:
- Number of Linear Attention Heads: 32 for V and 16 for QK
- Head Dimension: 128
- Mixture of Experts:
- Number of Experts: 512
- Number of Activated Experts: 10
- Number of Shared Experts: 1
- Expert Intermediate Dimension: 512
- Context Length: 262,144 natively and extensible up to 1,010,000 tokens
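For quick reference, the hyperparameters above can be collected into a plain dictionary. This is illustrative only; the key names are descriptive and do not match the official configuration file.

```python
# Illustrative summary of the hyperparameters listed above; key names are
# descriptive and are not the official configuration fields.
QWEN3_NEXT_80B_A3B_THINKING = {
    "total_params": 80e9,
    "active_params": 3.9e9,
    "num_layers": 48,
    "hidden_dim": 2048,
    # Hybrid layout: 12 x (3 x (Gated DeltaNet -> MoE) -> (Gated Attention -> MoE))
    "hybrid_block": {"deltanet_sublayers": 3, "attention_sublayers": 1, "repeats": 12},
    "gated_attention": {"q_heads": 16, "kv_heads": 2, "head_dim": 256, "rope_dim": 64},
    "gated_deltanet": {"v_heads": 32, "qk_heads": 16, "head_dim": 128},
    "moe": {"experts": 512, "activated_experts": 10, "shared_experts": 1, "expert_intermediate_dim": 512},
    "native_context_length": 262_144,
    "vocab_size": 151_936,
}
```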
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.