Qwen3-Next-80B-A3B-Thinking
Description
Qwen3-Next-80B-A3B-Thinking is part of the Qwen3-Next series, which features the following key enhancements:
- Hybrid Attention: Replaces standard attention with the combination of Gated DeltaNet and Gated Attention, enabling efficient context modeling for ultra-long context length.
- High-Sparsity Mixture-of-Experts (MoE): Achieves an extremely low activation ratio in MoE layers, drastically reducing FLOPs per token while preserving model capacity (see the routing sketch after this list).
- Stability Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, and other stabilizing enhancements for robust pre-training and post-training.
- Multi-Token Prediction (MTP): Boosts pretraining model performance and accelerates inference.
For more details, please refer to the Qwen3-Next blog post.
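As a rough illustration of the high-sparsity routing idea, the sketch below shows a minimal top-k router in Python. It is not the released implementation; the expert count and top-k follow the figures listed under Additional Details.

```python
import torch

def sparse_moe_route(hidden: torch.Tensor, router_weight: torch.Tensor, top_k: int = 10):
    """Pick the top-k experts per token and renormalize their routing weights.

    hidden:        [num_tokens, hidden_dim]
    router_weight: [hidden_dim, num_experts]  (e.g., num_experts = 512)
    """
    logits = hidden @ router_weight                      # [num_tokens, num_experts]
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)  # renormalize over selected experts
    return topk_idx, topk_probs                          # experts to run, and their mixing weights

# Example: 4 tokens routed over 512 experts, 10 active per token.
idx, weights = sparse_moe_route(torch.randn(4, 2048), torch.randn(2048, 512))
```

With only 10 of 512 routed experts (plus one shared expert) active per token, a small fraction of the MoE parameters participates in each forward pass, which is what keeps the activated parameter count near 3.9B.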
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration:
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA model card here: Qwen3-Next-80B-A3B-Thinking.
License and Terms of Use:
GOVERNING TERMS: The trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License. ADDITIONAL INFORMATION: Apache 2.0 License.
Deployment Geography:
Global
Use Case:
Qwen3-Next-80B-A3B-Thinking excels at tool calling and highly complex reasoning tasks.
Release Date:
build.nvidia.com: September 11, 2025 via Qwen3-Next-80B-A3B-Thinking
Hugging Face: September 11, 2025 via Qwen3-Next-80B-A3B-Thinking
Reference(s):
- Qwen3-Next-80B-A3B-Thinking
- Qwen3-Next-80B-A3B-Instruct
- Qwen3-Next: Towards Ultimate Training & Inference Efficiency
- Gated Delta Networks: Improving Mamba2 with Delta Rule
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
- DeepSeek-V3 Technical Report
- Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
- Efficient Streaming Language Models with Attention Sinks
- Massive Activations in Large Language Models
- Gemma: Open Models Based on Gemini Research and Technology
- Approximating Two-Layer Feedforward Networks for Efficient Transformers
- Better & Faster Large Language Models via Multi-token Prediction
Model Architecture:
Architecture Type: Other (Hybrid Transformer-Mamba)
Network Architecture: Qwen3-Next
Total Parameters: 80B
Active Parameters: 3.9B
Vocabulary Size: 151,936
Input:
Input Types: Text
Input Formats: String
Input Parameters: One Dimensional (1D)
Other Input Properties: Qwen3-Next natively supports context lengths of up to 262,144 tokens
Qwen3-Next-80B-A3B-Thinking supports only thinking mode.
To enforce model thinking, the default chat template automatically includes `<think>`. Therefore, it is normal for the model's output to contain only `</think>` without an explicit opening `<think>` tag.
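A minimal generation sketch with Hugging Face Transformers illustrates this template behavior. It assumes the published Hugging Face model ID and enough GPU memory to load the weights; it is an example, not the only supported usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-Next-80B-A3B-Thinking"  # published Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain why the birthday paradox feels surprising."}]
# The chat template already appends an opening <think> tag to the prompt, so the
# completion contains the reasoning, a closing </think>, and then the final answer.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=4096)
completion = tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(completion)
```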
Output:
Output Types: Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Output Properties: Qwen3-Next natively supports context lengths of up to 262,144 tokens
Qwen3-Next-80B-A3B-Thinking may generate thinking content longer than its predecessor.
Alibaba strongly recommends its use in highly complex reasoning tasks.
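Because the reasoning portion can be long, downstream code typically separates it from the final answer. A minimal string-based splitter, assuming the output has already been decoded to text, could look like this sketch:

```python
def split_thinking(decoded_output: str) -> tuple[str, str]:
    """Split generated text into (reasoning, final_answer).

    The default chat template opens the <think> block for the model, so the
    generated text usually contains only the closing </think> tag.
    """
    if "</think>" in decoded_output:
        reasoning, answer = decoded_output.split("</think>", 1)
        return reasoning.strip(), answer.strip()
    # No closing tag, e.g., generation stopped mid-reasoning at the token limit.
    return decoded_output.strip(), ""
```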
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- Inference: SGLang, vLLM
Supported Hardware:
- NVIDIA Hopper: H20, H100
Operating Systems: Linux
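Both SGLang and vLLM can serve the model behind an OpenAI-compatible API. The sketch below shows a minimal client call; the endpoint URL, port, and served model name are deployment-specific assumptions.

```python
from openai import OpenAI

# Assumes an SGLang or vLLM server is already running locally and exposing the
# OpenAI-compatible API; adjust base_url and model to match your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Thinking",
    messages=[{"role": "user", "content": "Outline a plan to debug a flaky integration test."}],
    max_tokens=8192,
)
print(response.choices[0].message.content)
```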
Model Version(s)
Qwen3-Next-80B-A3B-Thinking v1.0 (September 11, 2025)
Training, Testing, and Evaluation Datasets:
Training Dataset
Training Data Collection: Undisclosed
Training Labeling: Undisclosed
Training Properties: Undisclosed
Testing Dataset
Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed
Evaluation Dataset
Evaluation Data Collection: Undisclosed
Evaluation Labeling: Undisclosed
Evaluation Properties: Undisclosed
Evaluation Benchmarks:
| Benchmark | Qwen3-30B-A3B-Thinking-2507 | Qwen3-32B Thinking | Qwen3-235B-A22B-Thinking-2507 | Gemini-2.5-Flash Thinking | Qwen3-Next-80B-A3B-Thinking |
|--- | --- | --- | --- | --- | --- |
| Knowledge | | | | | |
| MMLU-Pro | 80.9 | 79.1 | 84.4 | 81.9 | 82.7 |
| MMLU-Redux | 91.4 | 90.9 | 93.8 | 92.1 | 92.5 |
| GPQA | 73.4 | 68.4 | 81.1 | 82.8 | 77.2 |
| SuperGPQA | 56.8 | 54.1 | 64.9 | 57.8 | 60.8 |
| Reasoning | | | | | |
| AIME25 | 85.0 | 72.9 | 92.3 | 72.0 | 87.8 |
| HMMT25 | 71.4 | 51.5 | 83.9 | 64.2 | 73.9 |
| LiveBench 241125 | 76.8 | 74.9 | 78.4 | 74.3 | 76.6 |
| Coding | | | | | |
| LiveCodeBench v6 (25.02-25.05) | 66.0 | 60.6 | 74.1 | 61.2 | 68.7 |
| CFEval | 2044 | 1986 | 2134 | 1995 | 2071 |
| OJBench | 25.1 | 24.1 | 32.5 | 23.5 | 29.7 |
| Alignment | | | | | |
| IFEval | 88.9 | 85.0 | 87.8 | 89.8 | 88.9 |
| Arena-Hard v2* | 56.0 | 48.4 | 79.7 | 56.7 | 62.3 |
| WritingBench | 85.0 | 79.0 | 88.3 | 83.9 | 84.6 |
| Agent | | | | | |
| BFCL-v3 | 72.4 | 70.3 | 71.9 | 68.6 | 72.0 |
| TAU1-Retail | 67.8 | 52.8 | 67.8 | 65.2 | 69.6 |
| TAU1-Airline | 48.0 | 29.0 | 46.0 | 54.0 | 49.0 |
| TAU2-Retail | 58.8 | 49.7 | 71.9 | 66.7 | 67.8 |
| TAU2-Airline | 58.0 | 45.5 | 58.0 | 52.0 | 60.5 |
| TAU2-Telecom | 26.3 | 27.2 | 45.6 | 31.6 | 43.9 |
| Multilingualism | | | | | |
| MultiIF | 76.4 | 73.0 | 80.6 | 74.4 | 77.8 |
| MMLU-ProX | 76.4 | 74.6 | 81.0 | 80.2 | 78.7 |
| INCLUDE | 74.4 | 73.7 | 81.0 | 83.9 | 78.9 |
| PolyMATH | 52.6 | 47.4 | 60.1 | 49.8 | 56.3 |
*For reproducibility, Alibaba reports the win rates evaluated by GPT-4.1.
Inference
Acceleration Engine: SGLang
Test Hardware: NVIDIA H100
Additional Details
Qwen3-Next-80B-A3B-Thinking has the following features:
- Type: Causal Language Models
- Training Stage: Pre-training (15T tokens) & Post-training
- Number of Parameters: 80B in total and 3.9B activated
- Number of Parameters (Non-Embedding): 79B
- Number of Layers: 48
- Hidden Dimension: 2048
- Hybrid Layout: 12 * (3 * (Gated DeltaNet -> MoE) -> (Gated Attention -> MoE))
- Gated Attention:
- Number of Attention Heads: 16 for Q and 2 for KV
- Head Dimension: 256
- Rotary Position Embedding Dimension: 64
- Gated DeltaNet:
- Number of Linear Attention Heads: 32 for V and 16 for QK
- Head Dimension: 128
- Mixture of Experts:
- Number of Experts: 512
- Number of Activated Experts: 10
- Number of Shared Experts: 1
- Expert Intermediate Dimension: 512
- Context Length: 262,144 natively and extensible up to 1,010,000 tokens
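For quick reference, the hyperparameters above can be collected into a plain dictionary. This is illustrative only; the key names are descriptive and do not match the official configuration file.

```python
# Illustrative summary of the hyperparameters listed above; key names are
# descriptive and are not the official configuration fields.
QWEN3_NEXT_80B_A3B_THINKING = {
    "total_params": 80e9,
    "active_params": 3.9e9,
    "num_layers": 48,
    "hidden_dim": 2048,
    # Hybrid layout: 12 x (3 x (Gated DeltaNet -> MoE) -> (Gated Attention -> MoE))
    "hybrid_block": {"deltanet_sublayers": 3, "attention_sublayers": 1, "repeats": 12},
    "gated_attention": {"q_heads": 16, "kv_heads": 2, "head_dim": 256, "rope_dim": 64},
    "gated_deltanet": {"v_heads": 32, "qk_heads": 16, "head_dim": 128},
    "moe": {"experts": 512, "activated_experts": 10, "shared_experts": 1, "expert_intermediate_dim": 512},
    "native_context_length": 262_144,
    "vocab_size": 151_936,
}
```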
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.