Qwen3.5-397B-A17B
Description
Qwen3.5-397B-A17B is a multimodal foundation model featuring a Hybrid Mixture-of-Experts architecture with early fusion vision-language training, designed for state-of-the-art performance across chat, retrieval-augmented generation, vision-language understanding, video understanding, and agentic workflows. Key highlights include:
- Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with prior text-only models and outperforms dedicated vision-language models across reasoning, coding, agents, and visual understanding benchmarks.
- Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
- Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
- Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.
- Thinking Mode: Operates in thinking mode by default, generating internal reasoning content (<think>...</think>) before producing final responses, with an option to disable for direct responses.
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration:
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see the link to the Non-NVIDIA Qwen3.5-397B-A17B Model Card.
License and Terms of Use:
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Apache 2.0 License.
Deployment Geography:
Global
Use Case:
Designed for developers and enterprises building multimodal AI applications, including conversational chat, retrieval-augmented generation (RAG), vision-language understanding and reasoning, tool/function calling, and agentic workflows.
Release Date:
Build.NVIDIA.com: 02/16/2026 via link
Hugging Face: 02/16/2026 via link
Model Architecture:
Architecture Type: Transformer (Causal Language Model with Vision Encoder)
Network Architecture: Qwen3.5 (Hybrid Mixture-of-Experts with Gated DeltaNet)
Total Parameters: 397B
Active Parameters: 17B
Vocabulary Size: 248,320
Number of Layers: 60
Hidden Dimension: 4,096
Hidden Layout: 15 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
Gated DeltaNet: 64 linear attention heads for V, 16 for QK; head dimension 128
Gated Attention: 32 heads for Q, 2 for KV; head dimension 256; RoPE dimension 64
Mixture-of-Experts: 512 total experts; 10 routed + 1 shared activated per token; expert intermediate dimension 1,024
Multi-Token Prediction (MTP): Trained with multiple prediction steps
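The hidden layout above can be read as a repeating schedule: 15 groups, each consisting of three Gated DeltaNet blocks followed by one Gated Attention block, with every block paired with an MoE feed-forward, for 60 layers in total. The minimal sketch below only reproduces that schedule for illustration; the block names are shorthand, not the actual module names used in the implementation.

```python
# Illustrative reconstruction of the published layer schedule (shorthand names,
# not the real modules): 15 x (3 Gated DeltaNet + MoE blocks, then 1 Gated
# Attention + MoE block) = 60 layers in total.
NUM_GROUPS = 15
DELTANET_BLOCKS_PER_GROUP = 3

layer_schedule = []
for _ in range(NUM_GROUPS):
    layer_schedule += ["gated_deltanet + moe"] * DELTANET_BLOCKS_PER_GROUP
    layer_schedule.append("gated_attention + moe")

assert len(layer_schedule) == 60  # matches "Number of Layers: 60"
print(layer_schedule[:4])  # first group: three DeltaNet blocks, then one attention block
```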
Input:
Input Types: Text, Image, Video
Input Formats: Text: String; Image: Red, Green, Blue (RGB); Video: MP4, WebM
Input Parameters: One-Dimensional (1D), Two-Dimensional (2D), Three-Dimensional (3D)
Other Input Properties: Supports text, image, and video inputs via a unified multimodal architecture with early fusion training on multimodal tokens. Combines a Vision Transformer (ViT) encoder with a Hybrid MoE language model featuring Gated DeltaNet layers with global attention and fine-grained MoE routing (10 routed + 1 shared expert out of 512 total).
Input Context Length (ISL): 262,144 tokens natively; extensible up to 1,010,000 tokens via YaRN RoPE scaling
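Extending beyond the native window is typically done by adding a YaRN rope_scaling entry to the checkpoint configuration. The sketch below is a hedged illustration only: the field names follow the convention used by recent Qwen-family configs, the factor of 4.0 (262,144 × 4 ≈ 1.05M positions, covering the 1,010,000-token figure above) is an assumption, and whether the entry belongs at the top level or under a nested text config for this multimodal model should be confirmed against the official model card.

```python
import json

# Hedged sketch: patch the checkpoint's config.json with a YaRN rope_scaling entry.
# All values below are assumptions to verify against the official documentation.
with open("config.json") as f:
    config = json.load(f)

config["rope_scaling"] = {
    "rope_type": "yarn",
    "factor": 4.0,                                # 262,144 * 4 covers ~1M positions
    "original_max_position_embeddings": 262144,   # native context length
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```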
Output:
Output Types: Text
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Output Properties: Generates text responses based on multimodal inputs. Operates in thinking mode by default (internal reasoning via <think>...</think> tags before final response; can be disabled). Natively supports tool/function calling, agentic workflows (Qwen-Agent, MCP servers), and multi-turn conversations.
Output Context Length (OSL): Recommended 32,768 tokens for most queries; up to 81,920 tokens for complex reasoning tasks
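Because the reasoning content arrives inside <think>...</think> tags ahead of the final answer, client code often needs to separate the two, for example to keep only the final answer in multi-turn history as recommended under Best Practices below. The helper below is a minimal sketch; serving frameworks with the qwen3 reasoning parser perform this split for you, and the function name is purely illustrative.

```python
import re

# Minimal sketch: split a raw completion into (reasoning, final answer).
# The reasoning parsers in SGLang/vLLM do this natively; this is only illustrative.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(completion: str) -> tuple[str, str]:
    match = _THINK_RE.search(completion)
    if match is None:  # thinking disabled, or no reasoning block emitted
        return "", completion.strip()
    reasoning = match.group(1).strip()
    final_answer = completion[match.end():].strip()
    return reasoning, final_answer

reasoning, answer = split_thinking("<think>12 * 12 = 144</think>The result is 144.")
print(answer)  # only this final answer should be appended to multi-turn chat history
```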
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- SGLang: Primary optimized inference engine (supports MTP, tool calling, reasoning parser)
- vLLM: Full support (supports MTP, tool calling, text-only mode, reasoning parser)
- Hugging Face Transformers: Lightweight serving support (continuous batching)
Supported Hardware:
- NVIDIA Ampere: A100
- NVIDIA Blackwell: B100, B200, GB200
- NVIDIA Hopper: H100, H200
Preferred/Supported Operating Systems: Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s)
Qwen3.5-397B-A17B v1.0
Training, Testing, and Evaluation Datasets:
Training Dataset
Data Modality: Image, Text, Video
Training Data Collection: Undisclosed, Automated
Training Labeling: Automated
Training Properties: Trillions of multimodal tokens across image, text, and video domains, covering 201 languages and dialects. Specific dataset names and data licenses are not disclosed.
Testing Dataset
Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed
Evaluation Dataset
Evaluation Data Collection: Public benchmark datasets as listed below
Evaluation Labeling: Ground-truth labels from established benchmark suites
Evaluation Properties: Evaluated in thinking mode with recommended sampling parameters (temperature=0.6, top_p=0.95, top_k=20)
Evaluation Benchmark Scores (Language)
| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| Knowledge | ||||||
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
| Instruction Following | ||||||
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
| Long Context | ||||||
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
| STEM | ||||||
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified | 43.3 | 38.8 | 48 | 37.6 | -- | 37.6 |
| Reasoning | ||||||
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
| General Agent | ||||||
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
| Search Agent | ||||||
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
| Multilingualism | ||||||
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
| Coding Agent | ||||||
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |
* HLE-Verified: a verified and revised version of Humanity's Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
* MCP-Mark: GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
* Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits.
* BrowseComp: we tested two strategies: simple context-folding achieved a score of 69.0, while using the same discard-all strategy as DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.
* WideSearch: we use a 256k context window without any context management.
* MMLU-ProX: we report the average accuracy across 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the average scores across 55 languages using XCOMET-XXL.
* MAXIFE: we report the accuracy on the English and multilingual original prompts (23 settings in total).
* Empty cells (--) indicate scores not yet available or not applicable.
Evaluation Benchmark Scores (Vision-Language)
| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| STEM and Puzzle | ||||||
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| Mathvista(mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
| General VQA | ||||||
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBenchEN-DEV-v1.1 | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
| Text Recognition and Document Understanding | ||||||
| OmniDocBench1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv(RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |
| Spatial Intelligence | ||||||
| ERQA | 59.8 | 46.8 | 70.5 | 52.5 | -- | 67.5 |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| RefCOCO(avg) | -- | -- | 84.1 | 91.1 | 87.8 | 92.3 |
| ODInW13 | -- | -- | 46.3 | 43.2 | -- | 47.0 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| RefSpatialBench | -- | -- | 65.5 | 69.9 | -- | 73.6 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| Hypersim | -- | -- | -- | 11.0 | -- | 12.5 |
| SUNRGBD | -- | -- | -- | 34.9 | -- | 38.3 |
| Nuscene | -- | -- | -- | 13.9 | -- | 16.0 |
| Video Understanding | ||||||
| VideoMME(w sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| VideoMME(w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |
| Visual Agent | ||||||
| ScreenSpot Pro | -- | 45.7 | 72.7 | 62.0 | -- | 65.6 |
| OSWorld-Verified | 38.2 | 66.3 | -- | 38.1 | 63.3 | 62.2 |
| AndroidWorld | -- | -- | -- | 63.7 | -- | 66.8 |
| Medical VQA | ||||||
| SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |
* MathVision: our model's score is evaluated using a fixed prompt, i.e., "Please reason step by step, and put your final answer within \boxed{}." For other models, we report the higher score between runs with and without the \boxed{} formatting.
* BabyVision: our model's score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.
* V*: our model's score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.
* Empty cells (--) indicate scores not yet available or not applicable.
Best Practices
To achieve optimal performance, Qwen recommends the following settings:
- Sampling Parameters:
  - Qwen suggests using `Temperature=0.6`, `TopP=0.95`, `TopK=20`, and `MinP=0` for thinking mode, and `Temperature=0.7`, `TopP=0.8`, `TopK=20`, and `MinP=0` for non-thinking mode (a request sketch applying these settings follows this list).
  - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
- Adequate Output Length: Qwen recommends using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, Qwen suggests setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
- Standardize Output Format: Qwen recommends using prompts to standardize model outputs when benchmarking.
  - Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
  - Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the `answer` field with only the choice letter, e.g., `"answer": "C"`."
- No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final output part and does not need to include the thinking content. This is implemented in the provided Jinja2 chat template. However, for frameworks that do not directly use the Jinja2 chat template, it is up to developers to ensure that this best practice is followed.
- Long Video Understanding: To optimize inference efficiency for plain text and images, the `size` parameter in the released `video_preprocessor_config.json` is conservatively configured. It is recommended to set the `longest_edge` parameter in the video_preprocessor_config file to 469,762,048 (corresponding to 224k video tokens) to enable higher frame-rate sampling for hour-scale videos and thereby achieve superior performance. For example: `{"longest_edge": 469762048, "shortest_edge": 4096}`. Alternatively, override the default values via engine startup parameters. For implementation details, refer to: vLLM / SGLang.
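The sketch below applies the thinking-mode sampling parameters from the list above through an OpenAI-compatible endpoint, such as those exposed by SGLang or vLLM. The base URL and model name are placeholders, and passing `top_k` / `min_p` through `extra_body` is an assumption about the serving stack; adapt it to your deployment.

```python
from openai import OpenAI

# Placeholder endpoint and model name; point these at your SGLang or vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Explain retrieval-augmented generation briefly."}],
    temperature=0.6,        # thinking-mode defaults from the list above
    top_p=0.95,
    max_tokens=32768,       # raise to 81920 for competition-grade math/coding problems
    presence_penalty=0.0,   # increase toward 2 only if endless repetition appears
    extra_body={"top_k": 20, "min_p": 0},  # assumption: server accepts these pass-through fields
)
print(response.choices[0].message.content)
```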
Inference
Acceleration Engine: SGLang
Test Hardware: NVIDIA B200 GPU
Recommended Serving Configuration: TP=8, context length 262,144, reasoning parser qwen3; supports Multi-Token Prediction (MTP) for improved throughput
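As a rough starting point for the configuration above, the sketch below uses vLLM's offline Python API (vLLM is listed as fully supported under Software Integration; SGLang remains the primary engine) with tensor parallelism of 8 and the native 262,144-token window. The Hugging Face model id is a placeholder, and enabling the qwen3 reasoning parser or MTP/speculative decoding requires engine-specific settings not shown here, so treat this as a hedged illustration rather than a verified recipe.

```python
from vllm import LLM, SamplingParams

# Hedged illustration of the recommended serving shape (TP=8, 262,144-token context).
# The model id is a placeholder; reasoning-parser and MTP settings are engine-specific
# and intentionally omitted here.
llm = LLM(
    model="Qwen/Qwen3.5-397B-A17B",   # placeholder Hugging Face id
    tensor_parallel_size=8,
    max_model_len=262144,
)

sampling = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, max_tokens=32768)
outputs = llm.chat(
    [{"role": "user", "content": "Give a one-paragraph overview of Gated DeltaNet."}],
    sampling,
)
print(outputs[0].outputs[0].text)
```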
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please make sure you have proper rights and permissions for all input image and video content; if the image or video includes people, personal health information, or intellectual property, the generated output will not blur or maintain the proportions of the subjects included.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
