qwen / qwen3.5-397b-a17b

Qwen3.5-397B-A17B

Description

Qwen3.5-397B-A17B is a multimodal foundation model featuring a Hybrid Mixture-of-Experts architecture with early fusion vision-language training, designed for state-of-the-art performance across chat, retrieval-augmented generation, vision-language understanding, video understanding, and agentic workflows. Key highlights include:

  • Unified Vision-Language Foundation: Early fusion training on multimodal tokens achieves cross-generational parity with prior text-only models and outperforms dedicated vision-language models across reasoning, coding, agents, and visual understanding benchmarks.
  • Efficient Hybrid Architecture: Gated Delta Networks combined with sparse Mixture-of-Experts deliver high-throughput inference with minimal latency and cost overhead.
  • Scalable RL Generalization: Reinforcement learning scaled across million-agent environments with progressively complex task distributions for robust real-world adaptability.
  • Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.
  • Thinking Mode: Operates in thinking mode by default, generating internal reasoning content (<think>...</think>) before producing final responses, with an option to disable for direct responses.

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration:

This model is not owned or developed by NVIDIA. This model has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Qwen3.5-397B-A17B Model Card.

License and Terms of Use:

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Apache 2.0 License.

Deployment Geography:

Global

Use Case:

Designed for developers and enterprises building multimodal AI applications, including conversational chat, retrieval-augmented generation (RAG), vision-language understanding and reasoning, tool/function calling, and agentic workflows.

Release Date:

Build.NVIDIA.com: 02/16/2026 via link
Hugging Face: 02/16/2026 via link

Model Architecture:

Architecture Type: Transformer (Causal Language Model with Vision Encoder)
Network Architecture: Qwen3.5 (Hybrid Mixture-of-Experts with Gated DeltaNet)
Total Parameters: 397B
Active Parameters: 17B
Vocabulary Size: 248,320
Number of Layers: 60
Hidden Dimension: 4,096
Hidden Layout: 15 × (3 × (Gated DeltaNet → MoE) → 1 × (Gated Attention → MoE))
Gated DeltaNet: 64 linear attention heads for V, 16 for QK; head dimension 128
Gated Attention: 32 heads for Q, 2 for KV; head dimension 256; RoPE dimension 64
Mixture-of-Experts: 512 total experts; 10 routed + 1 shared activated per token; expert intermediate dimension 1,024
Multi-Token Prediction (MTP): Trained with multi-step prediction
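
Reading the hidden layout: each of the 15 blocks stacks three Gated DeltaNet layers and one gated-attention layer, with every layer followed by an MoE feed-forward block. The short sketch below is purely illustrative arithmetic that expands the stated layout and checks it against the layer and expert counts above:

    # Illustrative expansion of the stated hidden layout:
    # 15 x (3 x (Gated DeltaNet -> MoE) -> 1 x (Gated Attention -> MoE)) = 60 layers.
    NUM_BLOCKS = 15
    layout = []
    for _ in range(NUM_BLOCKS):
        layout += ["gated_deltanet"] * 3 + ["gated_attention"]

    assert len(layout) == 60                      # matches "Number of Layers: 60"
    assert layout.count("gated_attention") == 15  # one full-attention layer per block

    # MoE routing as stated above: 10 routed + 1 shared expert activated per token,
    # selected from 512 total experts (expert intermediate dimension 1,024).
    TOTAL_EXPERTS, ACTIVE_EXPERTS = 512, 10 + 1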

Input:

Input Types: Text, Image, Video
Input Formats: Text: String; Image: Red, Green, Blue (RGB); Video: MP4/WebM
Input Parameters: Text: One-Dimensional (1D); Image: Two-Dimensional (2D); Video: Three-Dimensional (3D)
Other Input Properties: Supports text, image, and video inputs via a unified multimodal architecture with early fusion training on multimodal tokens. Combines a Vision Transformer (ViT) encoder with a Hybrid MoE language model featuring Gated DeltaNet layers with global attention and fine-grained MoE routing (10 routed + 1 shared expert out of 512 total).
Input Context Length (ISL): 262,144 tokens natively; extensible up to 1,010,000 tokens via YaRN RoPE scaling
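
Multimodal inputs are typically passed as structured chat messages. Below is a minimal request sketch assuming an OpenAI-compatible endpoint; the base URL, API key handling, image URL, and model identifier are placeholders rather than confirmed values:

    # Hedged sketch: text + image input over an OpenAI-compatible chat endpoint.
    # The base URL, API key, model identifier, and image URL are placeholders.
    import os
    from openai import OpenAI

    client = OpenAI(
        base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
        api_key=os.environ["NVIDIA_API_KEY"],
    )

    response = client.chat.completions.create(
        model="qwen/qwen3.5-397b-a17b",                  # placeholder model identifier
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }],
        temperature=0.6,
        top_p=0.95,          # recommended thinking-mode sampling (see Best Practices)
        max_tokens=32768,    # recommended output length for most queries
    )
    print(response.choices[0].message.content)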

Output:

Output Types: Text
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Output Properties: Generates text responses based on multimodal inputs. Operates in thinking mode by default (internal reasoning via <think>...</think> tags before final response; can be disabled). Natively supports tool/function calling, agentic workflows (Qwen-Agent, MCP servers), and multi-turn conversations.
Output Context Length (OSL): Recommended 32,768 tokens for most queries; up to 81,920 tokens for complex reasoning tasks
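
For the native tool/function calling mentioned above, a hedged sketch over an OpenAI-compatible interface is shown below; the tool schema, endpoint, and model identifier are illustrative assumptions, not values from this card:

    # Hedged sketch: declare one tool and inspect the model's tool call.
    # The tool definition, endpoint, and model identifier are illustrative only.
    import os
    from openai import OpenAI

    client = OpenAI(base_url="https://integrate.api.nvidia.com/v1",  # assumed endpoint
                    api_key=os.environ["NVIDIA_API_KEY"])

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",                       # hypothetical example tool
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="qwen/qwen3.5-397b-a17b",                  # placeholder model identifier
        messages=[{"role": "user", "content": "What's the weather in Berlin right now?"}],
        tools=tools,
    )

    message = response.choices[0].message
    if message.tool_calls:                               # the model chose to call the tool
        call = message.tool_calls[0]
        print(call.function.name, call.function.arguments)
    else:
        print(message.content)                           # direct answer, no tool needed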

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engines:

  • SGLang: Primary optimized inference engine (supports MTP, tool calling, reasoning parser)
  • vLLM: Full support (supports MTP, tool calling, text-only mode, reasoning parser)
  • Hugging Face Transformers: Lightweight serving support (continuous batching)
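
For quick local experimentation with the Transformers path, a minimal text-only sketch is shown below. The checkpoint name and the enable_thinking chat-template flag follow the convention of earlier Qwen3 releases and are assumptions here, not confirmed details of this model:

    # Hedged sketch: local text-only generation with Hugging Face Transformers.
    # Checkpoint name and the enable_thinking flag follow the Qwen3 convention
    # and are assumptions, not confirmed details of this release.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen3.5-397B-A17B"    # placeholder checkpoint name
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

    messages = [{"role": "user", "content": "Summarize gated delta networks in one sentence."}]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        enable_thinking=True,              # set False for direct (non-thinking) responses
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(
        inputs,
        max_new_tokens=1024,               # raise toward 32,768 for complex tasks
        do_sample=True, temperature=0.6, top_p=0.95, top_k=20,
    )
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))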

Supported Hardware:

  • NVIDIA Ampere: A100
  • NVIDIA Blackwell: B100, B200, GB200
  • NVIDIA Hopper: H100, H200

Preferred/Supported Operating Systems: Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s)

Qwen3.5-397B-A17B v1.0

Training, Testing, and Evaluation Datasets:

Training Dataset

Data Modality: Image, Text, Video
Training Data Collection: Undisclosed, Automated
Training Labeling: Automated
Training Properties: Trillions of multimodal tokens across image, text, and video domains, covering 201 languages and dialects. Specific dataset names and data licenses are not disclosed.

Testing Dataset

Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed

Evaluation Dataset

Evaluation Data Collection: Public benchmark datasets as listed below
Evaluation Labeling: Ground-truth labels from established benchmark suites
Evaluation Properties: Evaluated in thinking mode with recommended sampling parameters (temperature=0.6, top_p=0.95, top_k=20)

Evaluation Benchmark Scores (Language)

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-Max-Thinking | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| Knowledge | | | | | | |
| MMLU-Pro | 87.4 | 89.5 | 89.8 | 85.7 | 87.1 | 87.8 |
| MMLU-Redux | 95.0 | 95.6 | 95.9 | 92.8 | 94.5 | 94.9 |
| SuperGPQA | 67.9 | 70.6 | 74.0 | 67.3 | 69.2 | 70.4 |
| C-Eval | 90.5 | 92.2 | 93.4 | 93.7 | 94.0 | 93.0 |
| Instruction Following | | | | | | |
| IFEval | 94.8 | 90.9 | 93.5 | 93.4 | 93.9 | 92.6 |
| IFBench | 75.4 | 58.0 | 70.4 | 70.9 | 70.2 | 76.5 |
| MultiChallenge | 57.9 | 54.2 | 64.2 | 63.3 | 62.7 | 67.6 |
| Long Context | | | | | | |
| AA-LCR | 72.7 | 74.0 | 70.7 | 68.7 | 70.0 | 68.7 |
| LongBench v2 | 54.5 | 64.4 | 68.2 | 60.6 | 61.0 | 63.2 |
| STEM | | | | | | |
| GPQA | 92.4 | 87.0 | 91.9 | 87.4 | 87.6 | 88.4 |
| HLE | 35.5 | 30.8 | 37.5 | 30.2 | 30.1 | 28.7 |
| HLE-Verified¹ | 43.3 | 38.8 | 48 | 37.6 | -- | 37.6 |
| Reasoning | | | | | | |
| LiveCodeBench v6 | 87.7 | 84.8 | 90.7 | 85.9 | 85.0 | 83.6 |
| HMMT Feb 25 | 99.4 | 92.9 | 97.3 | 98.0 | 95.4 | 94.8 |
| HMMT Nov 25 | 100 | 93.3 | 93.3 | 94.7 | 91.1 | 92.7 |
| IMOAnswerBench | 86.3 | 84.0 | 83.3 | 83.9 | 81.8 | 80.9 |
| AIME26 | 96.7 | 93.3 | 90.6 | 93.3 | 93.3 | 91.3 |
| General Agent | | | | | | |
| BFCL-V4 | 63.1 | 77.5 | 72.5 | 67.7 | 68.3 | 72.9 |
| TAU2-Bench | 87.1 | 91.6 | 85.4 | 84.6 | 77.0 | 86.7 |
| VITA-Bench | 38.2 | 56.3 | 51.6 | 40.9 | 41.9 | 49.7 |
| DeepPlanning | 44.6 | 33.9 | 23.3 | 28.7 | 14.5 | 34.3 |
| Tool Decathlon | 43.8 | 43.5 | 36.4 | 18.8 | 27.8 | 38.3 |
| MCP-Mark | 57.5 | 42.3 | 53.9 | 33.5 | 29.5 | 46.1 |
| Search Agent³ | | | | | | |
| HLE w/ tool | 45.5 | 43.4 | 45.8 | 49.8 | 50.2 | 48.3 |
| BrowseComp | 65.8 | 67.8 | 59.2 | 53.9 | --/74.9 | 69.0/78.6 |
| BrowseComp-zh | 76.1 | 62.4 | 66.8 | 60.9 | -- | 70.3 |
| WideSearch | 76.8 | 76.4 | 68.0 | 57.9 | 72.7 | 74.0 |
| Seal-0 | 45.0 | 47.7 | 45.5 | 46.9 | 57.4 | 46.9 |
| Multilingualism | | | | | | |
| MMMLU | 89.5 | 90.1 | 90.6 | 84.4 | 86.0 | 88.5 |
| MMLU-ProX | 83.7 | 85.7 | 87.7 | 78.5 | 82.3 | 84.7 |
| NOVA-63 | 54.6 | 56.7 | 56.7 | 54.2 | 56.0 | 59.1 |
| INCLUDE | 87.5 | 86.2 | 90.5 | 82.3 | 83.3 | 85.6 |
| Global PIQA | 90.9 | 91.6 | 93.2 | 86.0 | 89.3 | 89.8 |
| PolyMATH | 62.5 | 79.0 | 81.6 | 64.7 | 43.1 | 73.3 |
| WMT24++ | 78.8 | 79.7 | 80.7 | 77.6 | 77.6 | 78.9 |
| MAXIFE | 88.4 | 79.2 | 87.5 | 84.0 | 72.8 | 88.2 |
| Coding Agent | | | | | | |
| SWE-bench Verified | 80.0 | 80.9 | 76.2 | 75.3 | 76.8 | 76.4 |
| SWE-bench Multilingual | 72.0 | 77.5 | 65.0 | 66.7 | 73.0 | 69.3 |
| SecCodeBench | 68.7 | 68.6 | 62.4 | 57.5 | 61.3 | 68.3 |
| Terminal Bench 2 | 54.0 | 59.3 | 54.2 | 22.5 | 50.8 | 52.5 |

* HLE-Verified: a verified and revised version of Humanity's Last Exam (HLE), accompanied by a transparent, component-wise verification protocol and a fine-grained error taxonomy. We open-source the dataset at https://huggingface.co/datasets/skylenage/HLE-Verified.
* TAU2-Bench: we follow the official setup except for the airline domain, where all models are evaluated by applying the fixes proposed in the Claude Opus 4.5 system card.
* MCP-Mark: the GitHub MCP server uses v0.30.3 from api.githubcopilot.com; Playwright tool responses are truncated at 32k tokens.
* Search Agent: most search agents built on our model adopt a simple context-folding strategy (256k): once the cumulative Tool Response length reaches a preset threshold, earlier Tool Responses are pruned from the history to keep the context within limits (sketched after these notes).
* BrowseComp: we tested two strategies: simple context-folding achieved a score of 69.0, while the same discard-all strategy used by DeepSeek-V3.2 and Kimi K2.5 achieved 78.6.
* WideSearch: we use a 256k context window without any context management.
* MMLU-ProX: we report the averaged accuracy on 29 languages.
* WMT24++: a harder subset of WMT24 after difficulty labeling and rebalancing; we report the averaged scores on 55 languages using XCOMET-XXL.
* MAXIFE: we report the accuracy on English + multilingual original prompts (totally 23 settings).
* Empty cells (--) indicate scores not yet available or not applicable.
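
The simple context-folding strategy mentioned in the Search Agent note is essentially a history-pruning loop. The sketch below is an illustrative approximation only; the character threshold, placeholder text, and message structure are assumptions, not the evaluated implementation:

    # Hedged sketch of simple context folding for search agents: once the
    # cumulative tool-response length crosses a threshold, the earliest tool
    # responses are pruned from history so the context stays within limits.
    PRUNED = "[earlier tool response pruned]"

    def fold_context(messages, max_tool_chars=200_000):   # threshold is illustrative
        """Return a copy of `messages` with the oldest tool responses pruned."""
        def tool_chars(msgs):
            return sum(len(m.get("content", "")) for m in msgs if m.get("role") == "tool")

        folded = [dict(m) for m in messages]
        while tool_chars(folded) > max_tool_chars:
            idx = next((i for i, m in enumerate(folded)
                        if m.get("role") == "tool" and m.get("content") != PRUNED), None)
            if idx is None:                                # nothing left to prune
                break
            folded[idx]["content"] = PRUNED                # prune the oldest tool response
        return folded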

Evaluation Benchmark Scores (Vision-Language)

| Benchmark | GPT5.2 | Claude 4.5 Opus | Gemini-3 Pro | Qwen3-VL-235B-A22B | K2.5-1T-A32B | Qwen3.5-397B-A17B |
|---|---|---|---|---|---|---|
| STEM and Puzzle | | | | | | |
| MMMU | 86.7 | 80.7 | 87.2 | 80.6 | 84.3 | 85.0 |
| MMMU-Pro | 79.5 | 70.6 | 81.0 | 69.3 | 78.5 | 79.0 |
| MathVision | 83.0 | 74.3 | 86.6 | 74.6 | 84.2 | 88.6 |
| MathVista (mini) | 83.1 | 80.0 | 87.9 | 85.8 | 90.1 | 90.3 |
| We-Math | 79.0 | 70.0 | 86.9 | 74.8 | 84.7 | 87.9 |
| DynaMath | 86.8 | 79.7 | 85.1 | 82.8 | 84.4 | 86.3 |
| ZEROBench | 9 | 3 | 10 | 4 | 9 | 12 |
| ZEROBench_sub | 33.2 | 28.4 | 39.0 | 28.4 | 33.5 | 41.0 |
| BabyVision | 34.4 | 14.2 | 49.7 | 22.2 | 36.5 | 52.3/43.3 |
| General VQA | | | | | | |
| RealWorldQA | 83.3 | 77.0 | 83.3 | 81.3 | 81.0 | 83.9 |
| MMStar | 77.1 | 73.2 | 83.1 | 78.7 | 80.5 | 83.8 |
| HallusionBench | 65.2 | 64.1 | 68.6 | 66.7 | 69.8 | 71.4 |
| MMBench EN-DEV-v1.1 | 88.2 | 89.2 | 93.7 | 89.7 | 94.2 | 93.7 |
| SimpleVQA | 55.8 | 65.7 | 73.2 | 61.3 | 71.2 | 67.1 |
| Text Recognition and Document Understanding | | | | | | |
| OmniDocBench 1.5 | 85.7 | 87.7 | 88.5 | 84.5 | 88.8 | 90.8 |
| CharXiv (RQ) | 82.1 | 68.5 | 81.4 | 66.1 | 77.5 | 80.8 |
| MMLongBench-Doc | -- | 61.9 | 60.5 | 56.2 | 58.5 | 61.5 |
| CC-OCR | 70.3 | 76.9 | 79.0 | 81.5 | 79.7 | 82.0 |
| AI2D_TEST | 92.2 | 87.7 | 94.1 | 89.2 | 90.8 | 93.9 |
| OCRBench | 80.7 | 85.8 | 90.4 | 87.5 | 92.3 | 93.1 |
| Spatial Intelligence | | | | | | |
| ERQA | 59.8 | 46.8 | 70.5 | 52.5 | -- | 67.5 |
| CountBench | 91.9 | 90.6 | 97.3 | 93.7 | 94.1 | 97.2 |
| RefCOCO (avg) | -- | -- | 84.1 | 91.1 | 87.8 | 92.3 |
| ODInW13 | -- | -- | 46.3 | 43.2 | -- | 47.0 |
| EmbSpatialBench | 81.3 | 75.7 | 61.2 | 84.3 | 77.4 | 84.5 |
| RefSpatialBench | -- | -- | 65.5 | 69.9 | -- | 73.6 |
| LingoQA | 68.8 | 78.8 | 72.8 | 66.8 | 68.2 | 81.6 |
| V* | 75.9 | 67.0 | 88.0 | 85.9 | 77.0 | 95.8/91.1 |
| Hypersim | -- | -- | -- | 11.0 | -- | 12.5 |
| SUNRGBD | -- | -- | -- | 34.9 | -- | 38.3 |
| Nuscene | -- | -- | -- | 13.9 | -- | 16.0 |
| Video Understanding | | | | | | |
| VideoMME (w/ sub.) | 86 | 77.6 | 88.4 | 83.8 | 87.4 | 87.5 |
| VideoMME (w/o sub.) | 85.8 | 81.4 | 87.7 | 79.0 | 83.2 | 83.7 |
| VideoMMMU | 85.9 | 84.4 | 87.6 | 80.0 | 86.6 | 84.7 |
| MLVU (M-Avg) | 85.6 | 81.7 | 83.0 | 83.8 | 85.0 | 86.7 |
| MVBench | 78.1 | 67.2 | 74.1 | 75.2 | 73.5 | 77.6 |
| LVBench | 73.7 | 57.3 | 76.2 | 63.6 | 75.9 | 75.5 |
| MMVU | 80.8 | 77.3 | 77.5 | 71.1 | 80.4 | 75.4 |
| Visual Agent | | | | | | |
| ScreenSpot Pro | -- | 45.7 | 72.7 | 62.0 | -- | 65.6 |
| OSWorld-Verified | 38.2 | 66.3 | -- | 38.1 | 63.3 | 62.2 |
| AndroidWorld | -- | -- | -- | 63.7 | -- | 66.8 |
| Medical VQA | | | | | | |
| SLAKE | 76.9 | 76.4 | 81.3 | 54.7 | 81.6 | 79.9 |
| PMC-VQA | 58.9 | 59.9 | 62.3 | 41.2 | 63.3 | 64.2 |
| MedXpertQA-MM | 73.3 | 63.6 | 76.0 | 47.6 | 65.3 | 70.0 |

* MathVision: our model's score is evaluated using a fixed prompt, e.g., "Please reason step by step, and put your final answer within \boxed{}." For other models, we report the higher score between runs with and without the \boxed{} formatting.
* BabyVision: our model's score is reported with CI (Code Interpreter) enabled; without CI, the result is 43.3.
* V*: our model's score is reported with CI (Code Interpreter) enabled; without CI, the result is 91.1.
* Empty cells (--) indicate scores not yet available or not applicable.

Best Practices

To achieve optimal performance, Qwen recommends the following settings:

  1. Sampling Parameters:

    • Qwen suggests using Temperature=0.6, TopP=0.95, TopK=20, and MinP=0 for thinking mode and using Temperature=0.7, TopP=0.8, TopK=20, and MinP=0 for non-thinking mode.
    • For supported frameworks, you can adjust the presence_penalty parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
  2. Adequate Output Length: Qwen recommends using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, Qwen suggests setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.

  3. Standardize Output Format: Qwen recommends using prompts to standardize model outputs when benchmarking.

    • Math Problems: Include "Please reason step by step, and put your final answer within \boxed{}." in the prompt.
    • Multiple-Choice Questions: Add the following JSON structure to the prompt to standardize responses: "Please show your choice in the answer field with only the choice letter, e.g., "answer": "C"."
  4. No Thinking Content in History: In multi-turn conversations, the historical model output should include only the final response, not the thinking content. This is already handled by the provided Jinja2 chat template; for frameworks that do not use the Jinja2 chat template directly, it is up to developers to ensure this practice is followed (see the sketch after this list).

  5. Long Video Understanding: To optimize inference efficiency for plain text and images, the size parameter in the released video_preprocessor_config.json is conservatively configured. It is recommended to set the longest_edge parameter in the video_preprocessor_config file to 469,762,048 (corresponding to 224k video tokens) to enable higher frame-rate sampling for hour-scale videos and thereby achieve superior performance. For example,

    {"longest_edge": 469762048, "shortest_edge": 4096}
    

    Alternatively, override the default values via engine startup parameters. For implementation details, refer to: vLLM / SGLang.
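
Putting practices 1, 2, and 4 together, the sketch below shows one way to apply the recommended thinking-mode sampling parameters, reserve an adequate output budget, and keep thinking content out of the multi-turn history. The endpoint (a local SGLang-style server), model identifier, and use of extra_body for TopK/MinP are assumptions for illustration:

    # Hedged sketch: recommended sampling parameters, output budget, and
    # thinking-content removal from multi-turn history. The endpoint (a local
    # SGLang-style server), model identifier, and extra_body fields are assumptions.
    import re
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

    def strip_thinking(text):
        # Keep only the final answer when appending to history (practice 4).
        return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

    history = [{"role": "user",
                "content": "Please reason step by step, and put your final answer "
                           "within \\boxed{}. What is 37 * 43?"}]

    reply = client.chat.completions.create(
        model="qwen/qwen3.5-397b-a17b",                   # placeholder model identifier
        messages=history,
        temperature=0.6, top_p=0.95,                      # thinking-mode settings (practice 1)
        max_tokens=32768,                                 # adequate output length (practice 2)
        extra_body={"top_k": 20, "min_p": 0},             # engine-specific extras, if supported
    ).choices[0].message.content

    # Only the final output goes back into the conversation history.
    history.append({"role": "assistant", "content": strip_thinking(reply)})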

Inference

Acceleration Engine: SGLang
Test Hardware: NVIDIA B200 GPU
Recommended Serving Configuration: TP=8, context length 262,144, reasoning parser qwen3; supports Multi-Token Prediction (MTP) for improved throughput

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if an image or video includes people, personal health information, or intellectual property, the generated output will not blur or maintain the proportions of the subjects included.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
