qwen / qwen3-235b-a22b

Qwen3-235B-A22B Overview

Description:

Qwen3-235B-A22B is the latest generation of large language models in the Qwen series, which offers a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction following, agent capabilities, and multilingual support. It uniquely supports seamless switching between thinking mode (for complex logical reasoning, math, and coding) and non-thinking mode (for efficient, general-purpose dialogue) within a single model, ensuring optimal performance across scenarios. The model shows significantly enhanced reasoning capabilities, surpassing previous versions on mathematics, code generation, and commonsense logical reasoning. It also demonstrates superior human preference alignment, excelling in creative writing, role-playing, multi-turn dialogue, and instruction following, to deliver a more natural, engaging, and immersive conversational experience. Furthermore, it excels at agent capabilities, enabling precise integration with external tools in both thinking and non-thinking modes, and achieves leading performance among open-source models in complex agent-based tasks. Qwen3-235B-A22B supports over 100 languages and dialects, with strong capabilities for multilingual instruction following and translation.

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. This model has been developed by Qwen (Alibaba Cloud).
It has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA Qwen3-235B-A22B Model Card.

License/Terms of Use:

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License. Additional Information: Apache 2.0.

Deployment Geography:

Global

Use Case:

This model is expected to be used for a wide range of tasks including:

  • Complex logical reasoning, mathematics, and code generation (thinking mode).
  • Efficient, general-purpose dialogue (non-thinking mode).
  • Instruction-following.
  • Agent capabilities requiring integration with external tools.
  • Multilingual applications, including instruction following and translation across 100+ languages.
  • Creative writing and role-playing.
  • Multi-turn dialogues.

Release Date:

May 2025

Reference(s):

  • Qwen3 Blog: https://qwenlm.github.io/blog/qwen3/
  • GitHub Repository: https://github.com/QwenLM/Qwen3
  • Hugging Face Model Card: https://huggingface.co/Qwen/Qwen3-235B-A22B
  • Documentation: https://qwen.readthedocs.io/

Model Architecture:

Architecture Type: Causal Language Model, Mixture-of-Experts (MoE)

Network Architecture: Qwen3

  • This model was developed based on the Qwen series architecture.
  • This model has 235B total parameters and 22B activated parameters.
  • Number of Parameters (Non-Embedding): 234B
  • Number of Layers: 94
  • Number of Attention Heads (GQA): 64 for Query (Q) and 4 for Key/Value (KV)
  • Number of Experts: 128
  • Number of Activated Experts: 8

Input:

Input Type(s): Text

Input Format(s): String

Input Parameters: One-Dimensional (1D) for text.

Other Properties Related to Input:

  • Context Length: 32,768 tokens natively. Can be extended to 131,072 tokens with YaRN.
  • Pre-Processing Needed: Tokenization using the Qwen3 tokenizer. Input can be formatted for "thinking mode" (enable_thinking=True) or "non-thinking mode" (enable_thinking=False); the chat template defaults to enable_thinking=True. An illustrative sketch follows.
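
As an illustration of the pre-processing step above, the minimal sketch below toggles thinking mode through the Hugging Face chat template. It assumes the public Qwen/Qwen3-235B-A22B checkpoint and Transformers >= 4.51.0; the example prompt is arbitrary.

```python
from transformers import AutoTokenizer

# Assumes the public Hugging Face checkpoint; substitute a local path if needed.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-235B-A22B")

messages = [{"role": "user", "content": "Explain gradient checkpointing in one paragraph."}]

# Thinking mode (the default): the template reserves a <think>...</think> block
# for internal reasoning before the final answer.
thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: efficient, general-purpose dialogue without the reasoning block.
plain_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

print(thinking_prompt)
print(plain_prompt)
```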

Output:

Output Type(s): Text

Output Format: String

Output Parameters: One-Dimensional (1D) for text.

Other Properties Related to Output:

  • Recommended output length: 32,768 tokens for most queries.
  • For benchmarking on highly complex problems (e.g., math, programming competitions), a max output length of 38,912 tokens is suggested.
  • Post-Processing Needed: Detokenization. Parsing of thinking content might be needed if enable_thinking=True.
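
Where parsing of thinking content is needed, one possible post-processing sketch is shown below. It assumes the decoded text delimits the reasoning block with a literal </think> tag, as produced by the Qwen3 chat template; the helper name is ours, not part of any library.

```python
def split_thinking(decoded_text: str) -> tuple[str, str]:
    """Split decoded output into (thinking_content, final_answer).

    Assumes thinking-mode output of the form '<think>...</think> answer'.
    If no closing tag is present (e.g., non-thinking mode), the whole text
    is treated as the final answer.
    """
    marker = "</think>"
    if marker not in decoded_text:
        return "", decoded_text.strip()
    thinking, _, answer = decoded_text.partition(marker)
    return thinking.replace("<think>", "", 1).strip(), answer.strip()


# Hypothetical decoded completion, for illustration only.
raw = "<think>Count the letters one by one.</think>The word contains three r's."
thinking, answer = split_thinking(raw)
print(answer)  # -> The word contains three r's.
```

In multi-turn use, only the final answer (not the thinking content) should be carried back into the conversation history, so a helper like this can also be applied when assembling the next turn's messages.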

Software Integration:

Runtime Engine(s):

  • Hugging Face Transformers (version >= 4.51.0 recommended)
  • vLLM (version >= 0.8.5 for OpenAI-compatible API endpoint with reasoning parser)
  • SGLang (version >= 0.4.6.post1 for OpenAI-compatible API endpoint with reasoning parser)
  • Other platforms supporting Qwen3 include: Ollama, LMStudio, MLX-LM, llama.cpp, KTransformers.
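
For the OpenAI-compatible endpoints exposed by vLLM or SGLang, a hedged client-side sketch follows. The base URL, API key, and served model name are assumptions that depend on how the server was launched, and the sampling values mirror the thinking-mode recommendations listed under Additional Content below.

```python
from openai import OpenAI

# Assumes a locally launched vLLM or SGLang server with an OpenAI-compatible API;
# adjust the base URL, API key, and model name to match your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-235B-A22B",
    messages=[{"role": "user", "content": "Summarize mixture-of-experts routing in two sentences."}],
    temperature=0.6,   # recommended for thinking mode
    top_p=0.95,
    max_tokens=32768,  # recommended output budget for most queries
    # Engine-specific sampling extensions; supported keys vary by server and version.
    extra_body={"top_k": 20, "min_p": 0},
)

print(response.choices[0].message.content)
```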

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere or newer GPUs are highly recommended for optimal performance due to the model size.

    - This model can be run on various GPU setups, potentially including NVIDIA Hopper, NVIDIA Blackwell, NVIDIA Lovelace, NVIDIA Turing, and NVIDIA Volta, depending on the quantization (e.g., the FP8 variant, Qwen3-235B-A22B-FP8) and the distributed inference setup. Specific requirements depend on the inference framework and the desired throughput/latency.

Supported Operating System(s):

  • Linux
  • Windows (via Python environments)
  • macOS (via Python environments)

    (Generally OS-independent, provided the Python environment and the necessary GPU drivers are supported.)

Model Version(s):

Qwen3-235B-A22B v1.0

Training, Testing, and Evaluation Datasets:

Training Dataset:

Data Collection Method by dataset: Undisclosed

Labeling Method by dataset: Undisclosed

Properties: Undisclosed

Testing Dataset:

Data Collection Method by dataset: Undisclosed

Labeling Method by dataset: Undisclosed

Properties: Undisclosed

Evaluation Benchmark Results:

Data Collection Method by dataset: Undisclosed

Labeling Method by dataset: Undisclosed

Properties: Undisclosed

For more details, including benchmark evaluations, hardware requirements, and inference performance, please refer to the Qwen3 blog, GitHub repository, and documentation.

Inference:

Engine: SGLang

Test Hardware:

  • NVIDIA Hopper

Additional Content:

  • Qwen3 Highlights: Qwen3 is the latest generation of LLMs in the Qwen series, offering significant advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
  • Thinking Mode: The model supports a "thinking mode" (enable_thinking=True) for complex tasks, which enhances response quality through internal reasoning steps. It can be switched to a "non-thinking mode" (enable_thinking=False) for general dialogue.
  • Quickstart: Code snippets are available on the Hugging Face model card to demonstrate how to load the tokenizer and model, prepare inputs, and generate text; an end-to-end sketch also appears after this list.
  • Context Extension (YaRN): The model supports YaRN (Yet another RoPE extensioN method) to extend its context window from 32,768 tokens to 131,072 tokens. Configuration details are provided in the model card, and the sketch after this list shows one load-time option.
  • Recommended Sampling Parameters:
    • For thinking mode (enable_thinking=True): Temperature=0.6, TopP=0.95, TopK=20, MinP=0. Greedy decoding is not recommended.
    • For non-thinking mode (enable_thinking=False): Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
    • presence_penalty can be adjusted (0 to 2) to reduce repetition.
  • Output Length: Recommended output length is 32,768 tokens, extendable to 38,912 tokens for complex problems.
  • Standardized Output Format for Benchmarking: Prompts can be used to standardize model outputs, e.g., "Please reason step by step, and put your final answer within \boxed{}.", or using JSON for multiple-choice questions.
  • Multi-turn Conversation History: In multi-turn conversations, the historical model output should only include the final output part, not the thinking content. This is handled by the provided Jinja2 chat template.
  • Qwen-Agent: Qwen-Agent capabilities are integrated, simplifying tool-calling and integration with external tools.
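
To tie the Quickstart, thinking-mode, and YaRN notes above together, a minimal end-to-end sketch is given below. It assumes the Hugging Face Transformers path (version >= 4.51.0), the public checkpoint name, and sufficient GPU memory (or a multi-GPU device_map) for a 235B-parameter MoE model; the commented rope_scaling override shows one way to enable YaRN, and its key names may differ slightly across Transformers versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-235B-A22B"
tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    # Optional YaRN context extension (32,768 -> 131,072 tokens).
    # factor = 131072 / 32768 = 4.0; enable only when long inputs are required.
    # rope_scaling={"rope_type": "yarn", "factor": 4.0,
    #               "original_max_position_embeddings": 32768},
)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for non-thinking mode
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Recommended thinking-mode sampling: temperature 0.6, top_p 0.95, top_k 20.
outputs = model.generate(
    **inputs,
    max_new_tokens=32768,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
)
completion = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
)
print(completion)
```

Per the model card's guidance, static YaRN scaling is best left disabled for short inputs, since it can slightly affect quality on shorter texts.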

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.