GLM-4.7
Description
GLM-4.7 is a large language model developed by Z.ai (formerly THUDM/Zhipu AI) optimized for coding, reasoning, and tool use. It features significant improvements in multilingual agentic coding, terminal-based tasks, UI generation, and complex mathematical reasoning compared to its predecessor GLM-4.6. The model introduces Interleaved Thinking, Preserved Thinking, and Turn-level Thinking capabilities for more stable and controllable complex task execution.
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration:
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA GLM-4.7 Model Card
License and Terms of Use:
GOVERNING TERMS: The trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License. Additional Information: MIT.
Deployment Geography:
Global
Use Case:
Use Case: Developers and researchers can use GLM-4.7 for coding assistance, agentic workflows, terminal-based automation, mathematical reasoning, and general text generation tasks including chat, creative writing, and role-play scenarios.
Release Date:
Build.NVIDIA.com: 01/2026 via link
Huggingface: 12/22/2025 via link
Reference(s):
Model Architecture:
Architecture Type: Transformer
Network Architecture: GLM (General Language Model)
Total Parameters: 358B
Base Model: GLM-4.5/GLM-4.6
Input:
Input Types: Text
Input Formats: String
Input Parameters: One Dimensional (1D)
Other Input Properties: Supports multi-turn conversations, tool calling, and system prompts (see the message-layout example below).
Input Context Length (ISL): 131,072 tokens
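For illustration, multi-turn input with a system prompt follows the standard chat-message layout. The sketch below shows only the message structure; the contents are made up:

```python
# Minimal sketch of the chat input structure: a system prompt followed by
# alternating user/assistant turns. Contents are illustrative only.
messages = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Reverse a string in Python."},
    {"role": "assistant", "content": "Use slicing: s[::-1]."},
    {"role": "user", "content": "Now reverse the order of its words."},
]
```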
Output:
Output Types: Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Output Properties: Supports streaming, structured output, and reasoning traces.
Output Context Length (OSL): 131,072 tokens
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- vLLM: nightly
- SGLang: dev
- Transformers: 4.57.3+
Supported Hardware:
- NVIDIA Ampere: A100
- NVIDIA Hopper: H100
Operating Systems: Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s)
GLM-4 v4.7
Training, Testing, and Evaluation Datasets:
Training Dataset
Data Modality: Text (English, Chinese)
Training Data Collection: Undisclosed
Training Labeling: Undisclosed
Training Properties: Undisclosed
Testing Dataset
Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed
Evaluation Dataset
Evaluation Data Collection: Automated
Evaluation Labeling: Hybrid: Human, Automated
Evaluation Properties: Benchmark testing conditions: Multi-domain evaluation covering reasoning (8 benchmarks), coding (5 benchmarks), and agent tasks (4 benchmarks). Standard industry benchmarks with comparable methodology across models.
Evaluation Benchmark Score: GLM-4.7 demonstrates strong performance across 17 benchmarks spanning reasoning (8), coding (5), and agent tasks (4). Key highlights: AIME 2025 (95.7%), HMMT Feb. 2025 (97.1%), GPQA-Diamond (85.7%), LiveCodeBench-v6 (84.9%), τ²-Bench (87.4%).
Detailed Benchmark Comparison Table
| Benchmark | GLM-4.7 | GLM-4.6 | Kimi K2 Thinking | DeepSeek-V3.2 | Gemini 3.0 Pro | Claude Sonnet 4.5 | GPT-5-High | GPT-5.1-High |
|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 84.3 | 83.2 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 | 87.0 |
| GPQA-Diamond | 85.7 | 81.0 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 | 88.1 |
| HLE | 24.8 | 17.2 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 | 25.7 |
| HLE (w/ Tools) | 42.8 | 30.4 | 44.9 | 40.8 | 45.8 | 32.0 | 35.2 | 42.7 |
| AIME 2025 | 95.7 | 93.9 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 | 94.0 |
| HMMT Feb. 2025 | 97.1 | 89.2 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 | 96.3 |
| HMMT Nov. 2025 | 93.5 | 87.7 | 89.2 | 90.2 | 93.3 | 81.7 | 89.2 | - |
| IMOAnswerBench | 82.0 | 73.5 | 78.6 | 78.3 | 83.3 | 65.8 | 76.0 | - |
| LiveCodeBench-v6 | 84.9 | 82.8 | 83.1 | 83.3 | 90.7 | 64.0 | 87.0 | 87.0 |
| SWE-bench Verified | 73.8 | 68.0 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 | 76.3 |
| SWE-bench Multilingual | 66.7 | 53.8 | 61.1 | 70.2 | - | 68.0 | 55.3 | - |
| Terminal Bench Hard | 33.3 | 23.6 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 | 43.0 |
| Terminal Bench 2.0 | 41.0 | 24.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 | 47.6 |
| BrowseComp | 52.0 | 45.1 | - | 51.4 | - | 24.1 | 54.9 | 50.8 |
| BrowseComp (w/ Context Manage) | 67.5 | 57.5 | 60.2 | 67.6 | 59.2 | - | - | - |
| BrowseComp-Zh | 66.6 | 49.5 | 62.3 | 65.0 | - | 42.4 | 63.0 | - |
| τ²-Bench | 87.4 | 75.2 | 74.3 | 85.3 | 90.7 | 87.2 | 82.4 | 82.7 |
Inference
Acceleration Engine: SGLang
Test Hardware: NVIDIA H100x8
Additional Details
Key Features
- Interleaved Thinking: The model thinks before every response and tool call, improving instruction following and generation quality.
- Preserved Thinking: In coding-agent scenarios, the model retains thinking blocks across multi-turn conversations, reducing information loss.
- Turn-level Thinking: Reasoning can be controlled per turn; disable it for lightweight requests and enable it for complex tasks (see the sketch below).
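As a rough sketch of turn-level control, the example below toggles thinking per request through the OpenAI-compatible API that vLLM and SGLang expose (see Deployment Examples). The endpoint URL, model name, and the pass-through of `chat_template_kwargs` via `extra_body` follow vLLM's OpenAI-compatible server conventions and are assumptions, not part of this card; `reasoning_content` is only populated when a reasoning parser is configured.

```python
# Hypothetical sketch: per-turn thinking control against a locally served
# OpenAI-compatible endpoint (URL/port are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(prompt: str, think: bool) -> None:
    response = client.chat.completions.create(
        model="zai-org/GLM-4.7-FP8",
        messages=[{"role": "user", "content": prompt}],
        # The serving engine forwards chat_template_kwargs to the chat
        # template; enable_thinking switches reasoning for this turn only.
        extra_body={"chat_template_kwargs": {"enable_thinking": think}},
    )
    message = response.choices[0].message
    # reasoning_content is present only when a reasoning parser is active.
    print(getattr(message, "reasoning_content", None))
    print(message.content)

ask("What is 2 + 2?", think=False)                     # lightweight request
ask("Prove that sqrt(2) is irrational.", think=True)  # complex task
```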
Recommended Inference Settings
| Task Type | Temperature | Top-p | Max Tokens |
|---|---|---|---|
| Default | 1.0 | 0.95 | 131,072 |
| SWE-bench/Terminal | 0.7 | 1.0 | 16,384 |
| τ²-Bench | 0 | - | 16,384 |
For τ²-Bench evaluation, Z.ai added an additional prompt to the Retail and Telecom user-interaction setup to avoid failure modes caused by users ending the interaction incorrectly. For the Airline domain, they applied the domain fixes proposed in the Claude Opus 4.5 release report.
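The recommended settings above map directly onto standard sampling parameters of an OpenAI-compatible client. The sketch below is illustrative; the endpoint URL and model name mirror the deployment examples later in this card and are assumptions for your own setup.

```python
# Hypothetical sketch: applying the recommended sampling settings per task type.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Values taken from the table above.
SETTINGS = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 131072},
    "swe_terminal": {"temperature": 0.7, "top_p": 1.0, "max_tokens": 16384},
    "tau2_bench": {"temperature": 0.0, "max_tokens": 16384},
}

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Fix the failing unit test."}],
    **SETTINGS["swe_terminal"],
)
print(response.choices[0].message.content)
```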
Deployment Examples
Serve GLM-4.7 Locally:
For local deployment, GLM-4.7 supports inference frameworks including vLLM and SGLang. Comprehensive deployment instructions are available in the official GitHub repository.
vLLM and SGLang support GLM-4.7 only on their main branches; you can use their official Docker images for inference.
vLLM:

```shell
vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice
```
SGLang:

```shell
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-FP8 \
    --tp-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45
```
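Once a server is up, a quick smoke test against its OpenAI-compatible endpoint confirms the deployment. The port below is an assumption: SGLang defaults to 30000 and vLLM to 8000, so adjust it to match your launch flags.

```python
# Hypothetical smoke test against a locally launched SGLang server
# (port 30000 is SGLang's default; use 8000 for a default vLLM launch).
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "zai-org/GLM-4.7-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```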
transformers:
Install transformers 4.57.3 or later, then run:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "zai-org/GLM-4.7"

messages = [{"role": "user", "content": "hello"}]

# Build tokenized inputs from the chat template, appending the generation prompt.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

# Load the model in bfloat16 and shard it across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
```
Parameter Instructions
- For agentic tasks with GLM-4.7, turn on Preserved Thinking mode by adding the following config (currently only SGLang supports it): `"chat_template_kwargs": {"enable_thinking": true, "clear_thinking": false}`
- When using vLLM or SGLang, thinking mode is enabled by default for incoming requests. To disable it, add the `"chat_template_kwargs": {"enable_thinking": false}` parameter.
- Both engines support tool calling. Use the OpenAI-style tool description format, as sketched below.
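A rough sketch of that OpenAI-style tool format follows; the tool definition and endpoint details are illustrative assumptions, not part of this card.

```python
# Hypothetical tool-calling request using the OpenAI-style tool schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# An illustrative tool definition; the function and its schema are made up.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    # Preserved Thinking for agentic runs (SGLang only, per the notes above).
    extra_body={"chat_template_kwargs": {"enable_thinking": True, "clear_thinking": False}},
)
print(response.choices[0].message.tool_calls)
```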
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
