GLM-4.7
Description
GLM-4.7 is a large language model developed by Z.ai (formerly THUDM/Zhipu AI) optimized for coding, reasoning, and tool use. It features significant improvements in multilingual agentic coding, terminal-based tasks, UI generation, and complex mathematical reasoning compared to its predecessor GLM-4.6. The model introduces Interleaved Thinking, Preserved Thinking, and Turn-level Thinking capabilities for more stable and controllable complex task execution.
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration:
This model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA GLM-4.7 Model Card
License and Terms of Use:
GOVERNING TERMS: The trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License. Additional Information: MIT.
Deployment Geography:
Global
Use Case:
Use Case: Developers and researchers can use GLM-4.7 for coding assistance, agentic workflows, terminal-based automation, mathematical reasoning, and general text generation tasks including chat, creative writing, and role-play scenarios.
Release Date:
Build.NVIDIA.com: 01/2026 via link
Huggingface: 12/22/2025 via link
Reference(s):
Model Architecture:
Architecture Type: Transformer
Network Architecture: GLM (General Language Model)
Total Parameters: 358B
Base Model: GLM-4.5/GLM-4.6
Input:
Input Types: Text
Input Formats: String
Input Parameters: One Dimensional (1D)
Other Input Properties: Supports multi-turn conversations, tool calling, and system prompts (see the message-layout example below).
Input Context Length (ISL): 131,072 tokens
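For illustration, multi-turn input with a system prompt follows the standard chat-message layout. The sketch below shows only the message structure; the contents are made up:

```python
# Minimal sketch of the chat input structure: a system prompt followed by
# alternating user/assistant turns. Contents are illustrative only.
messages = [
    {"role": "system", "content": "You are a concise coding assistant."},
    {"role": "user", "content": "Reverse a string in Python."},
    {"role": "assistant", "content": "Use slicing: s[::-1]."},
    {"role": "user", "content": "Now reverse the order of its words."},
]
```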
Output:
Output Types: Text
Output Format: String
Output Parameters: One Dimensional (1D)
Other Output Properties: Supports streaming, structured output, and reasoning traces.
Output Context Length (OSL): 131,072 tokens
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- vLLM: nightly
- SGLang: dev
- Transformers: 4.57.3+
Supported Hardware:
- NVIDIA Ampere: A100
- NVIDIA Hopper: H100
Operating Systems: Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s)
GLM-4 v4.7
Training, Testing, and Evaluation Datasets:
Training Dataset
Data Modality: Text (English, Chinese)
Training Data Collection: Undisclosed
Training Labeling: Undisclosed
Training Properties: Undisclosed
Testing Dataset
Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed
Evaluation Dataset
Evaluation Data Collection: Automated
Evaluation Labeling: Hybrid: Human, Automated
Evaluation Properties: Benchmark testing conditions: Multi-domain evaluation covering reasoning (8 benchmarks), coding (5 benchmarks), and agent tasks (4 benchmarks). Standard industry benchmarks with comparable methodology across models.
Evaluation Benchmark Score: GLM-4.7 demonstrates strong performance across 17 benchmarks spanning reasoning (8), coding (5), and agent tasks (4). Key highlights: AIME 2025 (95.7%), HMMT Feb. 2025 (97.1%), GPQA-Diamond (85.7%), LiveCodeBench-v6 (84.9%), τ²-Bench (87.4%).
Detailed Benchmark Comparison Table
| Benchmark | GLM-4.7 | GLM-4.6 | Kimi K2 Thinking | DeepSeek-V3.2 | Gemini 3.0 Pro | Claude Sonnet 4.5 | GPT-5-High | GPT-5.1-High |
|---|---|---|---|---|---|---|---|---|
| MMLU-Pro | 84.3 | 83.2 | 84.6 | 85.0 | 90.1 | 88.2 | 87.5 | 87.0 |
| GPQA-Diamond | 85.7 | 81.0 | 84.5 | 82.4 | 91.9 | 83.4 | 85.7 | 88.1 |
| HLE | 24.8 | 17.2 | 23.9 | 25.1 | 37.5 | 13.7 | 26.3 | 25.7 |
| HLE (w/ Tools) | 42.8 | 30.4 | 44.9 | 40.8 | 45.8 | 32.0 | 35.2 | 42.7 |
| AIME 2025 | 95.7 | 93.9 | 94.5 | 93.1 | 95.0 | 87.0 | 94.6 | 94.0 |
| HMMT Feb. 2025 | 97.1 | 89.2 | 89.4 | 92.5 | 97.5 | 79.2 | 88.3 | 96.3 |
| HMMT Nov. 2025 | 93.5 | 87.7 | 89.2 | 90.2 | 93.3 | 81.7 | 89.2 | - |
| IMOAnswerBench | 82.0 | 73.5 | 78.6 | 78.3 | 83.3 | 65.8 | 76.0 | - |
| LiveCodeBench-v6 | 84.9 | 82.8 | 83.1 | 83.3 | 90.7 | 64.0 | 87.0 | 87.0 |
| SWE-bench Verified | 73.8 | 68.0 | 71.3 | 73.1 | 76.2 | 77.2 | 74.9 | 76.3 |
| SWE-bench Multilingual | 66.7 | 53.8 | 61.1 | 70.2 | - | 68.0 | 55.3 | - |
| Terminal Bench Hard | 33.3 | 23.6 | 30.6 | 35.4 | 39.0 | 33.3 | 30.5 | 43.0 |
| Terminal Bench 2.0 | 41.0 | 24.5 | 35.7 | 46.4 | 54.2 | 42.8 | 35.2 | 47.6 |
| BrowseComp | 52.0 | 45.1 | - | 51.4 | - | 24.1 | 54.9 | 50.8 |
| BrowseComp (w/ Context Manage) | 67.5 | 57.5 | 60.2 | 67.6 | 59.2 | - | - | - |
| BrowseComp-Zh | 66.6 | 49.5 | 62.3 | 65.0 | - | 42.4 | 63.0 | - |
| τ²-Bench | 87.4 | 75.2 | 74.3 | 85.3 | 90.7 | 87.2 | 82.4 | 82.7 |
Inference
Acceleration Engine: SGLang
Test Hardware: NVIDIA H100x8
Additional Details
Key Features
- Interleaved Thinking: The model thinks before every response and tool call, improving instruction following and generation quality.
- Preserved Thinking: In coding-agent scenarios, the model retains thinking blocks across multi-turn conversations, reducing information loss.
- Turn-level Thinking: Reasoning can be controlled per turn; disable it for lightweight requests and enable it for complex tasks (see the sketch below).
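As a rough sketch of turn-level control, the example below toggles thinking per request through the OpenAI-compatible API that vLLM and SGLang expose (see Deployment Examples). The endpoint URL, model name, and the pass-through of `chat_template_kwargs` via `extra_body` follow vLLM's OpenAI-compatible server conventions and are assumptions, not part of this card; `reasoning_content` is only populated when a reasoning parser is configured.

```python
# Hypothetical sketch: per-turn thinking control against a locally served
# OpenAI-compatible endpoint (URL/port are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(prompt: str, think: bool) -> None:
    response = client.chat.completions.create(
        model="zai-org/GLM-4.7-FP8",
        messages=[{"role": "user", "content": prompt}],
        # The serving engine forwards chat_template_kwargs to the chat
        # template; enable_thinking switches reasoning for this turn only.
        extra_body={"chat_template_kwargs": {"enable_thinking": think}},
    )
    message = response.choices[0].message
    # reasoning_content is present only when a reasoning parser is active.
    print(getattr(message, "reasoning_content", None))
    print(message.content)

ask("What is 2 + 2?", think=False)                     # lightweight request
ask("Prove that sqrt(2) is irrational.", think=True)  # complex task
```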
Recommended Inference Settings
| Task Type | Temperature | Top-p | Max Tokens |
|---|---|---|---|
| Default | 1.0 | 0.95 | 131,072 |
| SWE-bench/Terminal | 0.7 | 1.0 | 16,384 |
| τ²-Bench | 0 | - | 16,384 |
For τ²-Bench evaluation, Z.ai added an additional prompt to the Retail and Telecom user-interaction setup to avoid failure modes caused by users ending the interaction incorrectly. For the Airline domain, they applied the domain fixes proposed in the Claude Opus 4.5 release report.
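The recommended settings above map directly onto standard sampling parameters of an OpenAI-compatible client. The sketch below is illustrative; the endpoint URL and model name mirror the deployment examples later in this card and are assumptions for your own setup.

```python
# Hypothetical sketch: applying the recommended sampling settings per task type.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Values taken from the table above.
SETTINGS = {
    "default": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 131072},
    "swe_terminal": {"temperature": 0.7, "top_p": 1.0, "max_tokens": 16384},
    "tau2_bench": {"temperature": 0.0, "max_tokens": 16384},
}

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "Fix the failing unit test."}],
    **SETTINGS["swe_terminal"],
)
print(response.choices[0].message.content)
```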
Deployment Examples
Serve GLM-4.7 Locally:
For local deployment, GLM-4.7 supports inference frameworks including vLLM and SGLang. Comprehensive deployment instructions are available in the official GitHub repository.
vLLM and SGLang support GLM-4.7 only on their main branches; you can use their official Docker images for inference.
vLLM:

```shell
vllm serve zai-org/GLM-4.7-FP8 \
    --tensor-parallel-size 4 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --enable-auto-tool-choice
```
SGLang:

```shell
python3 -m sglang.launch_server \
    --model-path zai-org/GLM-4.7-FP8 \
    --tp-size 8 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45
```
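Once a server is up, a quick smoke test against its OpenAI-compatible endpoint confirms the deployment. The port below is an assumption: SGLang defaults to 30000 and vLLM to 8000, so adjust it to match your launch flags.

```python
# Hypothetical smoke test against a locally launched SGLang server
# (port 30000 is SGLang's default; use 8000 for a default vLLM launch).
import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "zai-org/GLM-4.7-FP8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```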
transformers:
Install transformers 4.57.3 or later, then run:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "zai-org/GLM-4.7"

messages = [{"role": "user", "content": "hello"}]

# Build tokenized inputs from the chat template, appending the generation prompt.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
)

# Load the model in bfloat16 and shard it across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
output_text = tokenizer.decode(generated_ids[0][inputs.input_ids.shape[1]:])
print(output_text)
```
Parameter Instructions
- For agentic tasks with GLM-4.7, turn on Preserved Thinking mode by adding the following config (currently only SGLang supports it): `"chat_template_kwargs": {"enable_thinking": true, "clear_thinking": false}`
- When using vLLM or SGLang, thinking mode is enabled by default for incoming requests. To disable it, add the `"chat_template_kwargs": {"enable_thinking": false}` parameter.
- Both engines support tool calling. Use the OpenAI-style tool description format, as sketched below.
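A rough sketch of that OpenAI-style tool format follows; the tool definition and endpoint details are illustrative assumptions, not part of this card.

```python
# Hypothetical tool-calling request using the OpenAI-style tool schema.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# An illustrative tool definition; the function and its schema are made up.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-FP8",
    messages=[{"role": "user", "content": "What's the weather in Beijing?"}],
    tools=tools,
    # Preserved Thinking for agentic runs (SGLang only, per the notes above).
    extra_body={"chat_template_kwargs": {"enable_thinking": True, "clear_thinking": False}},
)
print(response.choices[0].message.tool_calls)
```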
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
