Kimi-K2-Thinking
Description
Kimi K2 Thinking is the most capable open-source thinking model. Built as a thinking agent, it reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, Kimi K2 Thinking is a native INT4 quantization model with a 256K context window, achieving lossless reductions in inference latency and GPU memory usage.
This model is ready for commercial/non-commercial use.
Third-Party Community Consideration:
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Kimi-K2-Thinking Model Card.
License and Terms of Use:
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Modified MIT License.
Deployment Geography:
Global
Use Case:
This model is designed for advanced reasoning, agentic AI with deep thinking capabilities, multi-step problem-solving with tool orchestration, complex mathematical reasoning, coding with autonomous workflows, and research tasks requiring long-horizon agency. It can be used for autonomous research workflows, complex coding projects spanning hundreds of steps, mathematical problem-solving with extended reasoning, web browsing and information synthesis, and tool-orchestrated task execution.
Key Features
- Deep Thinking & Tool Orchestration: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift.
- Native INT4 Quantization: Quantization-Aware Training (QAT) is employed in the post-training stage, achieving a lossless 2x speed-up in low-latency mode.
- Stable Long-Horizon Agency: Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
Release Date:
- build.nvidia.com (12/08/2025): Available via link
- Hugging Face: Available via link
Reference(s):
Model Architecture:
- Architecture Type: Transformer
- Network Architecture: Mixture-of-Experts (MoE)
- Total Parameters: 1T
- Activated Parameters: 32B
- Number of Layers (Dense layer included): 61
- Number of Dense Layers: 1
- Attention Hidden Dimension: 7168
- MoE Hidden Dimension (per Expert): 2048
- Number of Attention Heads: 64
- Number of Experts: 384
- Selected Experts per Token: 8
- Number of Shared Experts: 1
- Vocabulary Size: 160K
- Context Length: 256K
- Attention Mechanism: MLA
- Activation Function: SwiGLU
- Base Model: Kimi-K2-Base
Input:
Input Types: Text, Tool Definitions
Input Formats: String, JSON
Input Parameters: One-Dimensional (1D)
Other Input Properties: Context window of up to 256K tokens; supports interleaved reasoning traces and tool calls.
Input Context Length (ISL): 256K
Output:
Output Formats: String, JSON (for tool calls)
Output Parameters: One-Dimensional (1D)
Other Output Properties: Includes separate reasoning_content traces alongside final responses. Supports streaming and non-streaming modes.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engines:
- vLLM
- SGLang
- KTransformers
Supported Hardware:
- NVIDIA Blackwell: B100, B200, GB200
- NVIDIA Hopper: H100, H200
Operating Systems: Linux
Model Version(s):
Kimi K2 Thinking v1.0
Training, Testing, and Evaluation Datasets:
Training Dataset
Training Data Collection: Undisclosed
Training Labeling: Undisclosed
Training Properties: Trained with Quantization-Aware Training (QAT) during the post-training phase for native INT4 support.
Testing Dataset
Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed
Evaluation Dataset
Evaluation Benchmark Score:
- HLE (Text-only) with tools: 44.9%
- HLE Heavy Mode: 51.0%
- AIME25 with python: 99.1%
- AIME25 Heavy Mode: 100.0%
- HMMT25 with python: 95.1%
- HMMT25 Heavy Mode: 97.5%
- GPQA (no tools): 84.5%
- BrowseComp with tools: 60.2%
- BrowseComp-ZH with tools: 62.3%
- SWE-bench Verified with tools: 71.3%
- MMLU-Pro (no tools): 84.6%
- LiveCodeBench: 64.8% Pass@1
Evaluation Data Collection: Hybrid: Human, Automated
Evaluation Labeling: Human
Evaluation Properties: HLE, AIME25, HMMT25, IMO-AnswerBench, GPQA, MMLU-Pro, MMLU-Redux, Longform Writing, HealthBench, BrowseComp, BrowseComp-ZH, Seal-0, FinSearchComp-T3, Frames, SWE-bench Verified, SWE-bench Multilingual, Multi-SWE-bench, SciCode, LiveCodeBench, Terminal-Bench
Evaluation Results
| Benchmark | Setting | Kimi K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|---|---|---|---|---|---|---|---|
| Reasoning Tasks | | | | | | | |
| HLE (Text-only) | no tools | 23.9 | 26.3 | 19.8* | 7.9 | 19.8 | 25.4 |
| HLE | w/ tools | 44.9 | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
| HLE | heavy | 51.0 | 42.0 | - | - | - | 50.7 |
| AIME25 | no tools | 94.5 | 94.6 | 87.0 | 51.0 | 89.3 | 91.7 |
| AIME25 | w/ python | 99.1 | 99.6 | 100.0 | 75.2 | 58.1* | 98.8 |
| AIME25 | heavy | 100.0 | 100.0 | - | - | - | 100.0 |
| HMMT25 | no tools | 89.4 | 93.3 | 74.6* | 38.8 | 83.6 | 90.0 |
| HMMT25 | w/ python | 95.1 | 96.7 | 88.8* | 70.4 | 49.5* | 93.9 |
| HMMT25 | heavy | 97.5 | 100.0 | - | - | - | 96.7 |
| IMO-AnswerBench | no tools | 78.6 | 76.0* | 65.9* | 45.8 | 76.0* | 73.1 |
| GPQA | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | 87.5 |
| Agentic Search Tasks | | | | | | | |
| BrowseComp | w/ tools | 60.2 | 54.9 | 24.1 | 7.4 | 40.1 | - |
| BrowseComp-ZH | w/ tools | 62.3 | 63.0* | 42.4* | 22.2 | 47.9 | - |
| Seal-0 | w/ tools | 56.3 | 51.4* | 53.4* | 25.2 | 38.5* | - |
| FinSearchComp-T3 | w/ tools | 47.4 | 48.5* | 44.0* | 10.4 | 27.0* | - |
| Frames | w/ tools | 87.0 | 86.0* | 85.0* | 58.1 | 80.2* | - |
| Coding Tasks | | | | | | | |
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | 77.2 | 69.2 | 67.8 | - |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3* | 68.0 | 55.9 | 57.9 | - |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3* | 44.3 | 33.5 | 30.6 | - |
| SciCode | no tools | 44.8 | 42.9 | 44.7 | 30.7 | 37.7 | - |
| LiveCodeBench | no tools | 64.8 | 64.4 | 60.4 | 49.8 | 60.8 | - |
| Terminal-Bench | w/ tools | 36.8 | 42.0 | - | 5.0 | 26.7 | - |
| General Tasks | | | | | | | |
| MMLU-Pro | no tools | 84.6 | 87.1 | 87.5 | 81.9 | 85.0 | - |
| MMLU-Redux | no tools | 94.4 | 95.3 | 95.6 | 92.7 | 93.7 | - |
| Longform Writing | no tools | 73.8 | 71.4 | 79.8 | 62.8 | 72.5 | - |
| HealthBench | no tools | 58.0 | 67.2 | 44.2 | 43.8 | 46.9 | - |
- Data points marked with * are taken directly from the model's tech report or blog.
- All benchmark results are reported under INT4 precision.
- Heavy Mode employs an efficient parallel strategy: it first rolls out eight trajectories simultaneously, then reflectively aggregates all outputs to generate the final result (a schematic sketch follows below).
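As an illustration only, and not Moonshot's evaluation harness, the roll-out-then-aggregate pattern can be sketched against any OpenAI-compatible endpoint. The aggregation prompt and the use of the `n` sampling parameter below are assumptions:

```python
import openai

def heavy_mode(client: openai.OpenAI, model: str, question: str, n: int = 8) -> str:
    # Roll out n independent trajectories for the same question.
    rollouts = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        n=n,
        temperature=1.0,
    )
    answers = [c.message.content for c in rollouts.choices]
    # Reflectively aggregate: ask the model to reconcile the candidate answers.
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{a}" for i, a in enumerate(answers))
    final = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Question:\n{question}\n\n{numbered}\n\n"
                       "Review the candidate answers above and produce a single final answer.",
        }],
        temperature=1.0,
    )
    return final.choices[0].message.content
```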
Inference
Acceleration Engine: vLLM, SGLang, KTransformers
Test Hardware: NVIDIA H100, NVIDIA A100
Additional Details
Deployment
You can access the Kimi K2 Thinking API at https://platform.moonshot.ai, which provides OpenAI/Anthropic-compatible endpoints.
Currently, we recommend running Kimi K2 Thinking on the following inference engines:
- vLLM
- SGLang
- KTransformers
Deployment examples can be found in the Model Deployment Guide.
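For orientation, here is a schematic sketch of loading the model through vLLM's offline Python API. The parallelism setting is a placeholder, and a 1T-parameter MoE checkpoint needs a multi-GPU configuration taken from the Model Deployment Guide rather than these defaults:

```python
from vllm import LLM, SamplingParams

# Placeholder parallelism; consult the Model Deployment Guide for real settings.
llm = LLM(
    model="moonshotai/Kimi-K2-Thinking",
    trust_remote_code=True,
    tensor_parallel_size=8,
)
outputs = llm.chat(
    [{"role": "user", "content": "which one is bigger, 9.11 or 9.9?"}],
    SamplingParams(temperature=1.0, max_tokens=512),
)
print(outputs[0].outputs[0].text)
```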
Model Usage
Once the local inference service is up, you can interact with it through the chat endpoint:
```python
import openai

def simple_chat(client: openai.OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "which one is bigger, 9.11 or 9.9? think carefully."}]},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=1.0,
        max_tokens=4096,
    )
    # The final answer and the reasoning trace come back as separate fields.
    print(f"k2 answer: {response.choices[0].message.content}")
    print("=====below is reasoning content======")
    print(f"reasoning content: {response.choices[0].message.reasoning_content}")
```
> [!NOTE]
> The recommended temperature for Kimi K2 Thinking is `temperature = 1.0`.
> If no special instructions are required, the system prompt above is a good default.
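Because the model supports streaming (see the Output properties above), the reasoning trace can also be consumed incrementally. The sketch below assumes the serving engine mirrors `reasoning_content` onto streamed deltas, which depends on the engine and its reasoning-parser configuration:

```python
def stream_chat(client: openai.OpenAI, model_name: str):
    stream = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "which one is bigger, 9.11 or 9.9? think carefully."}],
        stream=True,
        temperature=1.0,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta
        # Reasoning tokens (when exposed by the engine) arrive separately from answer tokens.
        if getattr(delta, "reasoning_content", None):
            print(delta.reasoning_content, end="", flush=True)
        elif delta.content:
            print(delta.content, end="", flush=True)
    print()
```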
Tool Calling
Kimi K2 Thinking has the same tool calling settings as Kimi K2 Instruct.
To enable tool calling, pass the list of available tools in each request; the model will then autonomously decide when and how to invoke them.
The following example demonstrates calling a weather tool end-to-end:
```python
import json

from openai import OpenAI

# Your tool implementation
def get_weather(city: str) -> dict:
    return {"weather": "Sunny"}

# Tool schema definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information. Call this when the user asks about the weather.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "Name of the city"
                }
            }
        }
    }
}]

# Map tool names to their implementations
tool_map = {
    "get_weather": get_weather
}

def tool_call_with_client(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What's the weather like in Beijing today? Use the tool to check."}
    ]
    finish_reason = None
    # Keep generating until the model stops requesting tool calls.
    while finish_reason is None or finish_reason == "tool_calls":
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=1.0,
            tools=tools,  # tool list defined above
            tool_choice="auto"
        )
        choice = completion.choices[0]
        finish_reason = choice.finish_reason
        if finish_reason == "tool_calls":
            # Echo the assistant turn (including its tool calls) back into the history.
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tool_call_name = tool_call.function.name
                tool_call_arguments = json.loads(tool_call.function.arguments)
                tool_function = tool_map[tool_call_name]
                tool_result = tool_function(**tool_call_arguments)
                print("tool_result:", tool_result)
                # Return the tool output to the model as a tool message.
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": tool_call_name,
                    "content": json.dumps(tool_result)
                })
    print("-" * 100)
    print(choice.message.content)
```
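To run the example end to end, construct the client as in the earlier sketch and call the function; the served model name is again a placeholder:

```python
# Placeholder endpoint and model name; substitute your deployment's values.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tool_call_with_client(client, "moonshotai/Kimi-K2-Thinking")
```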
The tool_call_with_client function implements the pipeline from user query to tool execution.
This pipeline requires the inference engine to support Kimi K2 Thinking's native tool-parsing logic.
For more information, see the Tool Calling Guide.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Please report security vulnerabilities or NVIDIA AI Concerns here.
