moonshotai / kimi-k2-thinking

Kimi-K2-Thinking

Description

Kimi K2 Thinking is the most capable open-source thinking model: a thinking agent that reasons step-by-step while dynamically invoking tools. It sets a new state-of-the-art on Humanity's Last Exam (HLE), BrowseComp, and other benchmarks by dramatically scaling multi-step reasoning depth and maintaining stable tool use across 200–300 sequential calls. At the same time, Kimi K2 Thinking is a native INT4 quantization model with a 256K context window, cutting inference latency and GPU memory usage without quality loss.

This model is ready for commercial/non-commercial use.


Third-Party Community Consideration:

This model is not owned or developed by NVIDIA. This model has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Kimi-K2-Thinking Model Card.


License and Terms of Use:

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Modified MIT License.


Deployment Geography:

Global


Use Case:

This model is designed for advanced reasoning, agentic AI with deep thinking capabilities, multi-step problem-solving with tool orchestration, complex mathematical reasoning, coding with autonomous workflows, and research tasks requiring long-horizon agency. It can be used for autonomous research workflows, complex coding projects spanning hundreds of steps, mathematical problem-solving with extended reasoning, web browsing and information synthesis, and tool-orchestrated task execution.

Key Features

  • Deep Thinking & Tool Orchestration: End-to-end trained to interleave chain-of-thought reasoning with function calls, enabling autonomous research, coding, and writing workflows that last hundreds of steps without drift (see the sketch after this list).
  • Native INT4 Quantization: Quantization-Aware Training (QAT) is employed in the post-training stage to achieve a lossless 2x speed-up in low-latency mode.
  • Stable Long-Horizon Agency: Maintains coherent goal-directed behavior across up to 200–300 consecutive tool invocations, surpassing prior models that degrade after 30–50 steps.
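
To make the interleaved format concrete, here is a minimal sketch of what one multi-step trajectory might look like. The field names (reasoning_content, tool_calls) follow the Input/Output properties described later in this card; the payload values themselves are hypothetical.

# Illustrative shape of one interleaved turn; field names follow the
# Input/Output sections below, payload values are hypothetical.
trajectory = [
    {"role": "user", "content": "How many moons does Mars have, squared?"},
    {
        "role": "assistant",
        "reasoning_content": "I should verify the moon count with a search tool.",
        "tool_calls": [{
            "id": "call_0",
            "type": "function",
            "function": {"name": "search", "arguments": "{\"query\": \"moons of Mars\"}"},
        }],
    },
    {"role": "tool", "tool_call_id": "call_0", "content": "Mars has 2 moons: Phobos and Deimos."},
    {
        "role": "assistant",
        "reasoning_content": "2 squared is 4.",
        "content": "Mars has 2 moons, and 2 squared is 4.",
    },
]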

Release Date:

  • build.nvidia.com: 12/08/2025, available via link
  • Hugging Face: available via link


Reference(s):



Model Architecture:

Architecture Type: Transformer

  • Network Architecture: Mixture-of-Experts (MoE)
  • Total Parameters: 1T
  • Activated Parameters: 32B (see the sanity check after this list)
  • Number of Layers (Dense layer included): 61
  • Number of Dense Layers: 1
  • Attention Hidden Dimension: 7168
  • MoE Hidden Dimension (per Expert): 2048
  • Number of Attention Heads: 64
  • Number of Experts: 384
  • Selected Experts per Token: 8
  • Number of Shared Experts: 1
  • Vocabulary Size: 160K
  • Context Length: 256K
  • Attention Mechanism: MLA
  • Activation Function: SwiGLU
  • Base Model: Kimi-K2-Base
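
The headline parameter counts can be roughly reproduced from the configuration above. The back-of-envelope sketch below counts only the SwiGLU expert FFN weights and deliberately ignores attention, embedding, and router parameters, so it is an estimate rather than an exact accounting.

# Rough sanity check of the 1T total / 32B activated figures.
# Counts only SwiGLU expert FFN weights (gate, up, down projections);
# attention, embeddings, and router weights are ignored.
hidden, moe_hidden = 7168, 2048
moe_layers = 61 - 1                    # 61 layers total, 1 of them dense
per_expert = 3 * hidden * moe_hidden   # SwiGLU: gate + up + down matrices
total = 384 * moe_layers * per_expert            # all experts, all MoE layers
active = (8 + 1) * moe_layers * per_expert       # 8 routed + 1 shared expert
print(f"total  ~ {total / 1e12:.2f}T parameters")   # ~1.01T
print(f"active ~ {active / 1e9:.1f}B parameters")   # ~23.8B; attention and
# embedding weights account for most of the gap up to the 32B figure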

Input:

Input Types: Text, Tool Definitions

Input Formats: String, JSON

Input Parameters: One-Dimensional (1D)

Other Input Properties: The model has a context window of up to 256,000 tokens. Supports interleaved reasoning traces and tool calls.

Input Context Length (ISL): 256K

Output:

Output Formats: String, JSON (for tool calls)

Output Parameters: One-Dimensional (1D)

Other Output Properties: Includes separate reasoning_content traces alongside final responses. Supports streaming and non-streaming modes.


Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.


Software Integration:

Runtime Engines:

  • vLLM
  • SGLang
  • KTransformers

Supported Hardware:

  • NVIDIA Blackwell: B100, B200, GB200
  • NVIDIA Hopper: H100, H200

Operating Systems: Linux


Model Version(s):

Kimi K2 Thinking v1.0


Training, Testing, and Evaluation Datasets:

Training Dataset

Training Data Collection: Undisclosed

Training Labeling: Undisclosed

Training Properties: Trained with Quantization-Aware Training (QAT) during post-training phase for native INT4 support.
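
The QAT recipe itself is not disclosed; as a generic illustration of the technique, the PyTorch sketch below fake-quantizes weights to INT4 in the forward pass while keeping full-precision master weights, with a straight-through estimator for gradients. The group size and symmetric scaling are assumptions, not the model's actual configuration.

import torch

def fake_quant_int4(w: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Illustrative symmetric per-group INT4 fake quantization for QAT
    (not Moonshot's disclosed recipe). Assumes w.numel() is divisible
    by group_size."""
    g = w.reshape(-1, group_size)
    # Per-group scale mapping the max magnitude into the INT4 range [-8, 7].
    scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7
    q = (g / scale).round().clamp(-8, 7)    # quantize
    deq = (q * scale).reshape(w.shape)      # dequantize
    # Straight-through estimator: the forward pass sees deq,
    # the backward pass treats the rounding as identity on w.
    return w + (deq - w).detach()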

Testing Dataset

Testing Data Collection: Undisclosed

Testing Labeling: Undisclosed

Testing Properties: Undisclosed

Evaluation Dataset

Evaluation Benchmark Score:

  • HLE (Text-only) with tools: 44.9%
  • HLE Heavy Mode: 51.0%
  • AIME25 with python: 99.1%
  • AIME25 Heavy Mode: 100.0%
  • HMMT25 with python: 95.1%
  • HMMT25 Heavy Mode: 97.5%
  • GPQA (no tools): 84.5%
  • BrowseComp with tools: 60.2%
  • BrowseComp-ZH with tools: 62.3%
  • SWE-bench Verified with tools: 71.3%
  • MMLU-Pro (no tools): 84.6%
  • LiveCodeBench: 64.8% Pass@1

Evaluation Data Collection: Hybrid: Human, Automated

Evaluation Labeling: Human

Evaluation Properties: HLE, AIME25, HMMT25, IMO-AnswerBench, GPQA, MMLU-Pro, MMLU-Redux, Longform Writing, HealthBench, BrowseComp, BrowseComp-ZH, Seal-0, FinSearchComp-T3, Frames, SWE-bench Verified, SWE-bench Multilingual, Multi-SWE-bench, SciCode, LiveCodeBench, Terminal-Bench

Evaluation Results

| Benchmark | Setting | Kimi K2 Thinking | GPT-5 (High) | Claude Sonnet 4.5 (Thinking) | K2 0905 | DeepSeek-V3.2 | Grok-4 |
|---|---|---|---|---|---|---|---|
| **Reasoning Tasks** | | | | | | | |
| HLE (Text-only) | no tools | 23.9 | **26.3** | 19.8* | 7.9 | 19.8 | 25.4 |
| HLE | w/ tools | **44.9** | 41.7* | 32.0* | 21.7 | 20.3* | 41.0 |
| HLE | heavy | **51.0** | 42.0 | - | - | - | 50.7 |
| AIME25 | no tools | 94.5 | **94.6** | 87.0 | 51.0 | 89.3 | 91.7 |
| AIME25 | w/ python | 99.1 | 99.6 | **100.0** | 75.2 | 58.1* | 98.8 |
| AIME25 | heavy | **100.0** | **100.0** | - | - | - | **100.0** |
| HMMT25 | no tools | 89.4 | **93.3** | 74.6* | 38.8 | 83.6 | 90.0 |
| HMMT25 | w/ python | 95.1 | **96.7** | 88.8* | 70.4 | 49.5* | 93.9 |
| HMMT25 | heavy | 97.5 | **100.0** | - | - | - | 96.7 |
| IMO-AnswerBench | no tools | **78.6** | 76.0* | 65.9* | 45.8 | 76.0* | 73.1 |
| GPQA | no tools | 84.5 | 85.7 | 83.4 | 74.2 | 79.9 | **87.5** |
| **Agentic Search Tasks** | | | | | | | |
| BrowseComp | w/ tools | **60.2** | 54.9 | 24.1 | 7.4 | 40.1 | - |
| BrowseComp-ZH | w/ tools | 62.3 | **63.0***| 42.4* | 22.2 | 47.9 | - |
| Seal-0 | w/ tools | **56.3** | 51.4* | 53.4* | 25.2 | 38.5* | - |
| FinSearchComp-T3 | w/ tools | 47.4 | **48.5***| 44.0* | 10.4 | 27.0* | - |
| Frames | w/ tools | **87.0** | 86.0* | 85.0* | 58.1 | 80.2* | - |
| **Coding Tasks** | | | | | | | |
| SWE-bench Verified | w/ tools | 71.3 | 74.9 | **77.2** | 69.2 | 67.8 | - |
| SWE-bench Multilingual | w/ tools | 61.1 | 55.3* | **68.0** | 55.9 | 57.9 | - |
| Multi-SWE-bench | w/ tools | 41.9 | 39.3* | **44.3** | 33.5 | 30.6 | - |
| SciCode | no tools | **44.8** | 42.9 | 44.7 | 30.7 | 37.7 | - |
| LiveCodeBench | no tools | **64.8** | 64.4 | 60.4 | 49.8 | 60.8 | - |
| Terminal-Bench | w/ tools | 36.8 | **42.0** | - | 5.0 | 26.7 | - |
| **General Tasks** | | | | | | | |
| MMLU-Pro | no tools | 84.6 | 87.1 | **87.5** | 81.9 | 85.0 | - |
| MMLU-Redux | no tools | 94.4 | 95.3 | **95.6** | 92.7 | 93.7 | - |
| Longform Writing | no tools | 73.8 | 71.4 | **79.8** | 62.8 | 72.5 | - |
| HealthBench | no tools | 58.0 | **67.2** | 44.2 | 43.8 | 46.9 | - |

  • Bold denotes the best score for each benchmark.
  • Data points marked with * are taken directly from the model's tech report or blog.
  • All benchmark results are reported under INT4 precision.
  • Heavy Mode employs an efficient parallel strategy: it first rolls out eight trajectories simultaneously, then reflectively aggregates all outputs to generate the final result (a sketch follows below).
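
A minimal sketch of that Heavy Mode strategy is shown below, assuming an OpenAI-compatible endpoint; the sampling parameters and aggregation prompt are illustrative, not Moonshot's published evaluation harness.

import concurrent.futures
from openai import OpenAI

def heavy_mode(client: OpenAI, model: str, question: str, n: int = 8) -> str:
    """Illustrative Heavy Mode: sample n trajectories in parallel,
    then ask the model to reflectively aggregate them."""
    def rollout(_):
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
            temperature=1.0,
        )
        return r.choices[0].message.content

    # Roll out n independent trajectories concurrently.
    with concurrent.futures.ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(rollout, range(n)))

    # Reflectively aggregate all candidate answers into a final result.
    summary = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    final = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\n{summary}\n\n"
                       "Review the candidate answers above and produce "
                       "the single best final answer.",
        }],
        temperature=1.0,
    )
    return final.choices[0].message.content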

Inference

Acceleration Engine: vLLM, SGLang, KTransformers

Test Hardware: NVIDIA H100, NVIDIA A100


Additional Details

Deployment

You can access the Kimi K2 Thinking API at https://platform.moonshot.ai, where we provide an OpenAI/Anthropic-compatible API.

Currently, Kimi K2 Thinking is recommended to run on the following inference engines:

  • vLLM
  • SGLang
  • KTransformers

Deployment examples can be found in the Model Deployment Guide.


Model Usage

Once the local inference service is up, you can interact with it through the chat endpoint:

import openai

def simple_chat(client: openai.OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": [{"type": "text", "text": "which one is bigger, 9.11 or 9.9? think carefully."}]},
    ]
    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        stream=False,
        temperature=1.0,
        max_tokens=4096
    )
    print(f"k2 answer: {response.choices[0].message.content}")
    print("=====below is reasoning content======")
    # The thinking trace is returned alongside, but separate from, the final answer.
    print(f"reasoning content: {response.choices[0].message.reasoning_content}")

[!NOTE]
The recommended temperature for Kimi K2 Thinking is temperature = 1.0.
If no special instructions are required, the system prompt above is a good default.
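
For streaming, the reasoning trace arrives incrementally alongside the final answer. The sketch below assumes the serving engine exposes reasoning_content on each streamed delta, mirroring the non-streaming field above; verify the attribute name against your engine's documentation.

def stream_chat(client: openai.OpenAI, model_name: str):
    stream = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": "which one is bigger, 9.11 or 9.9? think carefully."}],
        stream=True,
        temperature=1.0,
        max_tokens=4096,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta
        # Assumption: the engine surfaces the thinking trace as
        # `reasoning_content` on each delta, like the non-streaming field.
        if getattr(delta, "reasoning_content", None):
            print(delta.reasoning_content, end="", flush=True)
        if delta.content:
            print(delta.content, end="", flush=True)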


Tool Calling

Kimi K2 Thinking uses the same tool-calling settings as Kimi K2 Instruct.
To enable them, pass the list of available tools in each request; the model will then autonomously decide when and how to invoke them.

The following example demonstrates calling a weather tool end-to-end:

import json

from openai import OpenAI

# Your tool implementation
def get_weather(city: str) -> dict:
    return {"weather": "Sunny"}

# Tool schema definition
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieve current weather information. Call this when the user asks about the weather.",
        "parameters": {
            "type": "object",
            "required": ["city"],
            "properties": {
                "city": {
                    "type": "string",
                    "description": "Name of the city"
                }
            }
        }
    }
}]

# Map tool names to their implementations
tool_map = {
    "get_weather": get_weather
}

def tool_call_with_client(client: OpenAI, model_name: str):
    messages = [
        {"role": "system", "content": "You are Kimi, an AI assistant created by Moonshot AI."},
        {"role": "user", "content": "What's the weather like in Beijing today? Use the tool to check."}
    ]
    finish_reason = None
    while finish_reason is None or finish_reason == "tool_calls":
        completion = client.chat.completions.create(
            model=model_name,
            messages=messages,
            temperature=1.0,
            tools=tools,          # tool list defined above
            tool_choice="auto"
        )
        choice = completion.choices[0]
        finish_reason = choice.finish_reason
        if finish_reason == "tool_calls":
            messages.append(choice.message)
            for tool_call in choice.message.tool_calls:
                tool_call_name = tool_call.function.name
                tool_call_arguments = json.loads(tool_call.function.arguments)
                tool_function = tool_map[tool_call_name]
                tool_result = tool_function(**tool_call_arguments)
                print("tool_result:", tool_result)
                
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "name": tool_call_name,
                    "content": json.dumps(tool_result)
                })
    print("-" * 100)
    print(choice.message.content)

The tool_call_with_client function implements the pipeline from user query to tool execution.
This pipeline requires the inference engine to support Kimi K2 Thinking's native tool-parsing logic.
For more information, see the Tool Calling Guide.
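
To try the pipeline end-to-end against a local deployment, construct a client and invoke the function directly; the endpoint, API key, and served model name below are placeholders to replace with your deployment's values.

# Placeholder endpoint/key for a local OpenAI-compatible server;
# substitute your deployment's values and served model name.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tool_call_with_client(client, "moonshotai/Kimi-K2-Thinking")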


Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.
