google / gemma-4-31b-it

Gemma 4 31B IT

Description

Gemma 4 31B IT is an open multimodal model built by Google DeepMind that handles text and image inputs, can process video as sequences of frames, and generates text output. It is designed to deliver frontier-level performance for reasoning, agentic workflows, coding, and multimodal understanding on consumer GPUs and workstations, with a 256K-token context window and support for over 140 languages. The model uses a hybrid attention mechanism that interleaves local sliding-window and full global attention, with unified Keys and Values in global layers and Proportional RoPE (p-RoPE) to support long-context performance.
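The interleaved local/global attention layout described above can be sketched in a few lines. This is purely illustrative: the actual local-to-global layer ratio and sliding-window size for Gemma 4 are not stated on this card and are assumptions here.

```python
# Toy sketch of a hybrid attention layout: most layers use local sliding-window
# attention, with periodic full-global layers. The ratio (one global layer per
# `global_every` layers) and the window size are assumed values, not Gemma 4's.

def layer_attention_pattern(num_layers, global_every=6):
    """Return 'local' / 'global' per layer, one global layer per `global_every`."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(num_layers)]

def sliding_window_mask(seq_len, window):
    """Boolean causal mask: each query attends to at most `window` prior tokens."""
    return [[q - window < k <= q for k in range(seq_len)]
            for q in range(seq_len)]
```

Local layers keep the KV cache small at long context lengths; the occasional global layers (with unified Keys and Values, per the description) propagate information across the full 256K window.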

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration:

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Gemma 4 31B IT Model Card.

License and Terms of Use:

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Apache License, Version 2.0.

Deployment Geography:

Global

Use Case:

Designed for text generation, chatbots and conversational AI, text summarization, image data extraction, reasoning, coding, multimodal understanding, function calling, and research or educational use.

Release Date:

Build.NVIDIA.com: 04/02/2026 via link
Hugging Face: 04/02/2026 via link

Model Architecture:

Architecture Type: Transformer
Network Architecture: Gemma 4
Total Parameters: 30.7B
Vocabulary Size: 262,144

Input:

Input Types: Text, Image, Video
Input Formats: Text (String), Image (Red, Green, Blue (RGB)), Video (MP4/WebM)
Input Parameters: One-Dimensional (1D), Two-Dimensional (2D), Three-Dimensional (3D)
Other Input Properties: Supports variable image aspect ratios and resolutions, configurable visual token budgets of 70, 140, 280, 560, and 1120, and video inputs up to 60 seconds at one frame per second.
Input Context Length (ISL): 256K

Output:

Output Types: Text
Output Format: String
Output Parameters: One-Dimensional (1D)
Other Output Properties: Generates text responses for chat, reasoning, coding, multimodal understanding, and function-calling workflows.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engines:

  • vLLM
  • SGLang

Supported Hardware:

  • NVIDIA Ampere: A100
  • NVIDIA Blackwell: B100, B200, GB200
  • NVIDIA Hopper: H100, H200

Preferred Operating Systems: Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s)

Gemma 4 31B IT v1.0

Training, Testing, and Evaluation Datasets:

Training Dataset

Data Modality: Text, Image, Audio, Other (Code)
Training Data Collection: Automated
Training Labeling: Undisclosed
Training Properties: Large-scale multimodal pre-training data spanning web documents, code, images, and audio, with a cutoff date of January 2025 and coverage in over 140 languages. Data was filtered for CSAM, sensitive data, quality, and safety.

Testing Dataset

Testing Data Collection: Undisclosed
Testing Labeling: Undisclosed
Testing Properties: Undisclosed

Evaluation Dataset

Evaluation Benchmark Score: Benchmark results are provided in the Gemma 4 31B IT Model Card, comparing Gemma 4 31B IT to similarly sized models across text and vision tasks. Specific scores vary by benchmark.
Evaluation Data Collection: Hybrid: Automated, Human
Evaluation Labeling: Hybrid: Automated, Human
Evaluation Properties: Gemma 4 models were evaluated across reasoning, coding, vision, long-context, and safety tasks, with benchmark table results reported for instruction-tuned models.

Evaluation Methodology Notes:

  • Tau2 is reported as an average over 3 runs.
  • HLE is reported with and without search.
  • OmniDocBench 1.5 uses average edit distance, where lower is better.
  • MRCR v2 8 needle 128k is reported as an average.
  • Safety evaluations included both automated and human assessments, and all safety testing was conducted without safety filters for both text-to-text and image-to-text settings.
Detailed Benchmark Comparison Table

| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 E4B | Gemma 4 E2B | Gemma 3 27B (no think) |
|---|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 60.0% | 67.6% |
| AIME 2026 (no tools) | 89.2% | 88.3% | 42.5% | 37.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 44.0% | 29.1% |
| Codeforces ELO | 2150 | 1718 | 940 | 633 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 43.4% | 42.4% |
| Tau2 (average over 3 runs) | 76.9% | 68.2% | 42.2% | 24.5% | 16.2% |
| HLE (no tools) | 19.5% | 8.7% | - | - | - |
| HLE (with search) | 26.5% | 17.2% | - | - | - |
| BigBench Extra Hard | 74.4% | 64.8% | 33.1% | 21.9% | 19.3% |
| MMMLU | 88.4% | 86.3% | 76.6% | 67.4% | 70.7% |
| Vision | | | | | |
| MMMU Pro | 76.9% | 73.8% | 52.6% | 44.2% | 49.7% |
| OmniDocBench 1.5 (average edit distance, lower is better) | 0.131 | 0.149 | 0.181 | 0.290 | 0.365 |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 52.4% | 46.0% |
| MedXPertQA MM | 61.3% | 58.1% | 28.7% | 23.5% | - |
| Audio | | | | | |
| CoVoST | - | - | 35.54 | 33.47 | - |
| FLEURS (lower is better) | - | - | 0.08 | 0.09 | - |
| Long Context | | | | | |
| MRCR v2 8 needle 128k (average) | 66.4% | 44.1% | 25.4% | 19.1% | 13.5% |

Inference

Acceleration Engine: vLLM
Test Hardware: NVIDIA Hopper H100

Additional Details

Best Practices

For best performance, use the partner-recommended sampling configuration of temperature=1.0, top_p=0.95, and top_k=64 across use cases.
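A minimal sketch of applying these settings when calling an OpenAI-compatible endpoint (as served by vLLM or SGLang). The model identifier and payload shape are assumptions; `top_k` is not part of the base OpenAI chat schema, but vLLM's server accepts it as an extra sampling parameter.

```python
# Build a chat-completion request body with the partner-recommended sampling
# configuration. Endpoint, model name, and field placement are illustrative.

RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 64}

def build_chat_request(messages, model="google/gemma-4-31b-it", max_tokens=1024):
    return {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": RECOMMENDED_SAMPLING["temperature"],
        "top_p": RECOMMENDED_SAMPLING["top_p"],
        # Non-standard field: vLLM's OpenAI-compatible server accepts top_k
        # directly (other servers may require an extensions/extra_body field).
        "top_k": RECOMMENDED_SAMPLING["top_k"],
    }
```

The same three values are recommended across use cases, so they can be fixed once rather than tuned per task.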

Thinking Mode Configuration

Compared to Gemma 3, Gemma 4 uses standard system, assistant, and user roles. Thinking is enabled by including the <|think|> token at the start of the system prompt and disabled by removing it. When thinking is enabled, the model outputs its internal reasoning followed by the final answer using the structure <|channel>thought\n[Internal reasoning]<channel|>. For models other than E2B and E4B, including 31B, disabling thinking still produces the tags with an empty thought block before the final answer: <|channel>thought\n<channel|>[Final answer]. Libraries such as Transformers and llama.cpp may handle chat-template details automatically.
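The tag structure above can be split client-side when a chat-template library does not do it for you. This is a sketch only: the tag strings are taken verbatim from this card, and in practice Transformers or llama.cpp chat templates usually handle them.

```python
# Split a raw Gemma 4 response into (thought, final_answer) using the tag
# structure quoted above. Tag strings are as stated on this card; when
# thinking is disabled the thought block is empty but the tags still appear.

THOUGHT_OPEN = "<|channel>thought\n"
THOUGHT_CLOSE = "<channel|>"

def split_thought(text):
    """Return (thought, final_answer); thought is '' if absent or empty."""
    if text.startswith(THOUGHT_OPEN) and THOUGHT_CLOSE in text:
        body = text[len(THOUGHT_OPEN):]
        thought, answer = body.split(THOUGHT_CLOSE, 1)
        return thought, answer
    return "", text
```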

Multi-Turn Conversations

In multi-turn conversations, historical model output should include only the final response. Thought content from previous turns should not be included before the next user turn begins.
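A hedged sketch of sanitizing history before the next turn, using the same tag strings this card quotes for the thought block: assistant turns keep only the text after the closing tag.

```python
# Strip thought content from previous assistant turns so that replayed
# history contains only final responses, per the guidance above.
# Tag string is as quoted on this card.

THOUGHT_CLOSE = "<channel|>"

def strip_thoughts(messages):
    cleaned = []
    for msg in messages:
        content = msg["content"]
        if msg["role"] == "assistant" and THOUGHT_CLOSE in content:
            # Keep only the final answer that follows the closing tag.
            content = content.split(THOUGHT_CLOSE, 1)[1]
        cleaned.append({**msg, "content": content})
    return cleaned
```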

Multimodal Input Guidance

For optimal multimodal performance, place image content before text in the prompt. For variable image resolution, Gemma 4 uses a configurable visual token budget with supported values of 70, 140, 280, 560, and 1120. Lower budgets are better suited to classification, captioning, or video understanding where faster inference and processing many frames are preferred, while higher budgets are better suited to OCR, document parsing, or reading small text.
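The ordering and budget guidance can be sketched as below. The content-part layout follows the common OpenAI-style multimodal message shape, and the budget heuristic simply encodes the rule of thumb above; both are illustrative, not an official API.

```python
# Illustrative helpers: put image content before text, and pick a visual
# token budget from the values supported per this card. The task->budget
# mapping is a heuristic restating the guidance above, not a fixed rule.

SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)

def pick_budget(task):
    """High budget for OCR / document parsing / small text; low otherwise."""
    return 1120 if task in {"ocr", "document_parsing", "small_text"} else 140

def build_multimodal_content(image_url, prompt):
    # Image first, then text, per the ordering recommendation.
    return [
        {"type": "image_url", "image_url": {"url": image_url}},
        {"type": "text", "text": prompt},
    ]
```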

Video Length

The 31B model can process video as frames and supports video inputs up to 60 seconds when frames are processed at one frame per second.
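A minimal sketch of choosing which frames to extract under this limit; actual frame extraction (e.g. via OpenCV or ffmpeg) is out of scope and left to the caller.

```python
# Compute frame timestamps for 1-fps video sampling, clipped to the
# 60-second limit stated above. Purely illustrative scheduling logic.

MAX_VIDEO_SECONDS = 60

def frame_timestamps(duration_s, fps=1.0):
    """Timestamps (seconds) of frames to extract, clipped to the limit."""
    usable = min(duration_s, MAX_VIDEO_SECONDS)
    step = 1.0 / fps
    return [i * step for i in range(int(usable * fps))]
```

Longer videos are truncated rather than rejected here; a production pipeline might instead subsample evenly across the full duration.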

Usage and Limitations

These models have certain limitations that users should be aware of.

Intended Usage

The Gemma 4 model card describes a broad intended-usage set for the Gemma 4 family across content creation, communication, research, and education. Example uses include:

  • Text generation for creative and practical formats such as poems, scripts, code, marketing copy, and email drafts.
  • Chatbots and conversational AI for customer service, virtual assistants, or interactive applications.
  • Text summarization for corpora, research papers, or reports.
  • Image data extraction for interpreting and summarizing visual data into text communications.
  • Natural language processing and VLM research.
  • Language learning tools.
  • Knowledge exploration over large bodies of text.

The card also notes audio processing and interaction for the smaller E2B and E4B models only.

Limitations

  • Training data quality and diversity significantly influence model capabilities, and biases or gaps in the training data can lead to limitations in model responses.
  • The scope of the training dataset affects which subject areas the model can handle effectively.
  • Models perform best on tasks framed with clear prompts and instructions; open-ended or highly complex tasks may be challenging.
  • Performance can be influenced by the amount of context provided, with longer context often helping only up to a point.
  • Natural language ambiguity and nuance, including sarcasm and figurative language, may still be difficult for the model to handle reliably.
  • The model generates responses from learned training patterns rather than from a live knowledge base, so outputs can contain incorrect or outdated factual statements.
  • The model may lack the ability to apply common-sense reasoning in some situations.

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please make sure you have proper rights and permissions for all input image and video content; if image or video includes people, personal health information, or intellectual property, the image or video generated will not blur or maintain proportions of image subjects included.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.
