microsoft / phi-4-multimodal-instruct

Overview

Description:

Phi-4-multimodal-instruct is a lightweight, open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4.0 models. The model processes text, image, and audio inputs and generates text outputs, with a 128K-token context length. It was enhanced through both supervised fine-tuning and direct preference optimization to support precise instruction adherence and safety measures.

This model is ready for commercial/non-commercial use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case; see the link to the Non-NVIDIA Phi-4-Multimodal-Instruct model card.

License/Terms of Use:

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the NVIDIA Community Model License. Additional Information: MIT License.

Deployment Geography:

Global

Release Date:

February 2025

Reference(s):

Phi-4-Multimodal-Instruct Model Card

Intended Use

Primary use cases:

The model is intended for broad multilingual and multimodal commercial and research use. It is suited to general-purpose AI systems and applications that require memory/compute-constrained environments, latency-bound scenarios, strong reasoning, function and tool calling, general image understanding, optical character recognition, and chart and table understanding.

The model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.

Out-of-scope use cases

The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models and multimodal models, as well as performance differences across languages, as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

Release Notes

This release of Phi-4-multimodal-instruct is based on valuable user feedback from the Phi-3 series. Previously, users could talk to the Mini and Vision models only by chaining a speech recognition model in front of them: one model transcribed the audio to text, and a second model handled the language or vision task. With such a pipeline, the core model never receives the full breadth of the input; for example, it cannot directly observe multiple speakers or background noise, nor jointly align speech, vision, and language information in the same representation space.

With Phi-4-multimodal-instruct, a single new open model has been trained across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. The model uses a new architecture, a larger vocabulary for efficiency, multilingual and multimodal support, and improved post-training techniques for instruction following and function calling, together with additional data, leading to substantial gains on key multimodal capabilities.

It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4-multimodal-instruct is welcomed and crucial to the model’s evolution and improvement. Thank you for being part of this journey!

Model Architecture:

Architecture Type: Phi-4-multimodal-instruct is a 5.6B-parameter multimodal transformer. It uses the pretrained Phi-4-mini as the backbone language model, together with advanced vision and speech encoders and adapters.

Input:

Input Type(s): Text, Image, Audio

Input Format(s): String, [.png, .jpg, .jpeg], [.mp3, .wav]

Input Parameters: [1D, 2D]

Other Properties Related to Input: Languages in training data | Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian

Note that these are for TEXT only. There is limited language support for IMAGE and AUDIO modalities.

  • Vision: English
  • Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

Vision

  • Any common RGB/grayscale image format (e.g., ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp") is supported.
  • Resolution depends on available GPU memory: higher resolutions and more images produce more tokens and therefore use more GPU memory. During training, up to 64 crops are supported; for a square image, this corresponds to a resolution of roughly (8 × 448) by (8 × 448) pixels. For multiple images, at most 64 frames are supported, and with more frames as input, the resolution of each frame needs to be reduced to fit in memory.

Audio

  • Any audio format that can be loaded by the soundfile package should be supported.
  • To keep performance satisfactory, the maximum suggested audio length is 40 seconds; for summarization tasks, the maximum suggested audio length is 30 minutes. A quick duration check is sketched below.
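
The snippet below is a minimal sketch (not part of the official sample code) that loads a clip with soundfile and checks it against the suggested limits above; the file path is a placeholder.

import soundfile as sf

MAX_GENERAL_SECONDS = 40              # suggested limit for general speech tasks
MAX_SUMMARIZATION_SECONDS = 30 * 60   # suggested limit for summarization tasks

# "clip.wav" is a placeholder path; any format soundfile can decode should work.
audio, samplerate = sf.read("clip.wav")
duration_seconds = len(audio) / samplerate  # len(audio) is the number of frames

if duration_seconds > MAX_SUMMARIZATION_SECONDS:
    print(f"{duration_seconds:.1f}s exceeds the suggested 30-minute limit; consider splitting the clip.")
elif duration_seconds > MAX_GENERAL_SECONDS:
    print(f"{duration_seconds:.1f}s is best suited to summarization-style prompts.")
else:
    print(f"{duration_seconds:.1f}s is within the suggested limit for general speech tasks.")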

Output:

Output Type(s): Text

Output Format(s): String

Output Parameters: 1D

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper

Supported Operating System(s):

  • Linux

Model Version(s):

Phi-4-multimodal-instruct v1.0

Training, Testing, and Evaluation Datasets:

Data Collection Methods: [Hybrid: Automated, Human, Synthetic]

GPUs: 512 A100-80G

Training Time: 28 days

Training Data: 5T text tokens, 2.3M hours of speech, and 1.1T image-text tokens

Training Dates: Trained between December 2024 and January 2025

Status: This is a static model trained on offline datasets with the cutoff date of June 2024 for publicly available data.

Languages in training data: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
Note that these are for TEXT only. There is limited language support for IMAGE and AUDIO modalities.
  • Vision: English
  • Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

Data Overview

Phi-4-multimodal-instruct’s training data includes a wide variety of sources, totaling 5 trillion text tokens, and is a combination of:

  1. Publicly available documents filtered for quality, selected high-quality educational data, and code
  2. Newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (e.g., science, daily activities, theory of mind, etc.)
  3. High quality human labeled data in chat format
  4. Selected high-quality image-text interleave data
  5. Synthetic and publicly available image, multi-image, and video data
  6. Anonymized in-house speech-text pair data with strong/weak transcriptions
  7. Selected high-quality publicly available and anonymized in-house speech data with task-specific supervisions
  8. Selected synthetic speech data
  9. Synthetic vision-speech data

Focus was placed on the quality of data that could potentially improve the model's reasoning ability, and the publicly available documents were filtered to contain a preferred level of knowledge. For example, the result of a Premier League game on a particular day might be good training data for large foundation models, but such information was removed from Phi-4-multimodal-instruct's training data to leave more capacity for reasoning, given the model's small size. The data collection process involved sourcing information from publicly available documents, with a focus on filtering out undesirable documents and images. To safeguard privacy, image and text data sources were filtered to remove or scrub potentially personal data from the training data.

The decontamination process involved normalizing and tokenizing the dataset, then generating and comparing n-grams between the target dataset and benchmark datasets. Samples with matching n-grams above a threshold were flagged as contaminated and removed from the dataset. A detailed contamination report was generated, summarizing the matched text, matching ratio, and filtered results for further analysis.
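
As an illustration of the kind of n-gram overlap check described above (a hedged sketch, not the exact pipeline used for Phi-4; the lowercasing, whitespace tokenization, n-gram size, and threshold are placeholder assumptions):

def ngrams(tokens, n=13):
    """Return the set of n-grams in a token sequence."""
    tokens = list(tokens)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample_text, benchmark_ngrams, n=13, threshold=0.2):
    """Flag a training sample whose n-gram overlap with benchmark data exceeds a threshold."""
    sample_ngrams = ngrams(sample_text.lower().split(), n)
    if not sample_ngrams:
        return False
    return len(sample_ngrams & benchmark_ngrams) / len(sample_ngrams) >= threshold

# Usage: build benchmark_ngrams once from the benchmark datasets, drop any training
# sample for which is_contaminated(...) returns True, and record the match ratio
# in a contamination report for further analysis.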

Safety

Approach

The Phi-4 family of models has adopted a robust safety post-training approach. This approach leverages a variety of both open-source and in-house generated datasets. The overall technique employed for safety alignment is a combination of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), and RLHF (Reinforcement Learning from Human Feedback) approaches by utilizing human-labeled and synthetic English-language datasets, including publicly available datasets focusing on helpfulness and harmlessness, as well as various questions and answers targeted to multiple safety categories. For non-English languages, existing datasets were extended via machine translation. Speech Safety datasets were generated by running Text Safety datasets through Azure TTS (Text-To-Speech) Service, for both English and non-English languages. Vision (text & images) Safety datasets were created to cover harm categories identified both in public and internal multi-modal RAI datasets.

Safety Evaluation and Red-Teaming

Various evaluation techniques including red teaming, adversarial conversation simulations, and multilingual safety evaluation benchmark datasets were leveraged to evaluate Phi-4 models’ propensity to produce undesirable outputs across multiple languages and risk categories. Several approaches were used to compensate for the limitations of one approach alone. Findings across the various evaluation methods indicate that safety post-training that was done as detailed in the Phi 3 Safety Post-Training paper had a positive impact across multiple languages and risk categories as observed by refusal rates (refusal to output undesirable outputs) and robustness to jailbreak techniques. Details on prior red team evaluations across Phi models can be found in the Phi 3 Safety Post-Training paper. For this release, the red teaming effort focused on the newest Audio input modality and on the following safety areas: harmful content, self-injury risks, and exploits. The model was found to be more susceptible to providing undesirable outputs when attacked with context manipulation or persuasive techniques. These findings applied to all languages, with the persuasive techniques mostly affecting French and Italian. This highlights the need for industry-wide investment in the development of high-quality safety evaluation datasets across multiple languages, including low resource languages, and risk areas that account for cultural nuances where those languages are spoken.

Vision Safety Evaluation

To assess model safety in scenarios involving both text and images, Microsoft’s Azure AI Evaluation SDK was utilized. This tool facilitates the simulation of single-turn conversations with the target model by providing prompt text and images designed to incite harmful responses. The target model's responses are subsequently evaluated by a capable model across multiple harm categories, including violence, sexual content, self-harm, hateful and unfair content, with each response scored based on the severity of the harm identified. The evaluation results were compared with those of Phi-3.5-Vision and open-source models of comparable size. In addition, we ran both an internal and the public RTVLM and VLGuard multi-modal (text & vision) RAI benchmarks, once again comparing scores with Phi-3.5-Vision and open-source models of comparable size. However, the model may be susceptible to language-specific attack prompts and cultural context.

Audio Safety Evaluation

In addition to extensive red teaming, the safety of the model was assessed through three distinct evaluations. First, as performed with text and vision inputs, Microsoft's Azure AI Evaluation SDK was leveraged to detect the presence of harmful content in the model's responses to speech prompts. Second, Microsoft's Speech Fairness evaluation was run to verify that speech-to-text transcription works well across a variety of demographics. Third, we proposed and evaluated a mitigation approach via a system message to help prevent the model from inferring sensitive attributes (such as gender, sexual orientation, profession, or medical condition) from the voice of a user.
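
A hypothetical example of such a system message, written in the chat format shown later in this card (illustrative wording only; the actual mitigation text is not published here):

<|system|>You are a helpful assistant. Do not infer or comment on a user's gender, age, accent, health, profession, or other personal attributes from their voice; respond only to the content of the request.<|end|><|user|><|audio_1|><|end|><|assistant|>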

Model Quality

To understand the capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform. Users can refer to the Phi-4-Mini model card for details of language benchmarks. Below is a high-level overview of the model quality on representative speech and vision benchmarks:

Speech Benchmarks

Phi-4-multimodal-instruct demonstrated strong performance in speech tasks:

  • Surpassed the expert ASR model WhisperV3 and the ST model SeamlessM4T-v2-Large in automatic speech recognition (ASR) and speech translation (ST).
  • Ranked number 1 on the Hugging Face OpenASR leaderboard with a word error rate of 6.14%, compared with the best prior result of 6.5%, as of February 18, 2025.
  • First open-source model capable of performing speech summarization, with performance close to GPT-4o.
  • Exhibited a gap with closed models like Gemini-2.0-Flash and GPT-4o-realtime-preview on the speech QA task. Efforts are ongoing to improve this capability in future iterations.

Vision Benchmarks

Vision-Speech Tasks

Phi-4-multimodal-instruct can process both image and audio together. The table below shows the model quality when the input query for vision content is synthetic speech on chart/table understanding and document reasoning tasks. Compared to other state-of-the-art omni models, Phi-4-multimodal-instruct achieves stronger performance on multiple benchmarks.

| Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
|---|---|---|---|---|---|
| s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
| s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
| s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
| s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
| Average | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |

Vision Tasks

To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. Below is a high-level overview of the model quality on representative benchmarks:

| Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen 2.5-VL-3B-ins | Intern VL 2.5-4B | Qwen 2.5-VL-7B-ins | Intern VL 2.5-8B | Gemini 2.0-Flash Lite-prv-0205 | Gemini 2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Popular aggregated benchmark | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
| MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
| MMMU-Pro (std / vision) | 38.5 | 21.8 | 29.9 | 32.4 | 38.7 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
| ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
| MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
| InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
| AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
| ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
| DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
| InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
| TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
| OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
| POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
| BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
| Video MME (16 frames) | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
| Average | 72.0 | 60.9 | 68.7 | 68.8 | 73.3 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |

Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much bigger models on multi-frame capabilities. BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very quickly but are still hard for current multimodal LLMs.

| Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|
| Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
| Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
| Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
| Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
| IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
| Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
| Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
| Object Localization | 52.5 | 55.7 | 53.3 | 55.7 | 58.2 | 63.9 | 67.2 | 58.2 | 65.6 |
| Relative Depth | 69.4 | 68.5 | 68.5 | 80.6 | 76.6 | 81.5 | 72.6 | 66.1 | 73.4 |
| Relative Reflectance | 26.9 | 38.8 | 38.8 | 32.8 | 38.8 | 33.6 | 34.3 | 38.1 | 38.1 |
| Semantic Correspondence | 52.5 | 32.4 | 33.8 | 28.8 | 24.5 | 56.1 | 55.4 | 43.9 | 47.5 |
| Spatial Relation | 72.7 | 80.4 | 86.0 | 88.8 | 86.7 | 74.1 | 79.0 | 74.8 | 83.2 |
| Visual Correspondence | 67.4 | 28.5 | 39.5 | 50.0 | 44.2 | 84.9 | 91.3 | 72.7 | 82.6 |
| Visual Similarity | 86.7 | 67.4 | 88.1 | 87.4 | 85.2 | 87.4 | 80.7 | 79.3 | 83.0 |
| Overall | 61.3 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |

Usage

Input Format

Given the nature of the training data, the Phi-4-Multimodal-Instruct model is best suited for prompts using the chat format as follows:

  1. Text Chat Format

    This format is used for general conversation and instructions:

    <|system|>You are a helpful assistant.<|end|><|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>
    
  2. Tool-enabled Function Call Format for Text

    This format is used when the user wants the model to provide function calls based on the given tools. The user should provide the available tools in the system prompt, wrapped by <|tool|> and <|/tool|> tokens. The tools should be specified in JSON format, using a JSON dump structure (see the Python sketch after this list for one way to assemble such a prompt). For example:

    <|system|>You are a helpful assistant with some tools.<|tool|>
    [{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}]
    <|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>
    
  3. Vision-Language Format

    This format is used for conversation with image:

    <|user|><|image_1|>Describe the image in detail.<|end|><|assistant|>
    

    For multiple images, the user needs to insert multiple image placeholders in the prompt as below:

    <|user|><|image_1|><|image_2|><|image_3|>Summarize the content of the images.<|end|><|assistant|>
    
  4. Speech-Language Format

    This format is used for various speech and audio tasks:

    <|user|><|audio_1|>{task prompt}<|end|><|assistant|>
    

    The task prompt can vary for different tasks.

    • Automatic Speech Recognition:
      <|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>
      
    • Automatic Speech Translation:
      <|user|><|audio_1|>Translate the audio to {lang}.<|end|><|assistant|>
      
    • Automatic Speech Translation with Chain-of-thought:
      <|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to {lang}. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>
      
    • Spoken-query Question Answering:
      <|user|><|audio_1|><|end|><|assistant|>
      
  5. Vision-Speech Format

    This format is used for conversation with both an image and audio. The audio may contain a query related to the image:

    <|user|><|image_1|><|audio_1|><|end|><|assistant|>

    For multiple images, the user needs to insert multiple image placeholders in the prompt as below:

    <|user|><|image_1|><|image_2|><|image_3|><|audio_1|><|end|><|assistant|>
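
For the tool-enabled format in particular, the system prompt is easiest to assemble programmatically. The following is a minimal Python sketch using the example tool definition from above; the helper name and structure are illustrative, not an official API:

import json

def build_tool_prompt(tools, user_message):
    # Serialize the tool definitions with a JSON dump, as described above,
    # and wrap them in <|tool|> ... <|/tool|> inside the system prompt.
    tools_json = json.dumps(tools)
    return (
        "<|system|>You are a helpful assistant with some tools."
        f"<|tool|>{tools_json}<|/tool|><|end|>"
        f"<|user|>{user_message}<|end|>"
        "<|assistant|>"
    )

tools = [{
    "name": "get_weather_updates",
    "description": "Fetches weather updates for a given city using the RapidAPI Weather API.",
    "parameters": {
        "city": {
            "description": "The name of the city for which to retrieve weather information.",
            "type": "str",
            "default": "London",
        }
    },
}]

print(build_tool_prompt(tools, "What is the weather like in Paris today?"))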

Vision

  • Any common RGB/grayscale image format (e.g., ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp") is supported.
  • Resolution depends on available GPU memory: higher resolutions and more images produce more tokens and therefore use more GPU memory. During training, up to 64 crops are supported; for a square image, this corresponds to a resolution of roughly (8 × 448) by (8 × 448) pixels. For multiple images, at most 64 frames are supported, and with more frames as input, the resolution of each frame needs to be reduced to fit in memory.

Audio

  • Any audio format that can be loaded by the soundfile package should be supported.
  • To keep performance satisfactory, the maximum suggested audio length is 40 seconds; for summarization tasks, the maximum suggested audio length is 30 minutes.

Loading the Model Locally

After obtaining the Phi-4-Multimodal-Instruct model checkpoints, users can use this sample code for inference.

import io
import requests
import soundfile
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

# Hugging Face model ID (or a local path to the downloaded checkpoints)
model_path = "microsoft/Phi-4-multimodal-instruct"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    # Requires the flash-attn package; use "eager" if it is not installed.
    _attn_implementation="flash_attention_2",
)

generation_config = GenerationConfig.from_pretrained(model_path, 'generation_config.json')

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
print(f'>>> Prompt\n{prompt}')
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')


speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'

print(f'>>> Prompt\n{prompt}')
# soundfile cannot read directly from a URL, so download the clip first
audio_url = 'https://voiceage.com/wbsamples/in_mono/Trailer.wav'
audio, samplerate = soundfile.read(io.BytesIO(requests.get(audio_url).content))
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

Inference:

Engine: vLLM

Test Hardware: NVIDIA H100
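
For reference, a rough sketch of serving the checkpoint with vLLM and querying its OpenAI-compatible endpoint is shown below. The serve command, port, and client parameters are illustrative assumptions; consult the vLLM documentation for the options supported by your version.

# Serve (shell): vllm serve microsoft/Phi-4-multimodal-instruct --trust-remote-code
# The client below assumes the default OpenAI-compatible endpoint on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-4-multimodal-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "How to explain Internet for a medieval knight?"},
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)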

Responsible AI Considerations

Like other language models, the Phi family of models can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:

  • Quality of Service: The Phi models are trained primarily on English language content across text, speech, and visual inputs, with some additional multilingual coverage. Performance may vary significantly across different modalities and languages:
    • Text: Languages other than English will experience reduced performance, with varying levels of degradation across different non-English languages. English language varieties with less representation in the training data may perform worse than standard American English.
    • Speech: Speech recognition and processing shows similar language-based performance patterns, with optimal performance for standard American English accents and pronunciations. Other English accents, dialects, and non-English languages may experience lower recognition accuracy and response quality. Background noise, audio quality, and speaking speed can further impact performance.
    • Vision: Visual processing capabilities may be influenced by cultural and geographical biases in the training data. The model may show reduced performance when analyzing images containing text in non-English languages or visual elements more commonly found in non-Western contexts. Image quality, lighting conditions, and composition can also affect processing accuracy.
  • Multilingual performance and safety gaps: We believe it is important to make language models more widely available across different languages, but the Phi 4 models still exhibit challenges common across multilingual releases. As with any deployment of LLMs, developers will be better positioned to test for performance or safety gaps for their linguistic and cultural context and customize the model with additional fine-tuning and appropriate safeguards.
  • Representation of Harms & Perpetuation of Stereotypes: These models can over- or under-represent groups of people, erase representation of some groups, or reinforce demeaning or negative stereotypes. Despite safety post-training, these limitations may still be present due to differing levels of representation of different groups, cultural contexts, or prevalence of examples of negative stereotypes in training data that reflect real-world patterns and societal biases.
  • Inappropriate or Offensive Content: These models may produce other types of inappropriate or offensive content, which may make them unsuitable to deploy in sensitive contexts without additional mitigations specific to the use case.
  • Information Reliability: Language models can generate nonsensical content or fabricate content that might sound reasonable but is inaccurate or outdated.
  • Limited Scope for Code: The majority of Phi 4 training data is based in Python and uses common packages such as "typing, math, random, collections, datetime, itertools". If the model generates Python scripts that utilize other packages or scripts in other languages, it is strongly recommended that users manually verify all API uses.
  • Long Conversation: Phi 4 models, like other models, can in some cases generate responses that are repetitive, unhelpful, or inconsistent in very long chat sessions in both English and non-English languages. Developers are encouraged to place appropriate mitigations, like limiting conversation turns to account for the possible conversational drift.

Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural, linguistic context. Phi 4 family of models are general purpose models. As developers plan to deploy these models for specific use cases, they are encouraged to fine-tune the models for their use case and leverage the models as part of broader AI systems with language-specific safeguards in place. Important areas for consideration include:

  • Allocation: Models may not be suitable for scenarios that could have consequential impact on legal status or the allocation of resources or life opportunities (ex: housing, employment, credit, etc.) without further assessments and additional debiasing techniques.
  • High-Risk Scenarios: Developers should assess the suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes providing advice in sensitive or expert domains where accuracy and reliability are critical (ex: legal or health advice). Additional safeguards should be implemented at the application level according to the deployment context.
  • Misinformation: Models may produce inaccurate information. Developers should follow transparency best practices and inform end-users they are interacting with an AI system. At the application level, developers can build feedback mechanisms and pipelines to ground responses in use-case specific, contextual information, a technique known as Retrieval Augmented Generation (RAG).
  • Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case.
  • Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns here.