Deployment FAQ

General Deployment FAQ

What are the hardware and software requirements for hosting a NIM?

Hardware and software requirements vary depending on the NIM used. Please refer to the NIM documentation for the specific NIM you want to use to view this information.

How will the deployment of NIM be optimized based on hardware?

The system automatically chooses the best backend based on the hardware present, aiming to maximize performance and ease of use.

Each NIM is its own Docker container which contains a model as well as the runtime capable of running the model on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. The NIM automatically downloads the model from NGC, leveraging a local system cache if available. Each NIM is built from a common base, so once a NIM has been downloaded, downloading additional NIMs is extremely fast.

When a NIM is first deployed, NIM inspects the local hardware configuration and the model versions available in the model registry, then automatically chooses the best version of the model for the available hardware. For a subset of NVIDIA GPUs (see the Support Matrix), NIM downloads the optimized TRT engine and runs inference using the TRT-LLM library. For all other NVIDIA GPUs, NIM downloads a non-optimized model and runs it using the vLLM library.
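
For illustration, a minimal launch of a NIM container might look like the following sketch. The image name, tag, port, and cache path are assumptions; consult the documentation for the specific NIM you are deploying. At startup, the container inspects the visible GPUs and picks the TRT-LLM or vLLM backend as described above.

```bash
# Hypothetical example: launch an LLM NIM and let it select the best backend for
# the local GPUs. Image name, tag, port, and cache path are assumptions.
export NGC_API_KEY="<your NGC API key>"

docker run -it --rm \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```

The local cache volume lets subsequent starts reuse the downloaded model instead of fetching it from NGC again.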

Can I run more than one model within a single container?

No. The recommended deployment is one model per pod/container. Note also that the GPU memory needed by the model weights is only a portion of the total memory consumed: by default, about 90% of the remaining GPU memory is allocated to the KV cache to help improve inference performance.
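
In practice this means one container per model. As a hedged sketch (image names, tags, and host ports are placeholders), serving two models on one host looks like two containers, each pinned to its own GPU:

```bash
# Illustrative only: one model per container, each pinned to a single GPU.
# Image names, tags, and host ports are placeholders; NGC_API_KEY is assumed
# to be exported in the current shell.
docker run -d --gpus device=0 -e NGC_API_KEY -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest

docker run -d --gpus device=1 -e NGC_API_KEY -p 8001:8000 \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```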

NGC & API Key questions

What is an NGC API key?

Whether you are interacting with NIM through the Hosted API Endpoints, or pulling NIM from the NVIDIA NGC™ catalog to host locally or in cloud infrastructure, you will need an NGC API Key.

The API key is unique to you and tied to your account. It allows you to interact with NIM through the API and to pull NIM microservices to another system. Here is how to generate an NGC API Key or Personal Key.

📒NOTE: Personal keys allow you to configure an expiration date, revoke or delete the key using an action button, and rotate the key as needed. For more information about key types, please refer to the NGC User Guide.

❗️Important: Keep your key secret and in a safe place. Do not share it or store it in a place where others can see or copy it.
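
As an illustration, once you have generated a key, a common pattern is to keep it in an environment variable and use it to authenticate with the NGC container registry (nvcr.io) before pulling a NIM; the exact steps are described in the NGC documentation.

```bash
# Shown for illustration only; store the key securely (e.g., in a secrets manager)
# rather than hard-coding it in scripts.
export NGC_API_KEY="<paste your key here>"

# Log in to the NGC container registry. The username is literally "$oauthtoken";
# the API key is used as the password.
echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```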

What is the NGC CLI?

NVIDIA NGC CLI is a command-line interface tool for managing Docker containers in the NVIDIA NGC Registry. With the NGC CLI installed, you can easily pull containers and NIM microservices directly from the NGC registry.
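
As a rough sketch (command arguments and the image path are examples; see the NGC CLI documentation for the exact syntax), a typical workflow is to configure the CLI with your API key and then browse or pull content:

```bash
# Configure the CLI with your NGC API key, org, and team (interactive prompts).
ngc config set

# List container images matching a pattern (example filter).
ngc registry image list "nim/*"

# Pull a specific NIM container image (image path and tag are placeholders).
ngc registry image pull nim/meta/llama3-8b-instruct:latest
```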

Does NIM support API_KEY authentication, like OpenAI?

No. Authentication of this kind is generally handled by a service mesh. A service mesh addresses the challenges of a distributed microservices architecture by establishing rules that allow or deny access to any part of a microservice. The Helm chart that NVIDIA provides is compatible with a service mesh but does not install one.

Does NIM expose API endpoints for monitoring and metrics?

Yes, metrics are exposed on a dedicated endpoint by default. There are also examples of how to use the metrics, along with an example Prometheus dashboard, here.
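
As an illustration only, the endpoint can be scraped with curl and returns Prometheus-format text. The port and path below are assumptions and vary between NIMs, so check the documentation for the NIM you are running.

```bash
# Assumed port and path: many NIM microservices expose Prometheus metrics on the
# service port under a /v1/metrics or /metrics path. Verify against your NIM's docs.
curl -s http://localhost:8000/v1/metrics | head -n 20
```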

If I customize a model, how can I get it packaged as a NIM?

If you customize a NIM-supported model using a supported fine-tuning method, you do not need to package the model as a NIM. NIM supports LoRA PEFT adapters trained with the NeMo framework and Hugging Face Transformers libraries for use with NIM-supported models. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to new tasks. When submitting inference requests to the NIM, the server supports dynamic multi-LoRA inference, enabling simultaneous inference requests with different LoRA models. More information can be found here.
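
As a hedged example of dynamic multi-LoRA inference, the adapter is selected per request by passing its name in the model field of the OpenAI-compatible API. The adapter name, host, and port below are placeholders.

```bash
# Placeholder adapter name, host, and port. Each request names either the base
# model or one of the loaded LoRA adapters in the "model" field.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3-8b-instruct-my-lora",
        "prompt": "Summarize the plot of Hamlet in one sentence.",
        "max_tokens": 64
      }'
```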

Deploying NIM Locally

Can I self-host NIM?

Yes. NIM gives you the ability to build and deploy generative AI on your preferred data center infrastructure, with the flexibility to leverage the cloud.

Once deployed in the data center, NIM microservices are easily accessible to application developers through industry standard APIs.
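
For example, an LLM NIM exposes an OpenAI-compatible API, so once the container is running an application can call it over plain HTTP. The host, port, and model name below are assumptions for a local deployment.

```bash
# Assumes a NIM running locally on port 8000 and serving the model named below;
# adjust the host, port, and model name for your deployment.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user", "content": "What is a NIM?"}],
        "max_tokens": 64
      }'
```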

If I do not currently have GPU infrastructure to self-host NIM, how can I get started?

NVIDIA LaunchPad provides free access to enterprise NVIDIA hardware and software through an internet browser. Users can experience the power of AI with end-to-end solutions through guided hands-on labs or as a development sandbox. Test, prototype, and deploy your own applications and models against the latest and greatest that NVIDIA has to offer.

See the Run NIM Anywhere page for full details on ways to interact with NIM.

Deploying NIM on CSPs

Can I deploy the NIM on CSPs?

Yes. NIM delivers AI foundation models as performance optimized inference microservices portable across public clouds and data centers, accelerating flexible generative AI deployment and time to value.

Check out our NIM-Deploy GitHub repository for additional information.

How does NIM work in CSP environments including CSP MLOps platforms and CSP managed Kubernetes solutions?

NIM is containerized and is deployable out of the box on CSP managed Kubernetes solutions like Azure Kubernetes Service (AKS), Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS). A reference Helm chart is available here. NVIDIA NIM is also available in some CSP model gardens. For deploying NIM on CSP MLOps services like Amazon SageMaker, Azure AI Foundry or Vertex AI Studio, an additional ‘shim’ is needed on top of NIM. NVIDIA is continuously collaborating with these partners on NIM integration.
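
As a sketch of the managed Kubernetes path (namespace, secret names, chart path, and values file are assumptions; follow the reference Helm chart's README for the exact parameters it expects), a typical install creates the NGC secrets and then runs helm install:

```bash
# Assumed namespace, secret names, and chart location (a local checkout of the
# reference chart from the NIM-Deploy repository).
kubectl create namespace nim

# Image pull secret for nvcr.io; the username is literally "$oauthtoken".
kubectl create secret docker-registry ngc-registry-secret \
  --namespace nim \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Generic secret so the pod can download the model from NGC at startup.
kubectl create secret generic ngc-api-secret \
  --namespace nim \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"

# Install the reference chart with your values file.
helm install my-nim ./nim-llm --namespace nim -f my-values.yaml
```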

Are there any tips for deploying LLM NIM in CSPs?

For tensor parallelism to work, your VMI must support P2P (peer-to-peer) communication between GPUs. This usually requires you to add a file that outlines the VMI's specific topology. Without this file, deployments of a model where tensor parallelism is not set to 1 will fail. An example for AKS can be found on GitHub.

How can I deploy a NIM on Kubernetes?

NVIDIA maintains a repository of reference implementations for Kubernetes clusters here. In this repository you will find resources such as documentation, Helm charts and Operators. You will also find a reference implementation for Serverless AI inference powered by KServe.