Product

1. What is NVIDIA NIM?

NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed to accelerate deployment of generative AI across your enterprise. These prebuilt containers support a broad spectrum of AI models—from open-source community models to NVIDIA AI Foundation models, as well as custom AI models. NIM microservices are deployed with a single command for easy integration into enterprise-grade AI applications using standard APIs and just a few lines of code. Built on robust foundations including inference engines like Triton Inference Server, TensorRT, TensorRT-LLM, and PyTorch, NIM is engineered to facilitate seamless AI inferencing at scale, ensuring that you can deploy AI applications anywhere with confidence. Whether on-premises or in the cloud, NIM is the fastest way to achieve accelerated generative AI inference at scale.
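
For example, once an LLM NIM is running locally, a chat completion takes only a few lines of Python against its OpenAI-compatible API. The sketch below is illustrative: the port (8000) and the served model name are assumptions based on a typical meta/llama3-8b-instruct deployment.

  # Minimal sketch: call a locally running LLM NIM through its OpenAI-compatible API.
  # Assumes the container serves meta/llama3-8b-instruct on localhost:8000.
  from openai import OpenAI

  client = OpenAI(
      base_url="http://localhost:8000/v1",  # local NIM endpoint (assumed port)
      api_key="not-used",                   # placeholder; a self-hosted NIM typically does not validate this
  )

  response = client.chat.completions.create(
      model="meta/llama3-8b-instruct",
      messages=[{"role": "user", "content": "Summarize what NVIDIA NIM is in one sentence."}],
      max_tokens=128,
  )
  print(response.choices[0].message.content)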

2. What are the benefits of NIM?

  • Maintain security and control of generative AI applications and data with
    self-hosted deployment of the latest AI models in your choice of infrastructure,
    on-premises or in the cloud.
  • Speed time to market with prebuilt, cloud-native microservices that are
    continuously maintained to deliver optimized inference on NVIDIA accelerated
    infrastructure.
  • Empower enterprise developers with industry-standard APIs and tools tailored for
    enterprise environments.
  • Improve TCO with low-latency, high-throughput AI inference that scales with the
    cloud.
  • Achieve best accuracy with support for pre-tuned models out of the box.
  • Leverage enterprise-grade software with dedicated feature branches, rigorous
    validation processes, and support including direct access to NVIDIA AI experts and
    defined service-level agreements.

3. How do I get started with NVIDIA NIM?

To get started, users can experience accelerated generative AI models in NVIDIA’s API catalog. Here they can interact with the latest NVIDIA AI Foundation Models through a browser and build POCs with model APIs. After prototyping is complete, users often want to transition AI models to their own compute environment to mitigate the risk of data and IP leakage and to fine-tune the models. Models from NVIDIA’s API catalog can be downloaded for self-hosting with NVIDIA NIM, included with NVIDIA AI Enterprise, giving enterprise developers ownership of their customizations, infrastructure choices, and full control of their IP and AI application.

4. Are self-hosted NIMs only compatible with accelerated infrastructure (GPUs)?

Yes, NIM is designed to run on CUDA infrastructure that is an NVIDIA-Certified System.

5. What if I do not currently have an NVIDIA-Certified System? Is there another way to try out NIM?

If you do not have available GPU infrastructure, check out NVIDIA LaunchPad. Additional documentation on NVIDIA-Certified Systems can be found here.

6. How do I get started evaluating and deploying AI models?

You can get started by visiting build.nvidia.com, where you can discover the latest AI models and learn about NVIDIA NIM. Then, you can explore and interact with more AI models through the browser, or sign up for free credits to access NVIDIA-hosted endpoints for application prototyping in the NVIDIA API catalog. To deploy AI models on your preferred NVIDIA accelerated infrastructure, you will be prompted to sign up for an NVIDIA AI Enterprise 90-day evaluation license when you interact with downloadable models at ai.nvidia.com or in the API catalog.

7. I signed up at build.nvidia.com and now have credits for API calls. How do I use them?

API call credits are not deducted when interacting with models on build.nvidia.com through the browser. Remote API calls to NVIDIA-hosted endpoints count against trial API credits.
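
As a hedged sketch of what a credit-consuming remote call looks like, the example below posts a chat completion to an NVIDIA-hosted endpoint; the base URL, model name, and NVIDIA_API_KEY environment variable are assumptions to be replaced with the values shown in the API catalog.

  # Minimal sketch: a remote API call to an NVIDIA-hosted endpoint, which draws on trial credits.
  # The base URL and model name are assumptions; use the values shown in the API catalog.
  import os
  import requests

  api_key = os.environ["NVIDIA_API_KEY"]  # key generated at build.nvidia.com

  resp = requests.post(
      "https://integrate.api.nvidia.com/v1/chat/completions",
      headers={"Authorization": f"Bearer {api_key}"},
      json={
          "model": "meta/llama3-8b-instruct",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 64,
      },
      timeout=60,
  )
  resp.raise_for_status()
  print(resp.json()["choices"][0]["message"]["content"])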

8. What is contained within a NIM?

Each NIM is its own Docker container with a model, such as meta/llama3-8b-instruct, and
the runtime capable of running the model on any NVIDIA GPU.

NIM containers include:

  • Optimized AI models
  • APIs conforming to domain-specific industry standards
  • Optimized inference engines
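
For LLM NIMs, whose APIs follow the OpenAI convention, a quick way to see what a running container serves is to query its model list. A minimal sketch, assuming the container is reachable on the default port of 8000:

  # Minimal sketch: list the model(s) served by a running NIM container.
  # Assumes the container's API is reachable on localhost:8000.
  import requests

  models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
  for entry in models.get("data", []):
      print(entry["id"])  # e.g. meta/llama3-8b-instruct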

9. What is the value of using NIM microservices?

NVIDIA NIM is NVIDIA’s opinionated approach to building enterprise LLM applications: it packages optimized models, inference engines, and industry-standard APIs into prebuilt, maintained microservices so teams can move from prototype to production faster.

10. What is the pricing model for NIM?

NIM is available through a license of NVIDIA AI Enterprise for $4500 per GPU per year or $1 per GPU per hour in the cloud. Pricing is based on the number of GPUs, not the number of NIMs.

11. Regarding the licensing, is it flat or does it change based on the size of the GPU?

Pricing is part of an NVIDIA AI Enterprise pricing structure, not a license per NIM.

12. What does NVIDIA support in regard to NIM through NVIDIA AI Enterprise?

NVIDIA AI Enterprise supports the optimized inference engine and runtime of the container. It does not support what is generated by the models or the models themselves. There is too much variance in terms of the sources of the model or data incorporated through RAG for NVIDIA to assume responsibility.

13. How does NIM work in CSP environments, including CSP MLOps platforms (SageMaker, Azure AI Studio, Vertex AI) and CSP managed Kubernetes solutions (EKS, AKS, and GKE)?

NIM is containerized and is deployable out of the box on CSP managed Kubernetes solutions like AKS, GKE, and EKS. A reference Helm chart is available here. For deploying NIM on CSP MLOps platforms like SageMaker, Azure AI Studio, and Vertex AI, an additional ‘shim’ is needed on top of NIM. NVIDIA is continuing to collaborate with these partners on NIM integration.

Deployment Questions

1. What are the hardware and software prerequisites for self-hosting NIM?

Since NIMs are categorized by model family and on a per-model basis, please refer to the specific NIM documentation for your use case.

2. What is an NGC API key?

As you interact with the NGC repository from a terminal session, you must use an API key to pull locked container images. The API key is unique to you and tied to your account. Here is how to generate an NGC API Key or Personal Key.

📒NOTE: Personal keys allow you to configure an expiration date, revoke or delete the key using an action button, and rotate the key as needed. For more information about key types, please refer to the NGC User Guide.

❗️Important: Keep your key secret and in a safe place. Do not share it or store it in a place where others can see or copy it.
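
One common pattern, sketched below under the assumption that the key is exported as an environment variable named NGC_API_KEY, is to read it from the environment rather than hard-coding it so it never lands in source control.

  # Minimal sketch: read the NGC API key from the environment instead of hard-coding it.
  # NGC_API_KEY is an assumed variable name; follow whatever convention your team uses.
  import os

  ngc_api_key = os.environ.get("NGC_API_KEY")
  if not ngc_api_key:
      raise RuntimeError("NGC_API_KEY is not set; generate a key in NGC and export it first.")
  # The key can now be passed to tools that authenticate against NGC (e.g. docker login nvcr.io).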

3. What is an NGC CLI?

NVIDIA NGC CLI is a command-line interface tool for managing Docker containers in the NVIDIA NGC Registry. With NGC CLI, you can perform the same operations that are available from the NGC website, such as viewing Docker repositories.

4. How will the deployment of NIMs be optimized based on hardware?

Each NIM is its own Docker container with a model, such as meta/llama3-8b-instruct. These containers include the runtime capable of running the model on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. The NIM automatically downloads the model from NGC, leveraging a local system cache if available. Each NIM is built from a common base, so once a NIM has been downloaded, downloading additional NIMs is extremely fast.

When a NIM is first deployed, it inspects the local hardware configuration and the model versions available in the model registry, then automatically chooses the best version of the model for the available hardware. For a subset of NVIDIA GPUs (see the Support Matrix), NIM downloads the optimized TRT engine and runs inference using the TRT-LLM library. For all other NVIDIA GPUs, NIM downloads a non-optimized model and runs it using the vLLM library.
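
Before deploying, it can help to list the local GPUs and their memory and compare them against the Support Matrix. A minimal sketch using the nvidia-ml-py (pynvml) bindings, which are separate from NIM itself:

  # Minimal sketch: list local GPUs and memory to compare against the NIM Support Matrix.
  # Uses the nvidia-ml-py (pynvml) bindings; these are not part of NIM.
  import pynvml

  pynvml.nvmlInit()
  try:
      for i in range(pynvml.nvmlDeviceGetCount()):
          handle = pynvml.nvmlDeviceGetHandleByIndex(i)
          name = pynvml.nvmlDeviceGetName(handle)
          if isinstance(name, bytes):  # older bindings return bytes
              name = name.decode()
          mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
          print(f"GPU {i}: {name}, {mem.total / 1024**3:.0f} GiB total memory")
  finally:
      pynvml.nvmlShutdown()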

5. Can I deploy the LLM NIM on CSPs?

Yes. NIM delivers AI foundation models as performance optimized inference microservices portable across public clouds and datacenters, accelerating flexible generative AI deployment and time to value.

Developers need access to a broad range of foundation models and deployment options to meet specific enterprise application needs. In addition, preparing and maintaining models for performance optimized inference across a variety of infrastructure is complex and time consuming, hindering the path to production and scale. NIM provides access to a broad range of foundation models, available immediately as production-ready, performance optimized, portable inference microservices. Whether integrated with cloud AI developer tools or Infrastructure-as-a-Service, NIM microservices accelerate the path to flexible, performance optimized generative AI application deployment for customers, supporting a long tail of accelerated inference workloads in public clouds.

Check out our NIM-Deploy GitHub repository for additional information.

6. Can I self-host NIM?

Yes. NIM offers you the ability to build and deploy generative AI in preferred datacenter infrastructure with flexibility to leverage the cloud. NIM delivers AI foundation models as accelerated inference microservices portable across datacenter, workstation, and cloud, accelerating flexible generative AI development, deployment and time to value.

OEM systems with NIM give IT leaders access to a broad range of AI foundation models and production-ready, portable inference microservices, delivered and maintained as enterprise-grade software through NVIDIA AI Enterprise. Once deployed in the datacenter, NIM microservices are easily accessible to application developers through industry standard APIs. Customers can avoid the costly overhead of model training, optimization for inference, and software maintenance, speeding time to deployment and value for accelerated generative AI workloads in the datacenter, with the flexibility to leverage the cloud.

7. If I do not currently have GPU infrastructure to self-host NIM, how can I get started?

NVIDIA LaunchPad provides free access to enterprise NVIDIA hardware and software through an internet browser. Users can experience the power of AI with end-to-end solutions through guided hands-on labs or as a development sandbox. Test, prototype, and deploy your own applications and models against the latest and greatest that NVIDIA has to offer.

8. Given that many AI microservices are relatively new, how much consistency is there in the APIs?

OpenAPI has been the industry standard and has set the standard for embeddings; for the rest of the microservices, NVIDIA has set the standard. NVIDIA has invested significant research and thoughtful customization to align with industry standards. Every model has different parameters, and NVIDIA makes sure to adapt to those standards.

9. Does NIM support API_KEY authentication, like OpenAI?

No. Generally speaking, this is the role of a service mesh. A service mesh addresses the challenges of a distributed microservices architecture by establishing rules that allow or deny access to any part of a microservice. The Helm chart that NVIDIA provides is compatible with a service mesh but does not install one.

10. How will the deployment of NIM be optimized based on hardware?

The system automatically chooses the best backend based on the hardware present, aiming to maximize performance and ease of use.

11. Are there any tips for deploying LLM NIM in CSPs?

For Tensor Parallelism to work, your VMI must support P2P communication. This usually means adding a file that outlines the VMI’s specific topology. Without this, deployments of a model where tensor parallelism is not set to 1 will fail. An example for AKS can be found on GitHub.
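
To check whether P2P is actually available between GPU pairs before deploying with tensor parallelism greater than 1, a quick PyTorch-based sketch can help (PyTorch is assumed to be installed; the check is independent of NIM):

  # Minimal sketch: verify GPU peer-to-peer (P2P) access before using tensor parallelism > 1.
  import torch

  n = torch.cuda.device_count()
  for i in range(n):
      for j in range(n):
          if i != j:
              ok = torch.cuda.can_device_access_peer(i, j)
              print(f"GPU {i} -> GPU {j}: P2P {'available' if ok else 'NOT available'}")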

12. In terms of operations, does the container expose an OpenTelemetry/Prometheus port/endpoint for monitoring?

Yes, metrics are served on a Prometheus-compatible endpoint by default. There are also examples of how to use the metrics, along with an example Prometheus dashboard, here.
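
As a hedged sketch, the metrics can also be fetched directly to confirm they are being served; the port and path below are assumptions and may differ per NIM and deployment, so check the documentation for the exact endpoint.

  # Minimal sketch: fetch Prometheus-format metrics from a running NIM container.
  # The port and path are assumptions; confirm the exact endpoint in the NIM documentation.
  import requests

  text = requests.get("http://localhost:8000/v1/metrics", timeout=10).text
  for line in text.splitlines():
      if not line.startswith("#"):
          print(line)  # metric_name{labels} value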

13. If I customize a model, how can I get it packaged as a NIM?

Upon customizing a NIM-supported model using a NIM-supported fine-tuning method, you do not need to package the model as a NIM. NIM supports LoRA PEFT adapters trained by the NeMo framework and Hugging Face Transformers libraries for use with NIM-supported models. Parameter-Efficient Fine-Tuning (PEFT) methods enable efficient adaptation of large pretrained models to new tasks. When submitting inference requests to the NIM, the server supports dynamic multi-LoRA inference, enabling simultaneous inference requests with different LoRA models. More information can be found here.
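
One common pattern is to select a LoRA adapter through the model field of an OpenAI-style request, as in the sketch below; the adapter name llama3-8b-my-adapter is hypothetical and the port is an assumption, so confirm the exact mechanism in the NIM documentation linked above.

  # Minimal sketch: target a specific LoRA adapter in an inference request.
  # "llama3-8b-my-adapter" is a hypothetical adapter name; the port is an assumption.
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

  response = client.chat.completions.create(
      model="llama3-8b-my-adapter",  # the adapter is selected via the model field (assumed)
      messages=[{"role": "user", "content": "Answer using the fine-tuned behavior."}],
      max_tokens=64,
  )
  print(response.choices[0].message.content)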

14. How can I deploy a NIM on Kubernetes?

NVIDIA maintains a repository of reference implementations for Kubernetes clusters here. In this repository you will find resources such as documentation, Helm charts, and Operators. You will also find a reference implementation for serverless AI inference powered by KServe.

Debugging Questions

1. I’m seeing an error in the startup logs that says there is a connection error. How can I resolve this?

If you see this in the logs on startup, it is the product of a continual health check that will keep printing this message until the model has loaded completely. This is not an error: depending on the model, loading can take anywhere from about 10 seconds to 20-30 minutes. You should monitor startup via the health and liveness endpoints.
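
A minimal polling sketch follows; the port and the /v1/health/ready path are assumptions based on a typical LLM NIM deployment, so confirm them in the documentation for your specific NIM.

  # Minimal sketch: poll a NIM's readiness endpoint until the model has finished loading.
  # The port and path are assumptions; confirm them in your NIM's documentation.
  import time
  import requests

  url = "http://localhost:8000/v1/health/ready"
  while True:
      try:
          if requests.get(url, timeout=5).status_code == 200:
              print("NIM is ready to serve requests.")
              break
      except requests.ConnectionError:
          pass  # the server is still starting up
      print("Waiting for the model to finish loading...")
      time.sleep(10)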

2. What is the compatibility between the LLM NIM container and NVIDIA’s pre-built TRT-LLM engines?

The models are built for a specific combination of GPU count and hardware target. The versioning must also align on YY.MM: for example, if the container version is 24.04, the model on NGC MUST also be a 24.04 variant. This is because the TRT-LLM version within a NIM container is not consistent across releases, as the LLM NIM container cherry-picks commits from TRT-LLM to deliver the best features as soon as possible.

3. Can I use a model generated for a prior container release in the latest container?

No. This is never expected to be supported as there will be no forward or backward compatibility. You should only ever use models and containers which have the same “YY.MM” versioning. This is a requirement, not a suggestion.

4. Can I run more than one LLM within a single container?

No, this is an anti-pattern, as the target deployment recommends one LLM per pod/container. Additionally, the GPU RAM needed by the LLM is only a portion of the total consumed memory. About 90% of the remaining RAM is allocated to the KV cache to help improve inference performance.

5. When running a model with Tensor Parallelism > 1, I receive a “P2P is disabled between connected GPUs 1 and 0” error. How do I resolve it?

Some cloud providers may disable GPU P2P over PCIe for security reasons. To enable it, you may need to ask your cloud provider. You may also try setting the environment variable NCCL_P2P_DISABLE=1, which should force the GPUs to communicate over shared memory instead of P2P.


What’s Next