Deployment

1. What are the hardware and software prerequisites for self-host NIM?

Since NIMs are categorized by model family and a per model basis, please refer to specific NIM documentation based upon your use case.

2. What is an NGC API key?

As you interact with the NGC repository from a terminal session, you must use an API key to pull locked container images. The API key is unique to you and tied to your account. Here is how to generate a NGC API Key or Personal Key.

📒NOTE: Personal keys allow you to configure an expiration date, revoke or delete the key using an action button, and rotate the key as needed. For more information about key types, please refer the NGC User Guide.

❗️Important: Keep your key secret and in a safe place. Do not share it or store it in a place where others can see or copy it.

3. What is an NGC CLI?

NVIDIA NGC CLI is a command-line interface tool for managing Docker containers in the NVIDIA NGC Registry. With NGC CLI, you can perform the same operations that are available from the NGC website, such as viewing Docker repositories.

4. How will the deployment of NIMs be optimized based on hardware?

Each NIM is its own Docker container with a model, such as meta/llama3-8b-instruct.These containers include the runtime capable of running the model on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. The NIM automatically downloads the model from NGC, leveraging a local system cache if available. Each NIM is built from a common base, so once a NIM has been downloaded, downloading additional NIMs is extremely fast.

When a NIM is first deployed, NIM inspects the local hardware configuration, and the available model versions available in the model registry, and automatically chooses the best version of the model for the available hardware. For a subset of NVIDIA GPUs (see Support Matrix), NIM downloads the optimized TRT engine and runs an inference using the TRT-LLM library. For all other NVIDIA GPUs, NIM downloads a non-optimized model and runs it using the vLLM library.

5. Can I deploy the LLM NIM on CSPs?

Yes. NIM delivers AI foundation models as performance optimized inference microservices portable across public clouds and datacenters, accelerating flexible generative AI deployment and time to value.

Developers need access to a broad range of foundation models and deployment options to meet specific enterprise application needs. In addition, preparing and maintaining models for performance optimized inference across a variety of infrastructure is complex and time consuming, hindering the path to production and scale. NIM provides access to a broad range of foundation models, available immediately as production-ready, performance optimized, portable inference microservices. Whether integrated with cloud AI developer tools or Infrastructure-as-a-Service, NIM microservices accelerate the path to flexible, performance optimized generative AI application deployment for customers, supporting a long tail of accelerated inference workloads in public clouds.

Check out our NIM-Deploy Github for additional information.

6. Can I self-host NIM?

Yes. NIM offers you the ability to build and deploy generative AI in preferred datacenter infrastructure with flexibility to leverage the cloud. NIM delivers AI foundation models as accelerated inference microservices portable across datacenter, workstation, and cloud, accelerating flexible generative AI development, deployment and time to value.

OEM systems with NIM gives IT leaders access to a broad range of AI foundation models and production-ready, portable inference microservices, delivered and maintained as enterprise-grade software through NVIDIA AI Enterprise. Once deployed in the datacenter, NIM microservices are easily accessible to application developers through industry standard APIs. Customers can avoid the costly overhead of model training, optimization for inference and software maintenance, speeding time to deployment and value for accelerated generative AI workloads in the datacenter, with the flexibility to leverage cloud.

7. If I do not currently have GPU infrastructure to self-host NIM, how can I get started?

NVIDIA LaunchPad provides free access to enterprise NVIDIA hardware and software through an internet browser. Users can experience the power of AI with end-to-end solutions through guided hands-on labs or as a development sandbox. Test, prototype, and deploy your own applications and models against the latest and greatest that NVIDIA has to offer.

8. Given that many AI microservices are relatively new, how much consistency is there in the APIs?

OpenAPI has been the industry standard; it has set the standard for Embedding, but for the rest of the microservices, NVIDIA has set the standard. NVIDIA has invested in much research and thoughtful customization to align to industry standards. Every model has different parameters, and NVIDIA is making sure to adapt to those standards.

9. Does NIM support API_KEY authentication, like OpenAI?

No. Generally speaking, this would be a service mesh. A service mesh addresses the challenges of distributed microservices architecture through the establishment of rules which allow/deny access to any part of a microservice. The Helm chart which NVIDIA provides is compatible with a service mesh but does not install one.

10. How will the deployment of NIM be optimized based on hardware?

The system automatically chooses the best backend based on the hardware present, aiming to maximize performance and ease of use.

11. Are there any tips for deploying LLM NIM in CSPs?

For Tensor Parallelism to work, your VMI must support P2P communication. This usually means adding a file which outlines the VMIs specific topology. Without this, deployments of a model where tensor parallelism is not set to 1 will fail. An example for AKS can be found on Github.

12. In terms of Operations, does the container expose an OpenTelemetry/Prometheus port/ endpoint for monitoring?

Yes, metrics are served on, by default. There are also examples on how to use the metrics with an example dashboard for Prometheus here.

13. If I customize a model, how can I get it packaged as a NIM?

Upon customizing a NIM supported model using a NIM supported fine-tuning method, you do not need to package the model as a NIM. NIM supports LoRA PEFT adapters trained by the NeMo framework and Hugging Face Transformers libraries for use with NIM supported models. Parameter-Efficient Fine-Tuning (PEFT) methods enable
efficient adaptation of large pretrained models to new tasks. When submitting inference requests to the NIM, the server supports dynamic multi-LoRA inference, enabling simultaneous inference requests with different LoRA models. More information can be found here.

14. How can I deploy a NIM on Kubernetes?

NVIDIA maintains a repository of reference implementations for Kubernetes clustershere. In this repository you will find resources such as documentation, Helm charts and Operators. You will also find a reference implementation for serverless AI inferences powered by Kserve.