1. I’m seeing an error in the startup logs that says there is a connection error. How can I resolve this?

If you see this in the logs on startup, it is produced by a continual health check that prints this message repeatedly until the model has loaded completely. It does not indicate a failure: depending on the model, loading can take anywhere from about 10 seconds to 20-30 minutes. Monitor startup progress via the health and liveness endpoints, as sketched below.
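
A minimal polling sketch using only the Python standard library. It assumes the service listens on localhost:8000 and exposes a readiness endpoint at /v1/health/ready; adjust the host, port, and path to match your deployment.

```python
import time
import urllib.error
import urllib.request

# Assumed endpoint; adjust host, port, and path for your deployment.
READY_URL = "http://localhost:8000/v1/health/ready"

def wait_until_ready(timeout_s: int = 1800, poll_s: int = 10) -> bool:
    """Poll the readiness endpoint until the model finishes loading."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(READY_URL, timeout=5) as resp:
                if resp.status == 200:
                    print("Model is loaded and ready.")
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Service not up yet (or still loading); keep waiting.
        print("Model still loading; retrying...")
        time.sleep(poll_s)
    return False

if __name__ == "__main__":
    if not wait_until_ready():
        raise SystemExit("Model did not become ready within the timeout.")
```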

2. What is the compatibility between the LLM NIM container and NVIDIA’s pre-built trt-llm engines?

The models are built for a specific combination of GPU count and hardware target. The versioning must also align on the YY.MM scheme. For example, if the container version is 24.04, the model name on NGC MUST also be a 24.04 variant. This is because the trt-llm version within the NIM container is not consistent from one release to the next: the LLM NIM container cherry-picks commits from trt-llm to deliver the best features as soon as possible. A small illustration of the alignment check appears below.
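
As an illustration only, a sketch of checking YY.MM alignment between a container tag and a model name. The name formats used here are hypothetical examples, not an official naming contract.

```python
import re

# YY.MM pattern, e.g. "24.04". The name formats below are
# hypothetical examples, not an official naming scheme.
VERSION_RE = re.compile(r"\d{2}\.\d{2}")

def extract_yy_mm(name: str) -> str:
    """Pull the first YY.MM token out of a container tag or model name."""
    match = VERSION_RE.search(name)
    if match is None:
        raise ValueError(f"No YY.MM version found in {name!r}")
    return match.group(0)

def versions_aligned(container_tag: str, model_name: str) -> bool:
    """The container and the model must carry the same YY.MM version."""
    return extract_yy_mm(container_tag) == extract_yy_mm(model_name)

# Hypothetical names for illustration:
assert versions_aligned("nim-llm:24.04", "llama2-70b-trt-llm-24.04")
assert not versions_aligned("nim-llm:24.04", "llama2-70b-trt-llm-24.02")
```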

3. Can I use a model generated for a prior container release in the latest container?

No. This is not expected to ever be supported; there is no forward or backward compatibility between releases. Only ever use models and containers that share the same “YY.MM” version. This is a requirement, not a suggestion.

4. Can I run more than one LLM within a single container?

No. This is an anti-pattern: the target deployment is one LLM per pod/container. Additionally, the GPU RAM needed by the LLM itself is only a portion of the total memory consumed; about 90% of the remaining GPU RAM is allocated to the KV cache to improve inference performance. A worked example of this accounting follows.
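
For intuition, a back-of-the-envelope sketch of that memory split. All figures are hypothetical round numbers, not measurements.

```python
# Illustrative memory accounting for one LLM in one container.
# All figures below are hypothetical round numbers, not measurements.

total_gpu_gb = 80.0        # e.g. a single 80 GB GPU (assumed)
model_gb = 40.0            # GPU RAM consumed by the LLM itself (assumed)
kv_cache_fraction = 0.90   # ~90% of the remainder goes to the KV cache

remaining_gb = total_gpu_gb - model_gb
kv_cache_gb = kv_cache_fraction * remaining_gb   # 0.90 * 40 = 36 GB

print(f"KV cache allocation: {kv_cache_gb:.0f} GB")
print(f"Headroom left over:  {remaining_gb - kv_cache_gb:.0f} GB")
# A second LLM in the same container would contend with this
# pre-allocated KV cache, which is why one LLM per container is the rule.
```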

5. When running a model with Tensor Parallelism > 1, I receive a “P2P is disabled between connected GPUs 1 and 0” error. How do I resolve it?

Some cloud providers disable GPU P2P over PCIe for security reasons. To re-enable it, you may need to ask your cloud provider. Alternatively, you can try setting the environment variable NCCL_P2P_DISABLE=1, which should force the GPUs to communicate over shared memory instead of P2P. A sketch of passing this variable into the container follows.
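
A minimal sketch, assuming you launch the container with the Docker SDK for Python; the image name is a placeholder for your actual LLM NIM image. The equivalent CLI invocation would use docker run --gpus all -e NCCL_P2P_DISABLE=1.

```python
import docker  # Docker SDK for Python (pip install docker)

client = docker.from_env()

# "nim-llm:24.04" is a placeholder image name; substitute your actual
# LLM NIM image. The key part is passing NCCL_P2P_DISABLE=1 so NCCL
# falls back to shared-memory transport instead of attempting P2P.
container = client.containers.run(
    "nim-llm:24.04",
    detach=True,
    environment={"NCCL_P2P_DISABLE": "1"},
    device_requests=[
        docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
    ],
)
print(container.id)
```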