Debugging FAQ

I’m seeing a connection error in the startup logs. How can I resolve this?

If you’re seeing connection errors in the logs on startup, they come from a continual health check that will keep printing this error until the model has finished loading. Some models take up to 30 minutes to load completely. Monitor startup progress via the health and liveness endpoints.
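A minimal sketch of such monitoring is below, assuming a local deployment; the base URL, port, and the endpoint paths (/v1/health/live and /v1/health/ready) are assumptions and may differ in your setup.

```python
# Poll the health/liveness endpoints until the model reports ready or we time out.
import time

import requests

BASE_URL = "http://localhost:8000"  # assumed local NIM endpoint; adjust as needed


def wait_until_ready(timeout_s: int = 30 * 60, poll_interval_s: int = 15) -> bool:
    """Poll liveness/readiness until the model is loaded or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            live = requests.get(f"{BASE_URL}/v1/health/live", timeout=5)
            ready = requests.get(f"{BASE_URL}/v1/health/ready", timeout=5)
            if live.ok and ready.ok:
                return True  # model fully loaded and serving
        except requests.ConnectionError:
            pass  # server not accepting connections yet; keep waiting
        time.sleep(poll_interval_s)
    return False


if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out waiting for model load")
```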

What is the compatibility between the LLM NIM container and NVIDIA’s pre-built TensorRT-LLM engines?

NIM containers are built for a specific combination of GPU count and hardware target, and the container and model versions must share the same YY.MM release. For example, if the container version is 24.04, the model name on NGC must also be a 24.04 variant. If the versions do not match, the TRT-LLM version inside the NIM container may not be consistent with the engine, because each NIM release cherry-picks commits from TRT-LLM to deliver the best features as soon as possible.
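The following sketch illustrates the YY.MM rule as a simple version check; the tag strings are hypothetical placeholders, not real container or model names.

```python
# Compare the YY.MM release prefix of a container tag and a model tag.
import re


def release_prefix(tag: str) -> str | None:
    """Extract the YY.MM release prefix (e.g. '24.04') from a version tag."""
    match = re.search(r"\b(\d{2}\.\d{2})\b", tag)
    return match.group(1) if match else None


container_tag = "24.04"          # hypothetical NIM container release
model_tag = "my-model_24.04"     # hypothetical NGC model variant

if release_prefix(container_tag) != release_prefix(model_tag):
    raise SystemExit("Container and model releases differ; use matching YY.MM versions.")
print("Container and model releases match.")
```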

Can I use a model generated for a prior container release in the latest container?

No. You must use models and containers that have the same YY.MM version; there is no forward or backward compatibility.

When running a model with Tensor Parallelism > 1, I receive an error relating to the GPU connection. How do I resolve it?

If you're running with Tensor Parallelism > 1 and see the error “P2P is disabled between connected GPUs 1 and 0.”, you may need to enable GPU P2P.

Some cloud providers may disable GPU P2P over PCIe for security reasons.
To enable it, you may need to ask your cloud provider directly. Alternatively, you can set the environment variable NCCL_P2P_DISABLE=1, which forces the GPUs to communicate over shared memory instead of P2P.
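As a diagnostic, the sketch below reports whether peer-to-peer access is possible between each pair of visible GPUs; it assumes PyTorch with CUDA is installed, which is not part of the NIM container requirements described here.

```python
# Report P2P availability between every pair of visible GPUs.
import torch


def report_p2p() -> None:
    count = torch.cuda.device_count()
    if count < 2:
        print("Fewer than two GPUs visible; P2P does not apply.")
        return
    for src in range(count):
        for dst in range(count):
            if src == dst:
                continue
            ok = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU {src} -> GPU {dst}: P2P {'available' if ok else 'disabled'}")


if __name__ == "__main__":
    report_p2p()
```

If P2P shows as disabled and your provider cannot enable it, setting NCCL_P2P_DISABLE=1 in the container environment is the workaround described above.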