Description
We're hitting an issue when running builds and integration tests in containers on Kubernetes using the NVIDIA GPU Operator:
/usr/bin/ld: cannot find -lnvidia-ml: No such file or directory
This error can be reproduced with gcc -lnvidia-ml
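On investigation, this is because libnvidia-container makes the NVIDIA libraries and drivers available in the container but does not create the libnvidia-ml.so -> libnvidia-ml.so.1 symlink. The linker resolves -lnvidia-ml by looking for the unversioned libnvidia-ml.so, which is missing, e.g.: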
shared_ci_bot@runner-nagp1soyw-project-9373-concurrent-0-pyktetzc:/$ ls -la /usr/lib/x86_64-linux-gnu/libnvidia-ml*
lrwxrwxrwx 1 root root 26 Jan 17 12:27 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.550.127.05
-rwxr-xr-x 1 root root 2078360 Jan 16 12:37 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.127.05
Creating the symlink manually resolves the issue.
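For anyone needing the workaround in the meantime, here is a minimal sketch of what we mean by creating the symlink manually. The function name ensure_dev_symlink is hypothetical (it is not part of libnvidia-container), and the library directory is simply the one from the listing above; adjust it for your image. It only adds the unversioned link when the versioned one exists and nothing is already in place.

```shell
#!/bin/sh
# Hypothetical helper (not part of libnvidia-container): create the
# unversioned linker name for a library when only the versioned
# SONAME link is present in the container.
ensure_dev_symlink() {
    libdir=$1               # e.g. /usr/lib/x86_64-linux-gnu
    soname=$2               # e.g. libnvidia-ml.so.1
    linkname=${soname%.*}   # strip the trailing version: libnvidia-ml.so

    # Fail if the versioned library link is not there at all.
    [ -e "$libdir/$soname" ] || return 1

    # Leave any existing file or symlink untouched.
    if [ -e "$libdir/$linkname" ] || [ -L "$libdir/$linkname" ]; then
        return 0
    fi

    # Relative link, matching how libnvidia-ml.so.1 points at its target.
    ln -s "$soname" "$libdir/$linkname"
}
```

In our case the call would be `ensure_dev_symlink /usr/lib/x86_64-linux-gnu libnvidia-ml.so.1`, run as root inside the container before the build step. After that, gcc -lnvidia-ml no longer reports the missing library.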
I see that https://github.com/NVIDIA/libnvidia-container/blob/main/src/nvc_mount.c contains a workaround that creates symlinks for libcuda.so and a few others. Can the same be done for libnvidia-ml.so?
See https://docs.nvidia.com/deploy/pdf/NVML_API_Reference_Guide.pdf, Chapter 1, page 2, which states that the library should be linked this way:
On Linux the NVML library is named "libnvidia-ml.so" and can be found on the standard library path. To link against the NVML library add the -lnvidia-ml flag to your linker command.