Proposal: putting libcuda.so in our Gentoo Prefix installation #79
wato-github-automation bot pushed a commit to WATonomous/watcloud-website that referenced this issue on Mar 17, 2024. The commit message follows:
We can use it to run software like CUDA without having to install the software. Main doc: https://docs.alliancecan.ca/wiki/Accessing_CVMFS

### Notes on the Compute Canada CUDA library requirements

`/usr/lib64/nvidia/libcuda.so` and `/usr/lib64/nvidia/libcuda.so.1` must exist. Otherwise we get the error `CUDA driver version is insufficient for CUDA runtime version`.

- https://docs.alliancecan.ca/wiki/Accessing_CVMFS#CUDA_location
- https://github.com/ComputeCanada/software-stack-config/blob/a5557c946ca25e2ca41b74716557eb9f5ab5e9c1/lmod/SitePackage.lua#L203-L219

Related issues: ComputeCanada/software-stack#58, ComputeCanada/software-stack#79

This works on Ubuntu 22.04 with CUDA 12.2 (driver version `535.161.07`):

```bash
mkdir /usr/lib64/nvidia
cd /usr/lib64/nvidia
ln -s ../../lib/x86_64-linux-gnu/libcuda.so .
ln -s ../../lib/x86_64-linux-gnu/libcuda.so.1 .
```

We can test this by compiling and running the `vectorAdd` program in cuda-samples: https://github.com/NVIDIA/cuda-samples/tree/3559ca4d088e12db33d6918621cab5c998ccecf1/Samples/0_Introduction/vectorAdd

Here's a diff to print out the driver and runtime versions:

```
diff --git a/Samples/0_Introduction/vectorAdd/vectorAdd.cu b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
index 284b0f0e..3b22df2b 100644
--- a/Samples/0_Introduction/vectorAdd/vectorAdd.cu
+++ b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
@@ -64,6 +64,30 @@ int main(void) {
   // Print the vector length to be used, and compute its size
   int numElements = 50000;
   size_t size = numElements * sizeof(float);
+
+  int driverVersion = 0, runtimeVersion = 0;
+
+
+  cudaError_t error;
+
+  // Get CUDA Driver Version
+  error = cudaDriverGetVersion(&driverVersion);
+  printf("cudaDriverGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+    printf("cudaDriverGetVersion error: %d\n", error);
+  } else {
+    printf("CUDA Driver Version: %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
+  }
+
+  // Get CUDA Runtime Version
+  error = cudaRuntimeGetVersion(&runtimeVersion);
+  printf("cudaRuntimeGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+    printf("cudaRuntimeGetVersion error: %d\n", error);
+  } else {
+    printf("CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
+  }
+
   printf("[Vector addition of %d elements]\n", numElements);
 
   // Allocate the host input vector A
```

When the `/usr/lib64/nvidia/libcuda.so{,.1}` files don't exist, we get:

```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 0.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
```

When everything works properly, we get:

```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.2
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

This also works on older driver versions, because CUDA is forward compatible as long as the major version is the same.
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Binaries compiled this way require `/cvmfs` and `/usr/lib64/nvidia` to work:

```
docker run --rm -it --gpus all \
  -v /cvmfs:/cvmfs:ro \
  -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro \
  -v /home/ben/Projects/cuda-samples:/workspace \
  nvidia/cuda:12.0.0-runtime-ubuntu22.04 \
  /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Actually, those are the only paths (other than a matching base OS image) required for it to work:

```
docker run --rm -it --gpus all \
  -v /cvmfs:/cvmfs:ro \
  -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro \
  -v /home/ben/Projects/cuda-samples:/workspace \
  ubuntu \
  /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Note that `/usr/lib64/nvidia/libcuda.so{,.1}` is a runtime dependency, not a build-time dependency.
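As a follow-up, here is a minimal sketch (not part of the commit above) of how the symlink setup could be made idempotent for node provisioning. The driver library directory `/usr/lib/x86_64-linux-gnu` is the Ubuntu 22.04 location used above and is an assumption; it will differ on other distributions.

```bash
#!/usr/bin/env bash
# Sketch: idempotently create the /usr/lib64/nvidia/libcuda.so{,.1} symlinks
# that the Compute Canada CVMFS stack expects.
# DRIVER_LIB_DIR assumes Ubuntu 22.04; adjust for other distributions.
set -euo pipefail

DRIVER_LIB_DIR=/usr/lib/x86_64-linux-gnu
TARGET_DIR=/usr/lib64/nvidia

mkdir -p "$TARGET_DIR"
for lib in libcuda.so libcuda.so.1; do
  if [ -e "$DRIVER_LIB_DIR/$lib" ]; then
    # -f/-n: replace any existing symlink so re-runs are safe
    ln -sfn "$DRIVER_LIB_DIR/$lib" "$TARGET_DIR/$lib"
  else
    echo "warning: $DRIVER_LIB_DIR/$lib not found; is the NVIDIA driver installed?" >&2
  fi
done
```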
Some important info about the CUDA software stack, and how it could change (subject to testing):
As most of you know, we have 3 layers:

1. the kernel driver (`/lib/modules/$(uname -r)/extra/nvidia.ko.xz` and related)
2. the user-space driver library (`/usr/lib64/nvidia/libcuda.so`)
3. the CUDA toolkit (`module load cuda`)

Up to now we assumed that 1 & 2 are tightly coupled. But an NVIDIA employee in the EasyBuild Slack clarified that they are not: `libcuda.so.1` is forward compatible, and the newest libcuda (465.x) is compatible with kernel drivers going all the way back to 418.40.04+.

Note that there are in fact four maintained driver families: the long-term support ones (R418, EOL Mar 2022; R450, EOL Jul 2023) and the short-term ones (R460, EOL Jan 2022; and R465). Béluga and Graham are running an R460 version; Cedar is at R455, which is no longer supported.
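A quick way to see that the layers can sit at different versions on a given node is to query each one separately. This is a sketch, assuming the standard `/proc/driver/nvidia` interface, the `/usr/lib64/nvidia` path from the list above, and an Lmod-based stack; exact output formats vary between driver releases.

```bash
# Layer 1: kernel driver version, as reported by the loaded module
cat /proc/driver/nvidia/version

# Layer 2: user-space driver library version, read off the real file name
# behind the libcuda.so.1 symlink (e.g. libcuda.so.465.19.01)
readlink -f /usr/lib64/nvidia/libcuda.so.1

# Layer 3: CUDA toolkit versions offered by the software stack (Lmod)
module avail cuda
```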
So this means that we could put the newest libcuda in CVMFS and the sysadmins would only need to worry about the kernel modules. This will of course need to be tested (which we can do via `LD_LIBRARY_PATH` and/or the cvmfs-dev repo). Once libcuda is in place, all CUDA toolkit modules, including 11.3, can be used on all clusters irrespective of the kernel driver (as long as it is >= R418.40.04), and the present Lmod check could become obsolete.
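For the testing step, a minimal sketch of the `LD_LIBRARY_PATH` approach could look like the following. The CVMFS directory used here is a placeholder, since where a candidate libcuda would actually live in the repo is exactly what is being proposed; the `cuda/11.3` module name and the `vectorAdd` test program are carried over from the notes above.

```bash
# Placeholder location for a libcuda shipped via CVMFS -- the real path is
# not decided yet; this only illustrates the LD_LIBRARY_PATH test.
CANDIDATE_LIBCUDA_DIR=/cvmfs/soft.computecanada.ca/path/to/nvidia-compat

module load cuda/11.3

# Run a small CUDA program with the candidate libcuda ahead of the system
# one; a correct driver version in the output means it was picked up.
LD_LIBRARY_PATH="$CANDIDATE_LIBCUDA_DIR:${LD_LIBRARY_PATH:-}" ./vectorAdd
```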
As for the kernel modules, clusters could consider staying with an R450 version, since with libcuda in CVMFS it no longer needs to be upgraded to R460 to stay compatible with newer CUDA toolkit versions.
See this: https://docs.nvidia.com/datacenter/tesla/drivers/#lifecycle and this: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-compatibility-platform