
CUDA location requirements and documentation #58

Open
cmd-ntrf opened this issue Feb 23, 2021 · 3 comments
Labels: discussion needed, documentation, enhancement

@cmd-ntrf (Member)

For CUDA-enabled software packages, our software environment relies on the driver libraries being installed in /usr/lib64/nvidia, as documented on the Compute Canada wiki.

The minor versions of the symbolic links we create in /usr/lib64/nvidia no longer appear to be needed by the software we install. The removal of libnvidia-fatbinaryloader.so.$version in recent CUDA releases is what made us question the need for the minor versions.

On the presence of minor versions, @mboisson had this to say:

We do rely on /usr/lib64/nvidia/libcuda.so.$cuda_version to figure out which version of the CUDA drivers is installed, to hide or show specific versions of the CUDA modules. But this is home-made code and could be adjusted, i.e. the function get_installed_cuda_driver_version at line 178 of /cvmfs/soft.computecanada.ca/config/lmod/SitePackage.lua.
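For illustration only, the idea behind that detection could be sketched in shell as follows; the real code is Lua, and the exact filename pattern and parsing it uses may differ:

```bash
# Illustrative sketch (not the actual SitePackage.lua logic): read whatever
# version string the fully-versioned libcuda symlink carries in its name.
get_installed_cuda_driver_version() {
    local lib
    lib=$(ls /usr/lib64/nvidia/libcuda.so.*.* 2>/dev/null | head -n 1)
    # e.g. /usr/lib64/nvidia/libcuda.so.535.161.07 -> 535.161.07
    [ -n "$lib" ] && echo "${lib##*libcuda.so.}"
}
```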

And the following suggestion:

If the minor versions of the shared objects are no longer needed, we could just have a directory of symlinks distributed on cvmfs that would include only the major libraries

Some serious testing is needed, but it would simplify the documentation: the current script we provide on the Compute Canada wiki is out of date, and keeping it up to date is a tedious, ever-moving target.
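As a rough illustration of that suggestion (the library list and target directory below are examples, not a vetted set), such a directory would contain only the major, SONAME-level links:

```bash
# Illustrative sketch: a directory holding only major (SONAME) symlinks,
# pointing at the host's default driver location. Library names are examples.
mkdir -p nvidia-major-links
for lib in libcuda.so.1 libnvidia-ml.so.1 libnvidia-ptxjitcompiler.so.1; do
    ln -sf "/usr/lib64/$lib" "nvidia-major-links/$lib"
    ln -sf "/usr/lib64/$lib" "nvidia-major-links/${lib%.1}"  # unversioned .so link as well
done
```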

@cmd-ntrf added the documentation and enhancement labels on Feb 23, 2021
@mboisson (Member)

As a counter-argument to the symlink strategy, colleagues in the EESSI project noted that the location of the shared objects differs across distributions. They could be in:

  • /usr/lib64 (on CentOS7-8)
  • /usr/lib64/nvidia (on older CentOS)
  • /usr/lib/nvidia-xxx (on Ubuntu)

The symlink strategy would therefore need to include some extra layers, and might not work at all on Ubuntu (assuming those xxx are driver versions).
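A small sketch of how one might probe for the right location across those layouts (the directory list mirrors the one above and is not exhaustive):

```bash
# Probe the candidate locations mentioned above to find where the host's
# driver actually keeps libcuda.so.1.
for dir in /usr/lib64 /usr/lib64/nvidia /usr/lib/nvidia-*; do
    if [ -e "$dir/libcuda.so.1" ]; then
        echo "driver libraries found in: $dir"
        break
    fi
done
```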

@mboisson added the discussion needed label on Feb 23, 2021
@mboisson (Member)

From our meeting on 2021-03-12:

  • We could have a mix of symlinks (for libraries in /usr/lib64) and LD_LIBRARY_PATH (for when the driver is installed in a different folder); see the sketch after this list.
  • Need to document the minimum driver version now that the dependency on libnvidia-fatbinaryloader.so.$version has been dropped.
  • Need to test the OpenGL/EGL libraries.
  • It might change the library search order (need to check for name clashes).
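A minimal sketch of what that mixed approach could look like, assuming a `driver_dir` variable set by a probe like the one in the previous comment (everything here is hypothetical, pending the testing mentioned above):

```bash
# Hypothetical mixed approach: rely on symlinks / default paths where the
# driver is in a standard location, and fall back to LD_LIBRARY_PATH elsewhere.
if [ -e /usr/lib64/libcuda.so.1 ] || [ -e /usr/lib64/nvidia/libcuda.so.1 ]; then
    :  # standard locations: the symlink directory or default search path covers it
elif [ -n "$driver_dir" ]; then
    # Non-standard layout (e.g. Ubuntu's /usr/lib/nvidia-*): prepend the directory.
    # Prepending changes the library search order, which is why name clashes with
    # stack-provided libraries need to be checked.
    export LD_LIBRARY_PATH="$driver_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
```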

@mboisson (Member)

mboisson commented Jul 21, 2021

Issue #79 (once tested and implemented) provides a solution for this one.

wato-github-automation bot pushed a commit to WATonomous/watcloud-website that referenced this issue Mar 17, 2024
We can use it to run software like CUDA without having to install the
software.

Main doc: https://docs.alliancecan.ca/wiki/Accessing_CVMFS

### Notes on the Compute Canada CUDA library requirements

`/usr/lib64/nvidia/libcuda.so` and `/usr/lib64/nvidia/libcuda.so.1` must exist. Otherwise we get the error `CUDA driver version is insufficient for CUDA runtime version`.

https://docs.alliancecan.ca/wiki/Accessing_CVMFS#CUDA_location

https://github.com/ComputeCanada/software-stack-config/blob/a5557c946ca25e2ca41b74716557eb9f5ab5e9c1/lmod/SitePackage.lua#L203-L219

Related issues:
ComputeCanada/software-stack#58
ComputeCanada/software-stack#79

This works on Ubuntu 22.04 with CUDA 12.2 (driver version `535.161.07`):

```bash
mkdir /usr/lib64/nvidia
cd /usr/lib64/nvidia
ln -s ../../lib/x86_64-linux-gnu/libcuda.so .
ln -s ../../lib/x86_64-linux-gnu/libcuda.so.1 .
```

We can test this by compiling and running the `vectorAdd` program in cuda-samples:
https://github.com/NVIDIA/cuda-samples/tree/3559ca4d088e12db33d6918621cab5c998ccecf1/Samples/0_Introduction/vectorAdd
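For reference, a straightforward way to build and run the sample, assuming `nvcc` is on the `PATH` (for example via a CUDA module from the stack); the include path points at the samples' shared headers:

```bash
# Build and run vectorAdd from a checkout of cuda-samples.
cd cuda-samples/Samples/0_Introduction/vectorAdd
nvcc -I../../../Common vectorAdd.cu -o vectorAdd   # the provided Makefile works too
./vectorAdd
```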

Here's a diff to print out driver and runtime versions:
```
diff --git a/Samples/0_Introduction/vectorAdd/vectorAdd.cu b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
index 284b0f0e..3b22df2b 100644
--- a/Samples/0_Introduction/vectorAdd/vectorAdd.cu
+++ b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
@@ -64,6 +64,30 @@ int main(void) {
   // Print the vector length to be used, and compute its size
   int numElements = 50000;
   size_t size = numElements * sizeof(float);
+
+  int driverVersion = 0, runtimeVersion = 0;
+
+
+  cudaError_t error;
+
+  // Get CUDA Driver Version
+  error = cudaDriverGetVersion(&driverVersion);
+  printf("cudaDriverGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaDriverGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Driver Version: %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
+  }
+
+  // Get CUDA Runtime Version
+  error = cudaRuntimeGetVersion(&runtimeVersion);
+  printf("cudaRuntimeGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaRuntimeGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
+  }
+
   printf("[Vector addition of %d elements]\n", numElements);

   // Allocate the host input vector A
```

When the `/usr/lib64/nvidia/libcuda.so{,.1}` files don't exist, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 0.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
```

When everything works properly, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.2
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

This also works with older driver versions, because CUDA offers minor version compatibility within the same major release:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Binaries compiled in this way require `/cvmfs` and `/usr/lib64/nvidia`
to work:
```
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace nvidia/cuda:12.0.0-runtime-ubuntu22.04 /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Actually, those are the only paths (other than a matching base OS image)
required for it to work:
```
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace ubuntu /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Note that `/usr/lib64/nvidia/libcuda.so{,.1}` are a runtime dependency, not a build-time dependency.
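A quick way to see the distinction, on a host where the symlinks are absent: the build still succeeds, and only running the binary fails with the "insufficient driver" error shown earlier.

```bash
# With /usr/lib64/nvidia/libcuda.so{,.1} missing, from the vectorAdd directory:
nvcc -I../../../Common vectorAdd.cu -o vectorAdd   # build-time: succeeds without the symlinks
./vectorAdd                                        # run-time: fails until the symlinks exist
```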