
CUDA location requirements and documentation #58

Open
cmd-ntrf opened this issue Feb 23, 2021 · 3 comments
Labels: discussion needed, documentation, enhancement

@cmd-ntrf (Member)

For CUDA-enabled software packages, our software environment relies on the driver libraries being installed in /usr/lib64/nvidia, as documented on the Compute Canada wiki.

The minor versions of the symbolic links we create in /usr/lib64/nvidia no longer appear to be needed by the software we install. The removal of libnvidia-fatbinaryloader.so.$version in recent CUDA releases is what made us question the need for the minor versions.

On the presence of minor versions, @mboisson had this to say:

We do rely on /usr/lib64/nvidia/libcuda.so.$cuda_version to figure out which version of the CUDA drivers is installed, to hide or show specific versions of the CUDA modules. But this is home-made code and could be adjusted, i.e. the function get_installed_cuda_driver_version at line 178 of /cvmfs/soft.computecanada.ca/config/lmod/SitePackage.lua.
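For illustration only, the idea behind that detection could be sketched in shell as follows; the real code is Lua, and the exact filename pattern and parsing it uses may differ:

```bash
# Illustrative sketch (not the actual SitePackage.lua logic): read whatever
# version string the fully-versioned libcuda symlink carries in its name.
get_installed_cuda_driver_version() {
    local lib
    lib=$(ls /usr/lib64/nvidia/libcuda.so.*.* 2>/dev/null | head -n 1)
    # e.g. /usr/lib64/nvidia/libcuda.so.535.161.07 -> 535.161.07
    [ -n "$lib" ] && echo "${lib##*libcuda.so.}"
}
```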

And the following suggestion:

If the minor versions of the shared objects are no longer needed, we could just have a directory of symlinks distributed on cvmfs that would include only the major libraries

Some serious testing is needed, but it would simplify the documentation: the current script we provide on the Compute Canada wiki is out of date, and keeping it up to date is a tedious, ever-moving target.
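As a rough illustration of that suggestion (the library list and target directory below are examples, not a vetted set), such a directory would contain only the major, SONAME-level links:

```bash
# Illustrative sketch: a directory holding only major (SONAME) symlinks,
# pointing at the host's default driver location. Library names are examples.
mkdir -p nvidia-major-links
for lib in libcuda.so.1 libnvidia-ml.so.1 libnvidia-ptxjitcompiler.so.1; do
    ln -sf "/usr/lib64/$lib" "nvidia-major-links/$lib"
    ln -sf "/usr/lib64/$lib" "nvidia-major-links/${lib%.1}"  # unversioned .so link as well
done
```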

@cmd-ntrf added the documentation and enhancement labels on Feb 23, 2021
@mboisson (Member)

As a counter-argument to the symlink strategy, colleagues in the EESSI project noted that the location of the shared objects differs across distributions. They could be in:

  • /usr/lib64 (on CentOS7-8)
  • /usr/lib64/nvidia (on older CentOS)
  • /usr/lib/nvidia-xxx (on Ubuntu)

The symlink strategy would therefore need to include some extra layers, and might not work at all on Ubuntu (assuming those xxx are driver versions).
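A small sketch of how one might probe for the right location across those layouts (the directory list mirrors the one above and is not exhaustive):

```bash
# Probe the candidate locations mentioned above to find where the host's
# driver actually keeps libcuda.so.1.
for dir in /usr/lib64 /usr/lib64/nvidia /usr/lib/nvidia-*; do
    if [ -e "$dir/libcuda.so.1" ]; then
        echo "driver libraries found in: $dir"
        break
    fi
done
```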

@mboisson added the discussion needed label on Feb 23, 2021
@mboisson (Member)

From our meeting on 2021-03-12:

  • We could have a mix of symlinks (for libraries in /usr/lib64) and LD_LIBRARY_PATH (for when the driver is installed in a different folder); see the sketch after this list.
  • Need to document the minimum driver version now that the dependency on libnvidia-fatbinaryloader.so.$version has been dropped.
  • Need to test the OpenGL/EGL libraries.
  • It might change the library search order (need to check for name clashes).
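A minimal sketch of what that mixed approach could look like, assuming a `driver_dir` variable set by a probe like the one in the previous comment (everything here is hypothetical, pending the testing mentioned above):

```bash
# Hypothetical mixed approach: rely on symlinks / default paths where the
# driver is in a standard location, and fall back to LD_LIBRARY_PATH elsewhere.
if [ -e /usr/lib64/libcuda.so.1 ] || [ -e /usr/lib64/nvidia/libcuda.so.1 ]; then
    :  # standard locations: the symlink directory or default search path covers it
elif [ -n "$driver_dir" ]; then
    # Non-standard layout (e.g. Ubuntu's /usr/lib/nvidia-*): prepend the directory.
    # Prepending changes the library search order, which is why name clashes with
    # stack-provided libraries need to be checked.
    export LD_LIBRARY_PATH="$driver_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
fi
```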

@mboisson (Member)

mboisson commented Jul 21, 2021

Issue #79 (once tested and implemented) provides a solution for this one.

wato-github-automation bot pushed a commit to WATonomous/watcloud-website that referenced this issue Mar 17, 2024
We can use it to run software like CUDA without having to install the
software.

Main doc: https://docs.alliancecan.ca/wiki/Accessing_CVMFS

### Notes on the Compute Canada CUDA library requirements

`/usr/lib64/nvidia/libcuda.so` and `/usr/lib64/nvidia/libcuda.so.1` must exist. Otherwise we get the error `CUDA driver version is insufficient for CUDA runtime version`.

https://docs.alliancecan.ca/wiki/Accessing_CVMFS#CUDA_location

https://github.com/ComputeCanada/software-stack-config/blob/a5557c946ca25e2ca41b74716557eb9f5ab5e9c1/lmod/SitePackage.lua#L203-L219

Related issues:
ComputeCanada/software-stack#58
ComputeCanada/software-stack#79

This works on Ubuntu 22.04 with CUDA 12.2 (driver version `535.161.07`):

```bash
mkdir /usr/lib64/nvidia
cd /usr/lib64/nvidia
ln -s ../../lib/x86_64-linux-gnu/libcuda.so .
ln -s ../../lib/x86_64-linux-gnu/libcuda.so.1 .
```

We can test this by compiling and running the `vectorAdd` program in cuda-samples:
https://github.com/NVIDIA/cuda-samples/tree/3559ca4d088e12db33d6918621cab5c998ccecf1/Samples/0_Introduction/vectorAdd
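For reference, a straightforward way to build and run the sample, assuming `nvcc` is on the `PATH` (for example via a CUDA module from the stack); the include path points at the samples' shared headers:

```bash
# Build and run vectorAdd from a checkout of cuda-samples.
cd cuda-samples/Samples/0_Introduction/vectorAdd
nvcc -I../../../Common vectorAdd.cu -o vectorAdd   # the provided Makefile works too
./vectorAdd
```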

Here's a diff to print out driver and runtime versions:
```
diff --git a/Samples/0_Introduction/vectorAdd/vectorAdd.cu b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
index 284b0f0e..3b22df2b 100644
--- a/Samples/0_Introduction/vectorAdd/vectorAdd.cu
+++ b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
@@ -64,6 +64,30 @@ int main(void) {
   // Print the vector length to be used, and compute its size
   int numElements = 50000;
   size_t size = numElements * sizeof(float);
+
+  int driverVersion = 0, runtimeVersion = 0;
+
+
+  cudaError_t error;
+
+  // Get CUDA Driver Version
+  error = cudaDriverGetVersion(&driverVersion);
+  printf("cudaDriverGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaDriverGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Driver Version: %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
+  }
+
+  // Get CUDA Runtime Version
+  error = cudaRuntimeGetVersion(&runtimeVersion);
+  printf("cudaRuntimeGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaRuntimeGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
+  }
+
   printf("[Vector addition of %d elements]\n", numElements);

   // Allocate the host input vector A
```

When the `/usr/lib64/nvidia/libcuda.so{,.1}` files don't exist, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 0.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
```

When everything works properly, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.2
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

This also works with older driver versions, because CUDA offers minor version compatibility within the same major release:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Binaries compiled in this way require `/cvmfs` and `/usr/lib64/nvidia`
to work:
```
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace nvidia/cuda:12.0.0-runtime-ubuntu22.04 /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Actually, those are the only paths (other than a matching base OS image)
required for it to work:
```
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace ubuntu /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Note that `/usr/lib64/nvidia/libcuda.so{,.1}` are a runtime dependency, not a build-time dependency.
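A quick way to see the distinction, on a host where the symlinks are absent: the build still succeeds, and only running the binary fails with the "insufficient driver" error shown earlier.

```bash
# With /usr/lib64/nvidia/libcuda.so{,.1} missing, from the vectorAdd directory:
nvcc -I../../../Common vectorAdd.cu -o vectorAdd   # build-time: succeeds without the symlinks
./vectorAdd                                        # run-time: fails until the symlinks exist
```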