Proposal: putting libcuda.so in our Gentoo Prefix installation. #79

bartoldeman opened this issue Jul 6, 2021 · 0 comments

Some important info about the CUDA software stack, and how it could change (subject to testing):

As most of you know, we have three layers:

  1. the kernel modules (`/lib/modules/$(uname -r)/extra/nvidia.ko.xz` and related)
  2. the user-mode driver component used to run CUDA applications (`/usr/lib64/nvidia/libcuda.so`)
  3. the CUDA toolkit (from `module load cuda`)
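
As a quick orientation, here is one way to check which version of each layer a node is running (a sketch; exact paths vary by distro):

```bash
# 1. Kernel-module version
cat /proc/driver/nvidia/version

# 2. User-mode driver: the real file libcuda.so.1 resolves to encodes its version
readlink -f /usr/lib64/nvidia/libcuda.so.1   # e.g. .../libcuda.so.460.32.03

# 3. CUDA toolkit version (after `module load cuda`)
nvcc --version
```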

So far we have assumed that 1 and 2 are tightly coupled. But an NVIDIA employee on the EasyBuild Slack clarified that they are not: libcuda.so.1 is forward compatible, and the newest libcuda (465.x) is compatible with kernel drivers going all the way back to 418.40.04+.

Note that there are in fact four maintained driver families: the long-term support ones (R418, EOL March 2022; R450, EOL July 2023) and the short-term ones (R460, EOL January 2022; and R465). Béluga and Graham are running an R460 version; Cedar is at R455, which is no longer supported.
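
To check which family a given cluster is on, the major component of the driver version is enough (a sketch; assumes `nvidia-smi` is available on the node):

```bash
# e.g. 460.32.03 -> family R460
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
```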

So this means we could put the newest libcuda in CVMFS, and the sysadmins would only need to worry about the kernel modules. This will of course need to be tested (which we can do via LD_LIBRARY_PATH and/or the cvmfs-dev repo).
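
A minimal sketch of such a test, with a purely hypothetical path for the CVMFS-shipped libcuda (the real location would be wherever we decide to install it) and `./some_cuda_app` standing in for any CUDA binary:

```bash
# Hypothetical install location of the CVMFS-shipped libcuda, for illustration.
CVMFS_LIBCUDA_DIR=/cvmfs/soft.computecanada.ca/custom/nvidia/lib64

# Make the dynamic loader prefer the CVMFS copy over the system one, then run
# any CUDA binary; LD_DEBUG=libs shows which libcuda actually gets loaded.
export LD_LIBRARY_PATH="$CVMFS_LIBCUDA_DIR:$LD_LIBRARY_PATH"
LD_DEBUG=libs ./some_cuda_app 2>&1 | grep libcuda
```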

Once libcuda is in place, all CUDA toolkit modules, including 11.3, can then be used on all clusters irrespective of the kernel driver (as long as it is >= 418.40.04), and the present Lmod check could become obsolete.
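
For context, that check boils down to comparing the installed driver version against the toolkit's minimum; a rough shell equivalent of the logic (a sketch, not the actual Lua code):

```bash
required=418.40.04    # minimum kernel driver for the forward-compatible libcuda
have=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)

# sort -V compares version strings component by component
if [ "$(printf '%s\n' "$required" "$have" | sort -V | head -1)" = "$required" ]; then
    echo "driver $have is new enough (>= $required)"
else
    echo "driver $have is too old (need >= $required)"
fi
```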

As for kernel modules, clusters could consider staying with an R450 version, since with libcuda in CVMFS the kernel driver no longer needs to be upgraded to R460 to stay compatible with newer CUDA toolkit versions.

See the driver lifecycle documentation:
https://docs.nvidia.com/datacenter/tesla/drivers/#lifecycle
and the CUDA compatibility platform documentation:
https://docs.nvidia.com/deploy/cuda-compatibility/index.html#cuda-compatibility-platform

wato-github-automation bot pushed a commit to WATonomous/watcloud-website that referenced this issue Mar 17, 2024
We can use it to run software like CUDA without having to install it ourselves.

Main doc: https://docs.alliancecan.ca/wiki/Accessing_CVMFS
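
Typical usage once `/cvmfs` is mounted, following that doc (the `source` line is as documented there; `module load cuda` assumes a CUDA module exists in the stack):

```bash
# Initialize the Compute Canada software environment from CVMFS
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh
module load cuda
```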

### Notes on the Compute Canada CUDA library requirements

`/usr/lib64/nvidia/libcuda.so` and `/usr/lib64/nvidia/libcuda.so.1` must
exist; otherwise we get the error `CUDA driver version is insufficient for
CUDA runtime version`.

https://docs.alliancecan.ca/wiki/Accessing_CVMFS#CUDA_location

https://github.com/ComputeCanada/software-stack-config/blob/a5557c946ca25e2ca41b74716557eb9f5ab5e9c1/lmod/SitePackage.lua#L203-L219

Related issues:
ComputeCanada/software-stack#58
ComputeCanada/software-stack#79

This works in Ubuntu 22.04 on CUDA 12.2 (driver version `535.161.07`):

```bash
# Create the directory the Compute Canada stack expects and symlink in the
# distro-provided libcuda (run as root).
mkdir -p /usr/lib64/nvidia
cd /usr/lib64/nvidia
ln -s ../../lib/x86_64-linux-gnu/libcuda.so .
ln -s ../../lib/x86_64-linux-gnu/libcuda.so.1 .
```
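
To verify the links resolve to the real driver library (a sketch):

```bash
# Each name should print the versioned driver file, e.g. .../libcuda.so.535.161.07
readlink -f /usr/lib64/nvidia/libcuda.so /usr/lib64/nvidia/libcuda.so.1
```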

We can test this by compiling and running the `vectorAdd` program from
cuda-samples:
https://github.com/NVIDIA/cuda-samples/tree/3559ca4d088e12db33d6918621cab5c998ccecf1/Samples/0_Introduction/vectorAdd
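
Roughly, assuming the Makefile-based build that cuda-samples used at that commit and a CUDA toolkit (`nvcc`) on the PATH:

```bash
git clone https://github.com/NVIDIA/cuda-samples
cd cuda-samples/Samples/0_Introduction/vectorAdd
make           # builds vectorAdd with nvcc
./vectorAdd
```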

Here's a diff to print out driver and runtime versions:
```diff
diff --git a/Samples/0_Introduction/vectorAdd/vectorAdd.cu b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
index 284b0f0e..3b22df2b 100644
--- a/Samples/0_Introduction/vectorAdd/vectorAdd.cu
+++ b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
@@ -64,6 +64,30 @@ int main(void) {
   // Print the vector length to be used, and compute its size
   int numElements = 50000;
   size_t size = numElements * sizeof(float);
+
+  int driverVersion = 0, runtimeVersion = 0;
+
+
+  cudaError_t error;
+
+  // Get CUDA Driver Version
+  error = cudaDriverGetVersion(&driverVersion);
+  printf("cudaDriverGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaDriverGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Driver Version: %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
+  }
+
+  // Get CUDA Runtime Version
+  error = cudaRuntimeGetVersion(&runtimeVersion);
+  printf("cudaRuntimeGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaRuntimeGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
+  }
+
   printf("[Vector addition of %d elements]\n", numElements);

   // Allocate the host input vector A
```

When the `/usr/lib64/nvidia/libcuda.so{,.1}` files don't exist, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 0.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
```

When everything works properly, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.2
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

This works on older driver versions as well, because CUDA guarantees
minor-version compatibility: a runtime from a newer toolkit works on an older
driver as long as the major version matches. Here the driver reports 12.0
while the runtime is 12.2:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Binaries compiled in this way require `/cvmfs` and `/usr/lib64/nvidia`
to be available at run time:
```bash
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace nvidia/cuda:12.0.0-runtime-ubuntu22.04 /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Actually, those are the only paths (other than a matching base OS image)
required for it to work:
```bash
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace ubuntu /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Note that `/usr/lib64/nvidia/libcuda.so{,.1}` is a runtime dependency
and not a build-time dependency.
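
Concretely: the toolkit ships a stub `libcuda.so` for link time, while the
real `libcuda.so.1` comes from the driver and only has to be resolvable when
the binary runs. A sketch of the distinction (assuming `CUDA_HOME` points at
the toolkit, as the modules typically set it):

```bash
# Link time: nvcc can resolve -lcuda against the stub shipped with the toolkit
ls "$CUDA_HOME"/lib64/stubs/libcuda.so

# Run time: execution needs the real driver library at the expected location
ls /usr/lib64/nvidia/libcuda.so.1
```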