CUDA location requirements and documentation #58
Comments
As a counterargument to the symlink strategy, colleagues in the EESSI project noted that the location of the shared objects differs between distributions. They could be in:
The symlink strategy would therefore need to include some layers, and might not work at all on Ubuntu (assuming those
From our meeting on 20210312:
Issue #79 (once tested and implemented) provides a solution for this one.
We can use CVMFS to run software like CUDA without having to install it.

Main doc: https://docs.alliancecan.ca/wiki/Accessing_CVMFS

### Notes on the Compute Canada CUDA library requirements

`/usr/lib64/nvidia/libcuda.so` and `/usr/lib64/nvidia/libcuda.so.1` must exist. Otherwise we get the error `CUDA driver version is insufficient for CUDA runtime version`.

- https://docs.alliancecan.ca/wiki/Accessing_CVMFS#CUDA_location
- https://github.com/ComputeCanada/software-stack-config/blob/a5557c946ca25e2ca41b74716557eb9f5ab5e9c1/lmod/SitePackage.lua#L203-L219

Related issues:

- ComputeCanada/software-stack#58
- ComputeCanada/software-stack#79

This works on Ubuntu 22.04 with CUDA 12.2 (driver version `535.161.07`):

```bash
mkdir /usr/lib64/nvidia
cd /usr/lib64/nvidia
ln -s ../../lib/x86_64-linux-gnu/libcuda.so .
ln -s ../../lib/x86_64-linux-gnu/libcuda.so.1 .
```

This can be tested by compiling and running the `vectorAdd` program in cuda-samples: https://github.com/NVIDIA/cuda-samples/tree/3559ca4d088e12db33d6918621cab5c998ccecf1/Samples/0_Introduction/vectorAdd

Here's a diff to print out the driver and runtime versions:

```
diff --git a/Samples/0_Introduction/vectorAdd/vectorAdd.cu b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
index 284b0f0e..3b22df2b 100644
--- a/Samples/0_Introduction/vectorAdd/vectorAdd.cu
+++ b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
@@ -64,6 +64,30 @@ int main(void) {
   // Print the vector length to be used, and compute its size
   int numElements = 50000;
   size_t size = numElements * sizeof(float);
+
+  int driverVersion = 0, runtimeVersion = 0;
+
+
+  cudaError_t error;
+
+  // Get CUDA Driver Version
+  error = cudaDriverGetVersion(&driverVersion);
+  printf("cudaDriverGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+    printf("cudaDriverGetVersion error: %d\n", error);
+  } else {
+    printf("CUDA Driver Version: %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
+  }
+
+  // Get CUDA Runtime Version
+  error = cudaRuntimeGetVersion(&runtimeVersion);
+  printf("cudaRuntimeGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+    printf("cudaRuntimeGetVersion error: %d\n", error);
+  } else {
+    printf("CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
+  }
+
   printf("[Vector addition of %d elements]\n", numElements);
 
   // Allocate the host input vector A
```

When the `/usr/lib64/nvidia/libcuda.so{,.1}` files don't exist, we get:

```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 0.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
```

When everything works properly, we get:

```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.2
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

This also works with older driver versions, because CUDA supports minor-version compatibility: a newer runtime can run on an older driver as long as the major version is the same.

```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Binaries compiled in this way require `/cvmfs` and `/usr/lib64/nvidia` to work:

```
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace nvidia/cuda:12.0.0-runtime-ubuntu22.04 /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Actually, those are the only paths (other than a matching base OS image) required for it to work:

```
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace ubuntu /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Note that `/usr/lib64/nvidia/libcuda.so{,.1}` is a runtime dependency, not a build-time dependency.
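Since the driver libraries live in different directories on different distributions, the symlink step from the notes above could be written without hard-coding the Ubuntu path. The following is a minimal sketch, not a tested procedure. The function names `libcuda_dir_from_ldconfig` and `link_libcuda` are hypothetical, and the sketch assumes that the driver's `libcuda.so.1` is registered in the dynamic linker cache (as listed by `ldconfig -p`) and that `libcuda.so` sits in the same directory.

```shell
# Assumption: helper names are illustrative, not part of any existing tooling.

# Read `ldconfig -p` output on stdin and print the directory that holds the
# first 64-bit libcuda.so.1 entry. Returns non-zero if no entry is found.
libcuda_dir_from_ldconfig() {
  local path
  path=$(awk '$1 == "libcuda.so.1" && /x86-64/ { print $NF; exit }')
  [ -n "$path" ] || return 1
  dirname "$path"
}

# Create libcuda.so and libcuda.so.1 symlinks in directory $2 that point at
# the files of the same name in directory $1.
link_libcuda() {
  local src_dir=$1 dest_dir=$2 name
  mkdir -p "$dest_dir"
  for name in libcuda.so libcuda.so.1; do
    if [ ! -e "$src_dir/$name" ]; then
      echo "missing: $src_dir/$name" >&2
      return 1
    fi
    ln -sfn "$src_dir/$name" "$dest_dir/$name"
  done
}

# Example use (needs root and an installed driver):
#   src=$(ldconfig -p | libcuda_dir_from_ldconfig) && link_libcuda "$src" /usr/lib64/nvidia
```

Unlike the commands in the notes, this creates absolute rather than relative symlinks, which keeps the helper independent of where the source directory is. Whether this covers every distribution raised in the discussion has not been verified.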
For CUDA-enabled software packages, our software environment relies on having driver libraries installed in the path `/usr/lib64/nvidia`, as documented on the Compute Canada Wiki.

The minor versions of the symbolic links we create in `/usr/lib64/nvidia` appear to no longer be needed by the software we install. The removal of `libnvidia-fatbinaryloader.so.$version` in recent CUDA releases is what made us question the need for minor versions.

On the presence of minor versions, @mboisson had this to say:
And the following suggestion
Some serious testing is needed, but it would simplify the documentation: the script we currently provide on the Compute Canada wiki is out of date, and keeping it up to date is a tedious, ever-moving target.
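As a starting point for that testing, it may help to see which entries in `/usr/lib64/nvidia` carry a full version suffix and which use only the soname. The sketch below is an illustration under assumptions: the function names `classify_lib_name` and `audit_lib_dir` are hypothetical, and it assumes that "minor versions" refers to names with more than one version component after `.so` (such as `libcuda.so.535.161.07`), as opposed to soname-style names like `libcuda.so.1`.

```shell
# Assumption: helper names and the three categories are illustrative only.

# Classify a library file name by its version suffix.
classify_lib_name() {
  case $1 in
    *.so)     echo unversioned ;;   # e.g. libcuda.so
    *.so.*.*) echo full-version ;;  # e.g. libcuda.so.535.161.07
    *.so.*)   echo soname ;;        # e.g. libcuda.so.1
    *)        echo other ;;
  esac
}

# Print "<category><TAB><name>" for every entry in a directory
# (defaults to /usr/lib64/nvidia).
audit_lib_dir() {
  local dir=${1:-/usr/lib64/nvidia} entry
  for entry in "$dir"/*; do
    # Skip the unexpanded glob when the directory is empty, but keep
    # dangling symlinks so they show up in the listing.
    [ -e "$entry" ] || [ -L "$entry" ] || continue
    printf '%s\t%s\n' "$(classify_lib_name "${entry##*/}")" "${entry##*/}"
  done
}
```

Running `audit_lib_dir` before and after removing the `full-version` links, then re-running a CUDA test program, would be one way to check whether those links are still needed.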