Mount CVMFS on SLURM compute nodes (#2477)

We can use it to run software such as CUDA without having to install it on the compute nodes.

Main doc: https://docs.alliancecan.ca/wiki/Accessing_CVMFS

### Notes on the Compute Canada CUDA library requirements

`/usr/lib64/nvidia/libcuda.so` and `/usr/lib64/nvidia/libcuda.so.1` must
exist. Otherwise, we get the error `CUDA driver version is insufficient for
CUDA runtime version`.
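
A quick way to check whether those files are present on a node (a minimal sketch):

```bash
# Check for the symlinks the Compute Canada software stack expects
ls -l /usr/lib64/nvidia/libcuda.so /usr/lib64/nvidia/libcuda.so.1
# Locate the driver library that the distribution actually ships
ldconfig -p | grep libcuda
```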

https://docs.alliancecan.ca/wiki/Accessing_CVMFS#CUDA_location

https://github.com/ComputeCanada/software-stack-config/blob/a5557c946ca25e2ca41b74716557eb9f5ab5e9c1/lmod/SitePackage.lua#L203-L219

Related issues:
ComputeCanada/software-stack#58
ComputeCanada/software-stack#79

This works on Ubuntu 22.04 with CUDA 12.2 (driver version `535.161.07`):

```bash
mkdir /usr/lib64/nvidia
cd /usr/lib64/nvidia
ln -s ../../lib/x86_64-linux-gnu/libcuda.so .
ln -s ../../lib/x86_64-linux-gnu/libcuda.so.1 .
```
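
A quick sanity check that the new links resolve (assuming the Ubuntu 22.04 paths above):

```bash
# -L follows the symlinks, so this fails loudly if a target is missing
ls -lL /usr/lib64/nvidia/libcuda.so /usr/lib64/nvidia/libcuda.so.1
```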

You can test this by compiling and running the `vectorAdd` program in
cuda-samples:
https://github.com/NVIDIA/cuda-samples/tree/3559ca4d088e12db33d6918621cab5c998ccecf1/Samples/0_Introduction/vectorAdd
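
A rough sketch of building and running the sample (assuming the per-sample Makefile layout used by that revision of cuda-samples):

```bash
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/0_Introduction/vectorAdd
make          # uses nvcc from PATH
./vectorAdd
```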

Here's a diff to print out the driver and runtime versions:
```diff
diff --git a/Samples/0_Introduction/vectorAdd/vectorAdd.cu b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
index 284b0f0e..3b22df2b 100644
--- a/Samples/0_Introduction/vectorAdd/vectorAdd.cu
+++ b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
@@ -64,6 +64,30 @@ int main(void) {
   // Print the vector length to be used, and compute its size
   int numElements = 50000;
   size_t size = numElements * sizeof(float);
+
+  int driverVersion = 0, runtimeVersion = 0;
+
+
+  cudaError_t error;
+
+  // Get CUDA Driver Version
+  error = cudaDriverGetVersion(&driverVersion);
+  printf("cudaDriverGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaDriverGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Driver Version: %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
+  }
+
+  // Get CUDA Runtime Version
+  error = cudaRuntimeGetVersion(&runtimeVersion);
+  printf("cudaRuntimeGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaRuntimeGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
+  }
+
   printf("[Vector addition of %d elements]\n", numElements);

   // Allocate the host input vector A
```

When the `/usr/lib64/nvidia/libcuda.so{,.1}` files don't exist, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 0.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
```

When everything works properly, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.2
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

This works on older driver versions as well, because CUDA guarantees
compatibility across minor versions within the same major version.
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Binaries compiled in this way require `/cvmfs` and `/usr/lib64/nvidia`
to work:
```bash
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace nvidia/cuda:12.0.0-runtime-ubuntu22.04 /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Actually, those are the only paths (other than a matching base OS image)
required for it to work:
```bash
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace ubuntu /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Note that `/usr/lib64/nvidia/libcuda.so{,.1}` is a runtime dependency
and not a build-time dependency.
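
One way to observe this (a sketch, assuming `strace` is available): the dynamic loader only opens `libcuda.so.1` when the program runs, so it shows up in a syscall trace even though it is not needed to compile the binary.

```bash
# Trace file opens at runtime; the loader looks up and opens libcuda.so.1 here
strace -f -e openat ./vectorAdd 2>&1 | grep libcuda
```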

ben-z authored Mar 17, 2024 · 1 parent 9f55ccd · commit 514c8d5

pages/docs/compute-cluster/slurm.mdx (55 additions, 0 deletions)

@@ -116,6 +116,52 @@ srun --gres gpu:1 --pty bash

This will allocate a whole GPU to your job. Note that this will prevent other jobs from using the GPU until your job is finished.

### Using CUDA

If your workload requires CUDA, you have a few options (not exhaustive):

#### Using the `nvidia/cuda` Docker image

You can use the `nvidia/cuda` Docker image to run CUDA workloads.
Assuming you have started the Docker daemon (see [Using Docker](#using-docker)), you can run the following command to start a CUDA container:

```bash
docker run --rm -it --gpus all -v $(pwd):/workspace nvidia/cuda:12.0.0-devel-ubuntu22.04 nvcc --version
```

Note that the CUDA version of the Docker image must be compatible with the driver version installed on the compute node (usually, this means the image's CUDA version must be lower than or equal to the version supported by the driver).
You can check the driver version by running `nvidia-smi`. If the driver version is not compatible with the Docker image, you will get an error that looks like this:

```text
> docker run --rm -it --gpus all -v $(pwd):/workspace nvidia/cuda:12.1.0-runtime-ubuntu22.04
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown.
```
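
A quick way to check which driver is installed and the highest CUDA version it supports (a sketch using standard `nvidia-smi` options):

```bash
# Driver version only
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The header of the default output also shows the highest supported CUDA version
nvidia-smi | head -n 4
```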

#### Using the Compute Canada CUDA module

The Compute Canada CVMFS[^cc-cvmfs] is mounted on the compute nodes. You can access CUDA by loading the appropriate module:

```bash
# Set up the module environment
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh
# Load the appropriate environment
module load StdEnv/2023
# Load the CUDA module
module load cuda/12.2
# Check the nvcc version
nvcc --version
```
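
Once the module is loaded, `nvcc` can compile CUDA code directly. Here is a minimal smoke test (a sketch; it assumes a GPU has already been allocated to your job):

```bash
cat > hello.cu <<'EOF'
#include <cstdio>

__global__ void hello() { printf("Hello from the GPU\n"); }

int main() {
  hello<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}
EOF
nvcc hello.cu -o hello
./hello
```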

Compute Canada only provides select versions of CUDA, and does not provide an easy way to list all available versions.
A trick you can use is to run `which nvcc{:bash}` and trace back along the directory tree to find sibling directories
that contain other CUDA versions.
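
For example, with `cuda/12.2` loaded (a sketch; the exact directory layout of the software stack may vary):

```bash
# nvcc lives at <prefix>/<version>/bin/nvcc, so two directories up and over
# lists one subdirectory per available CUDA version
ls "$(dirname "$(dirname "$(which nvcc)")")/.."
```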

Note that the version of CUDA must be compatible with the driver version installed on the compute node.
You can check the driver version by running `nvidia-smi`.
You can find the CUDA compatibility matrix [here](https://docs.nvidia.com/deploy/cuda-compatibility/index.html).

[^cc-cvmfs]: The [Compute Canada CVMFS](https://docs.alliancecan.ca/wiki/Accessing_CVMFS) is mounted at `/cvmfs/soft.computecanada.ca` on the compute nodes. It provides access to a wide variety of software via [Lmod modules](https://docs.alliancecan.ca/wiki/Utiliser_des_modules/en).

## Extra details

@@ -159,6 +205,15 @@ To request for `gpu`, use the `--gres gpu:<number_of_gpus>` flag.

[^gpu-management]: For more information on GPU management, please refer to the [GPU Management](https://slurm.schedmd.com/gres.html#GPU_Management) SLURM documentation.

### CVMFS

CVMFS (CernVM File System)[^cvmfs] is a software distribution system that is widely adopted in the HPC community.
It provides a way to distribute software to compute nodes without having to install the software on the nodes themselves.

We make use of the [Compute Canada CVMFS](https://docs.alliancecan.ca/wiki/Accessing_CVMFS) to provide access to software available on Compute Canada clusters.
For example, you can access CUDA by loading the appropriate module (see [Using CUDA](#using-cuda)).
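
A quick way to confirm that the repository is mounted and reachable on a node (a sketch; the second command requires the CVMFS client tools):

```bash
# The repository is auto-mounted on first access
ls /cvmfs/soft.computecanada.ca
# Probe connectivity to the repository
cvmfs_config probe soft.computecanada.ca
```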

[^cvmfs]: https://cvmfs.readthedocs.io/en/stable/

{
// Separate footnotes from the main content
