Mount CVMFS on SLURM compute nodes (#2477)

We can use it to run software such as CUDA without having to install it on the compute nodes.

Main doc: https://docs.alliancecan.ca/wiki/Accessing_CVMFS

### Notes on the Compute Canada CUDA library requirements

`/usr/lib64/nvidia/libcuda.so` and `/usr/lib64/nvidia/libcuda.so.1` must
exist. Otherwise, we get the error `CUDA driver version is insufficient for
CUDA runtime version`.
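
A quick way to check whether those files are present on a node (a minimal sketch):

```bash
# Check for the symlinks the Compute Canada software stack expects
ls -l /usr/lib64/nvidia/libcuda.so /usr/lib64/nvidia/libcuda.so.1
# Locate the driver library that the distribution actually ships
ldconfig -p | grep libcuda
```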

https://docs.alliancecan.ca/wiki/Accessing_CVMFS#CUDA_location

https://github.com/ComputeCanada/software-stack-config/blob/a5557c946ca25e2ca41b74716557eb9f5ab5e9c1/lmod/SitePackage.lua#L203-L219

Related issues:
ComputeCanada/software-stack#58
ComputeCanada/software-stack#79

This works on Ubuntu 22.04 with CUDA 12.2 (driver version `535.161.07`):

```bash
mkdir /usr/lib64/nvidia
cd /usr/lib64/nvidia
ln -s ../../lib/x86_64-linux-gnu/libcuda.so .
ln -s ../../lib/x86_64-linux-gnu/libcuda.so.1 .
```
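
A quick sanity check that the new links resolve (assuming the Ubuntu 22.04 paths above):

```bash
# -L follows the symlinks, so this fails loudly if a target is missing
ls -lL /usr/lib64/nvidia/libcuda.so /usr/lib64/nvidia/libcuda.so.1
```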

You can test this by compiling and running the `vectorAdd` program in
cuda-samples:
https://github.com/NVIDIA/cuda-samples/tree/3559ca4d088e12db33d6918621cab5c998ccecf1/Samples/0_Introduction/vectorAdd
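
A rough sketch of building and running the sample (assuming the per-sample Makefile layout used by that revision of cuda-samples):

```bash
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/0_Introduction/vectorAdd
make          # uses nvcc from PATH
./vectorAdd
```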

Here's a diff to print out the driver and runtime versions:
```diff
diff --git a/Samples/0_Introduction/vectorAdd/vectorAdd.cu b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
index 284b0f0e..3b22df2b 100644
--- a/Samples/0_Introduction/vectorAdd/vectorAdd.cu
+++ b/Samples/0_Introduction/vectorAdd/vectorAdd.cu
@@ -64,6 +64,30 @@ int main(void) {
   // Print the vector length to be used, and compute its size
   int numElements = 50000;
   size_t size = numElements * sizeof(float);
+
+  int driverVersion = 0, runtimeVersion = 0;
+
+
+  cudaError_t error;
+
+  // Get CUDA Driver Version
+  error = cudaDriverGetVersion(&driverVersion);
+  printf("cudaDriverGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaDriverGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Driver Version: %d.%d\n", driverVersion / 1000, (driverVersion % 100) / 10);
+  }
+
+  // Get CUDA Runtime Version
+  error = cudaRuntimeGetVersion(&runtimeVersion);
+  printf("cudaRuntimeGetVersion() - error: %d\n", error);
+  if (error != cudaSuccess) {
+      printf("cudaRuntimeGetVersion error: %d\n", error);
+  } else {
+      printf("CUDA Runtime Version: %d.%d\n", runtimeVersion / 1000, (runtimeVersion % 100) / 10);
+  }
+
   printf("[Vector addition of %d elements]\n", numElements);

   // Allocate the host input vector A
```

When the `/usr/lib64/nvidia/libcuda.so{,.1}` files don't exist, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 0.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
```

When everything works properly, we get:
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.2
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

This works on older driver versions as well, because CUDA guarantees
compatibility across minor versions within the same major version.
```
cudaDriverGetVersion() - error: 0
CUDA Driver Version: 12.0
cudaRuntimeGetVersion() - error: 0
CUDA Runtime Version: 12.2
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

Binaries compiled in this way require `/cvmfs` and `/usr/lib64/nvidia`
to work:
```bash
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace nvidia/cuda:12.0.0-runtime-ubuntu22.04 /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Actually, those are the only paths (other than a matching base OS image)
required for it to work:
```bash
docker run --rm -it --gpus all -v /cvmfs:/cvmfs:ro -v /usr/lib64/nvidia:/usr/lib64/nvidia:ro -v /home/ben/Projects/cuda-samples:/workspace ubuntu /workspace/Samples/0_Introduction/vectorAdd/vectorAdd
```

Note that `/usr/lib64/nvidia/libcuda.so{,.1}` is a runtime dependency
and not a build-time dependency.
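
One way to observe this (a sketch, assuming `strace` is available): the dynamic loader only opens `libcuda.so.1` when the program runs, so it shows up in a syscall trace even though it is not needed to compile the binary.

```bash
# Trace file opens at runtime; the loader looks up and opens libcuda.so.1 here
strace -f -e openat ./vectorAdd 2>&1 | grep libcuda
```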

ben-z authored Mar 17, 2024 · 1 parent 9f55ccd · commit 514c8d5

pages/docs/compute-cluster/slurm.mdx (55 additions, 0 deletions)

@@ -116,6 +116,52 @@ srun --gres gpu:1 --pty bash

This will allocate a whole GPU to your job. Note that this will prevent other jobs from using the GPU until your job is finished.

### Using CUDA

If your workload requires CUDA, you have a few options (not exhaustive):

#### Using the `nvidia/cuda` Docker image

You can use the `nvidia/cuda` Docker image to run CUDA workloads.
Assuming you have started the Docker daemon (see [Using Docker](#using-docker)), you can run the following command to start a CUDA container:

```bash
docker run --rm -it --gpus all -v $(pwd):/workspace nvidia/cuda:12.0.0-devel-ubuntu22.04 nvcc --version
```

Note that the CUDA version of the Docker image must be compatible with the driver version installed on the compute node (usually, this means the image's CUDA version must be lower than or equal to the version supported by the driver).
You can check the driver version by running `nvidia-smi`. If the driver version is not compatible with the Docker image, you will get an error that looks like this:

```text
> docker run --rm -it --gpus all -v $(pwd):/workspace nvidia/cuda:12.1.0-runtime-ubuntu22.04
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown.
```
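
A quick way to check which driver is installed and the highest CUDA version it supports (a sketch using standard `nvidia-smi` options):

```bash
# Driver version only
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# The header of the default output also shows the highest supported CUDA version
nvidia-smi | head -n 4
```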

#### Using the Compute Canada CUDA module

The Compute Canada CVMFS[^cc-cvmfs] is mounted on the compute nodes. You can access CUDA by loading the appropriate module:

```bash
# Set up the module environment
source /cvmfs/soft.computecanada.ca/config/profile/bash.sh
# Load the appropriate environment
module load StdEnv/2023
# Load the CUDA module
module load cuda/12.2
# Check the nvcc version
nvcc --version
```
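
Once the module is loaded, `nvcc` can compile CUDA code directly. Here is a minimal smoke test (a sketch; it assumes a GPU has already been allocated to your job):

```bash
cat > hello.cu <<'EOF'
#include <cstdio>

__global__ void hello() { printf("Hello from the GPU\n"); }

int main() {
  hello<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}
EOF
nvcc hello.cu -o hello
./hello
```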

Compute Canada only provides select versions of CUDA, and does not provide an easy way to list all available versions.
A trick you can use is to run `which nvcc{:bash}` and trace back along the directory tree to find sibling directories
that contain other CUDA versions.
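
For example, with `cuda/12.2` loaded (a sketch; the exact directory layout of the software stack may vary):

```bash
# nvcc lives at <prefix>/<version>/bin/nvcc, so two directories up and over
# lists one subdirectory per available CUDA version
ls "$(dirname "$(dirname "$(which nvcc)")")/.."
```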

Note that the version of CUDA must be compatible with the driver version installed on the compute node.
You can check the driver version by running `nvidia-smi`.
You can find the CUDA compatibility matrix [here](https://docs.nvidia.com/deploy/cuda-compatibility/index.html).

[^cc-cvmfs]: The [Compute Canada CVMFS](https://docs.alliancecan.ca/wiki/Accessing_CVMFS) is mounted at `/cvmfs/soft.computecanada.ca` on the compute nodes. It provides access to a wide variety of software via [Lmod modules](https://docs.alliancecan.ca/wiki/Utiliser_des_modules/en).

## Extra details

@@ -159,6 +205,15 @@ To request for `gpu`, use the `--gres gpu:<number_of_gpus>` flag.

[^gpu-management]: For more information on GPU management, please refer to the [GPU Management](https://slurm.schedmd.com/gres.html#GPU_Management) SLURM documentation.

### CVMFS

CVMFS (CernVM File System)[^cvmfs] is a software distribution system that is widely adopted in the HPC community.
It provides a way to distribute software to compute nodes without having to install the software on the nodes themselves.

We make use of the [Compute Canada CVMFS](https://docs.alliancecan.ca/wiki/Accessing_CVMFS) to provide access to software available on Compute Canada clusters.
For example, you can access CUDA by loading the appropriate module (see [Using CUDA](#using-cuda)).
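
A quick way to confirm that the repository is mounted and reachable on a node (a sketch; the second command requires the CVMFS client tools):

```bash
# The repository is auto-mounted on first access
ls /cvmfs/soft.computecanada.ca
# Probe connectivity to the repository
cvmfs_config probe soft.computecanada.ca
```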

[^cvmfs]: https://cvmfs.readthedocs.io/en/stable/

{
// Separate footnotes from the main content
