Some runners currently have driver issues #165

Closed
terrykong opened this issue Aug 18, 2023 · 8 comments

@terrykong
Contributor

This issue was first spotted by @maanug-nv.

Here is an example run with the error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown.
time="2023-08-15T21:50:48Z" level=error msg="error waiting for container: "
Error: Process completed with exit code 125.

The runner it selects is:

runs-on: [self-hosted, V100]
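
A quick way to compare what the runner provides with what the container demands could look like the sketch below (the image name is a placeholder; NVIDIA_REQUIRE_CUDA is the env var that nvidia-container-cli evaluates at container start):

# Driver version available on the runner
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA requirement baked into the image
docker pull <failing-image>
docker image inspect --format '{{json .Config.Env}}' <failing-image> | tr ',' '\n' | grep NVIDIA_REQUIRE_CUDA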
@ashors1
Contributor

ashors1 commented Aug 21, 2023

Pax nightly tests are failing with a similar error. This is from the latest nightly tests:

printing enroot log file:
slurmstepd: error: pyxis:     nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.2, please update your driver to a newer version, or use an earlier cuda container
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
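
To see which nodes are affected, something like the following rough loop could list the driver version per GPU node (the partition name is a placeholder):

# Print the NVIDIA driver version on each node of the GPU partition
for node in $(sinfo -p gpu -N -h -o "%N" | sort -u); do
  printf '%s: ' "$node"
  srun -w "$node" -p gpu --gres=gpu:1 -N1 -n1 nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
done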

@terrykong
Contributor Author

Here's an image that had this issue for troubleshooting: ghcr.io/nvidia/t5x:nightly-2023-08-20

@terrykong
Contributor Author

Related: #161

@yhtang
Collaborator

yhtang commented Aug 28, 2023

This turns out to be a driver issue on the runners.
In particular, our V100 runners use OKE (Oracle K8s Engine) images that are updated roughly every month. The GPU-2023.06.30-0 image bumped the NVIDIA driver to 495.29.05, a branch (R495) that is excluded from the forward-compatibility support matrix. Before the bump, the driver was 470.57.02 (R470), which is older but fully covered by forward compatibility. Fortunately, a newer image with R525 drivers is now available, supporting CUDA up to 12.2.
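
For reference, forward compatibility only bridges the gap when the host driver is on a supported branch (e.g. R470 or R525 LTS) and the container bundles the cuda-compat user-space libraries; R495 is not a supported target regardless. A rough way to check whether an image ships those libraries (running without --gpus so the requirement check is not triggered; the path is the standard cuda-compat install location):

# Does the image bundle forward-compat libraries (cuda-compat package)?
docker run --rm --entrypoint bash ghcr.io/nvidia/t5x:nightly-2023-08-20 \
  -c 'ls /usr/local/cuda/compat/ 2>/dev/null || echo "no compat libs bundled"'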

@yhtang
Collaborator

yhtang commented Aug 28, 2023

Our SLURM cluster is still suffering from the problem, since the driver there is on the R515 branch, which per our forward compatibility guide is:

Branches R515, R510, R465, R460, R455, R440, R418, R410, R396, R390 are end of life and are not supported targets for compatibility.

I'm working on a way to upgrade/downgrade the worker node images without recreating the entire cluster.
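
As a diagnostic only (not a fix), the start-up requirement check can in principle be skipped by exporting NVIDIA_DISABLE_REQUIRE=1, assuming the enroot NVIDIA hook honors it the way libnvidia-container does. Since R515 is EOL and outside forward compatibility, CUDA 12.x code would still fail at runtime, so this would only help confirm where the failure comes from:

# Diagnostic only, NOT a workaround: skip the cuda>=12.2 start-up check
# (assumes NVIDIA_DISABLE_REQUIRE is honored by the enroot/pyxis NVIDIA hook)
NVIDIA_DISABLE_REQUIRE=1 srun -p gpu --gres=gpu:1 \
    --container-image=ghcr.io#nvidia/t5x:nightly-2023-08-20 nvidia-smi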

@nouiz
Collaborator

nouiz commented Sep 5, 2023

Any update on this?

@nouiz
Collaborator

nouiz commented Oct 10, 2023

@yhtang Can we close this?

@yhtang
Collaborator

yhtang commented Oct 10, 2023

Yes, this is fixed. The nodes now run the R525 LTS driver.
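
A quick smoke test on the upgraded nodes could look like this (the tag is just an example CUDA 12.2 base image):

# Should now start cleanly against the R525 driver
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
srun -p gpu --gres=gpu:1 --container-image=nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi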

@yhtang yhtang closed this as completed Oct 10, 2023