Some runners currently have driver issues #165

Closed
terrykong opened this issue Aug 18, 2023 · 8 comments

@terrykong
Contributor

This issue was first spotted by @maanug-nv.

Here is an example run with the error:

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.1, please update your driver to a newer version, or use an earlier cuda container: unknown.
time="2023-08-15T21:50:48Z" level=error msg="error waiting for container: "
Error: Process completed with exit code 125.

The runner it selects is:

runs-on: [self-hosted, V100]
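
A quick way to compare what the runner provides with what the container demands could look like the sketch below (the image name is a placeholder; NVIDIA_REQUIRE_CUDA is the env var that nvidia-container-cli evaluates at container start):

# Driver version available on the runner
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# CUDA requirement baked into the image
docker pull <failing-image>
docker image inspect --format '{{json .Config.Env}}' <failing-image> | tr ',' '\n' | grep NVIDIA_REQUIRE_CUDA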
@ashors1
Contributor

ashors1 commented Aug 21, 2023

Pax nightly tests are failing with a similar error. This is from the latest nightly tests:

printing enroot log file:
slurmstepd: error: pyxis:     nvidia-container-cli: requirement error: unsatisfied condition: cuda>=12.2, please update your driver to a newer version, or use an earlier cuda container
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
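
To see which nodes are affected, something like the following rough loop could list the driver version per GPU node (the partition name is a placeholder):

# Print the NVIDIA driver version on each node of the GPU partition
for node in $(sinfo -p gpu -N -h -o "%N" | sort -u); do
  printf '%s: ' "$node"
  srun -w "$node" -p gpu --gres=gpu:1 -N1 -n1 nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1
done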

@terrykong
Contributor Author

Here's an image that had this issue for troubleshooting: ghcr.io/nvidia/t5x:nightly-2023-08-20

@terrykong
Contributor Author

Related: #161

@yhtang
Collaborator

yhtang commented Aug 28, 2023

This turns out to be a driver issue on the runners.
In particular, our V100 runners use OKE (Oracle K8s Engine) images that are updated roughly every month. The GPU-2023.06.30-0 image bumped the NVIDIA driver to 495.29.05, a branch (R495) that is excluded from the forward-compatibility support matrix. Before the bump, the driver was 470.57.02 (R470), which is older but fully covered by forward compatibility. Fortunately, a newer image with R525 drivers is now available, supporting CUDA up to 12.2.
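
For reference, forward compatibility only bridges the gap when the host driver is on a supported branch (e.g. R470 or R525 LTS) and the container bundles the cuda-compat user-space libraries; R495 is not a supported target regardless. A rough way to check whether an image ships those libraries (running without --gpus so the requirement check is not triggered; the path is the standard cuda-compat install location):

# Does the image bundle forward-compat libraries (cuda-compat package)?
docker run --rm --entrypoint bash ghcr.io/nvidia/t5x:nightly-2023-08-20 \
  -c 'ls /usr/local/cuda/compat/ 2>/dev/null || echo "no compat libs bundled"'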

@yhtang
Collaborator

yhtang commented Aug 28, 2023

Our SLURM cluster is still suffering from the problem, since the driver there is on the R515 branch, which per our forward compatibility guide is:

Branches R515, R510, R465, R460, R455, R440, R418, R410, R396, R390 are end of life and are not supported targets for compatibility.

I'm working on a way to upgrade/downgrade the worker node images without recreating the entire cluster.
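
As a diagnostic only (not a fix), the start-up requirement check can in principle be skipped by exporting NVIDIA_DISABLE_REQUIRE=1, assuming the enroot NVIDIA hook honors it the way libnvidia-container does. Since R515 is EOL and outside forward compatibility, CUDA 12.x code would still fail at runtime, so this would only help confirm where the failure comes from:

# Diagnostic only, NOT a workaround: skip the cuda>=12.2 start-up check
# (assumes NVIDIA_DISABLE_REQUIRE is honored by the enroot/pyxis NVIDIA hook)
NVIDIA_DISABLE_REQUIRE=1 srun -p gpu --gres=gpu:1 \
    --container-image=ghcr.io#nvidia/t5x:nightly-2023-08-20 nvidia-smi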

@nouiz
Collaborator

nouiz commented Sep 5, 2023

Any update on this?

@nouiz
Collaborator

nouiz commented Oct 10, 2023

@yhtang Can we close this?

@yhtang
Collaborator

yhtang commented Oct 10, 2023

Yes, this is fixed. The nodes now run the R525 LTS driver.
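
A quick smoke test on the upgraded nodes could look like this (the tag is just an example CUDA 12.2 base image):

# Should now start cleanly against the R525 driver
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
srun -p gpu --gres=gpu:1 --container-image=nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi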

@yhtang yhtang closed this as completed Oct 10, 2023