Some runners currently have driver issues #165
Comments
Pax nightly tests are failing with a similar error. This is from the latest nightly tests:
Here's an image that had this issue for troubleshooting: ghcr.io/nvidia/t5x:nightly-2023-08-20
Related: #161
This turns out to be a driver issue on the runners.
Our SLURM cluster is still suffering from the problem, since the CUDA driver version there is R515, which per our forward-compatibility guide is:
I'm working on a way to upgrade/downgrade the worker node images without recreating the entire cluster.
Any update on this?
@yhtang Can we close this?
Yes, this is fixed. The nodes now have the R525 "LTS" driver.
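For anyone who wants to check which driver branch a node reports before running the containers, here is a minimal sketch. It assumes `nvidia-smi` is on the PATH and uses 525 as the threshold only because R525 is the branch reported as working in this thread; it is not an official minimum. The helper names are hypothetical.

```python
import subprocess
from typing import List

# Assumption: R525 is the branch reported as fixed in this thread,
# not a documented minimum requirement.
MIN_BRANCH = 525


def driver_branch(version: str) -> int:
    """Return the branch number from a driver version string like '525.85.12'."""
    return int(version.strip().split(".")[0])


def is_supported(version: str, min_branch: int = MIN_BRANCH) -> bool:
    """True if the driver branch is at least min_branch."""
    return driver_branch(version) >= min_branch


def query_driver_versions() -> List[str]:
    """Ask nvidia-smi for the driver version, one entry per visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]
```

On a runner, `all(is_supported(v) for v in query_driver_versions())` would flag nodes that are still on an older branch such as R515.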
This issue was first spotted by @maanug-nv.
Here is an example run with the error:
The runner it selects is: