Verify CUDA environment variables #213
Comments
We had that recently on our cluster, but I would suggest that this is nothing enroot can solve. If your environment says "use GPUs", and Slurm propagates that fact, it is not enroot's fault. I suggested to my users to unset these variables when using pyxis.
I would say that `98-nvidia.sh` could just use a few simple preflight checks, e.g. whether the libraries are available; something like `nvidia-container-cli --version` might suffice. Or perhaps devfs could be checked. The other script, `99-mellanox.sh`, has a few extra checks, including sysfs.
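A minimal sketch of the kind of preflight check meant here (hypothetical code, not taken from the actual hook; probing `/dev/nvidiactl` is just one assumption about how devfs could be checked):

```sh
# Hypothetical preflight checks for the top of 98-nvidia.sh:

# Bail out quietly if the container CLI is missing or not working.
command -v nvidia-container-cli > /dev/null 2>&1 || exit 0
nvidia-container-cli --version > /dev/null 2>&1 || exit 0

# Optionally probe devfs: no driver device node means no usable GPU.
[ -e /dev/nvidiactl ] || exit 0
```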
There is a "preflight check": https://github.com/NVIDIA/enroot/blob/master/conf/hooks/98-nvidia.sh#L33 If you do not want GPU behavior, do not set these Env vars. |
Yes, I have seen this. However, the rest of the code blindly assumes that the driver and its libraries are actually present. What I'm suggesting is a more concrete test that tries to run `nvidia-container-cli` first. I am speaking of additional verification; see how `99-mellanox.sh` does it (enroot/conf/hooks/99-mellanox.sh, lines 42 to 63 at 09ae4b2).
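For instance, something like the following (a hypothetical sketch, only loosely modelled on the `99-mellanox.sh` checks, not their actual code):

```sh
# Hypothetical extra verification before trusting the CUDA env vars:
# skip GPU setup if the driver's shared library is not installed.
if ! ldconfig -p 2> /dev/null | grep -q "libnvidia-ml.so.1"; then
    exit 0  # no NVML library on this node; nothing for this hook to do
fi
```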
One big issue with env vars in general is that users cannot always control them and may be unaware of which environment variables they have set. So any application that relies on environment variables for configuration should ideally take additional steps to verify them.
Or you could remove the hook.
Yes, that's a workaround I've arrived at; I'm just hoping the maintainers might be willing to find a better solution.
Ultimately, it's up to you which hooks you deploy; if these "hooks" were an integral part of enroot, the devs would probably not have externalised them into hooks.
Or add the following on your CPU nodes:
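Presumably something along these lines, setting `NVIDIA_VISIBLE_DEVICES=void` through enroot's `environ.d` mechanism (the file name and path are assumptions; `void` is what the early-exit guard in `98-nvidia.sh` appears to look for):

```sh
# e.g. /etc/enroot/environ.d/99-no-gpu.env (hypothetical file name):
# "void" should make 98-nvidia.sh exit early, so CPU nodes never
# invoke nvidia-container-cli at all.
NVIDIA_VISIBLE_DEVICES=void
```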
You can also write your own hook which sets this conditionally if you're using GRES. |
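A rough sketch of such a hook (assumptions: hooks can append to the container environment file via `${ENROOT_ENVIRON}`, and Slurm exports `SLURM_JOB_GPUS`/`SLURM_STEP_GPUS` only when GPUs were allocated through GRES; verify both against your versions):

```sh
#!/bin/sh
# Hypothetical hook, e.g. /etc/enroot/hooks.d/10-gpu-guard.sh:
# disable GPU setup unless Slurm actually granted GPUs to this job.
set -eu

if [ -z "${SLURM_JOB_GPUS-}" ] && [ -z "${SLURM_STEP_GPUS-}" ]; then
    echo "NVIDIA_VISIBLE_DEVICES=void" >> "${ENROOT_ENVIRON}"
fi
```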
I am pretty sure we have tried this and it didn't work. Are you sure this would be loaded before the hooks?
Any pointers? I don't see many docs around the hooks, and from reading `98-nvidia.sh` and `99-mellanox.sh` I am not sure I can tell how these are meant to work. The first one does an `exec`, while the second would run after that, but how is that meant to happen if there was an `exec`?
That's what we use on our clusters; it should work fine.
Currently `enroot` trusts CUDA environment variables and calls `nvidia-container-cli` without checking whether the drivers are installed and the shared libraries are present, e.g. `libnvidia-ml.so.1`.

This is problematic on Slurm clusters with a mix of CPU and GPU nodes. Slurm copies environment variables from the head node, and there is little control over those. So a user with CUDA environment variables set in their shell on a login node cannot run jobs on CPU nodes, as `enroot` will error out when it fails to find `libnvidia-ml.so.1`.
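For example (a hypothetical reproduction; the partition name and variable are placeholders):

```sh
# On the login node, a leftover CUDA-related variable in the shell:
export NVIDIA_VISIBLE_DEVICES=all

# Submitting to a CPU-only partition then fails: Slurm propagates the
# variable, 98-nvidia.sh sees it and runs nvidia-container-cli, which
# needs libnvidia-ml.so.1 from the (absent) NVIDIA driver.
srun --partition=cpu --container-image=ubuntu true
```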