Verify CUDA environment variables #213

Open
errordeveloper opened this issue Oct 22, 2024 · 10 comments


errordeveloper commented Oct 22, 2024

Currently enroot trusts CUDA environment variables and calls nvidia-container-cli without checking whether the drivers are installed and whether the required shared libraries, e.g. libnvidia-ml.so.1, are present.
This is problematic on a Slurm cluster with a mix of CPU and GPU nodes. Slurm copies environment variables from the head node and there is little control over those, so a user with CUDA environment variables set in their shell on a login node cannot run jobs on the CPU nodes: enroot errors out because it fails to find libnvidia-ml.so.1.

$ NVIDIA_VISIBLE_DEVICES=all srun --container-image=centos grep PRETTY /etc/os-release
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     [WARN] Kernel module nvidia_uvm is not loaded. Make sure the NVIDIA device driver is installed and loaded.
slurmstepd: error: pyxis:     nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
errordeveloper changed the title from "Verify CUDA environment variable" to "Verify CUDA environment variables" on Oct 22, 2024

krono commented Oct 22, 2024

We had that recently on our cluster, but I would suggest that this is not something enroot needs to fix.

If your environment says "use GPUs", and Slurm propagates that fact, it is not a failure on enroot's part to act as if there are GPUs (my opinion).

I suggested to my users to

export NVIDIA_VISIBLE_DEVICES=void

and

srun … --container-env=NVIDIA_VISIBLE_DEVICES …

when using pyxis.
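
A sketch of how the two pieces might fit together (the image and command are the ones from the report above; exactly how pyxis resolves --container-env precedence is an assumption to verify, not something stated in this thread):

# In the login-node shell profile: make "no GPUs" the default for enroot's NVIDIA hook.
export NVIDIA_VISIBLE_DEVICES=void

# When launching through pyxis, list the variable explicitly so the host's value
# is the one the 98-nvidia.sh hook ends up seeing inside the job.
srun --container-env=NVIDIA_VISIBLE_DEVICES \
     --container-image=centos grep PRETTY /etc/os-release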


errordeveloper commented Oct 22, 2024 via email


krono commented Oct 22, 2024

There is a "preflight check": https://github.com/NVIDIA/enroot/blob/master/conf/hooks/98-nvidia.sh#L33
If the Env var is unset or void, it silently stops.
This is sane and useful behavior.
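
Paraphrased, that check behaves roughly like this (a sketch of the behaviour described above, not the exact source):

# Sketch: if NVIDIA_VISIBLE_DEVICES is unset, empty, or "void",
# the hook exits immediately and no GPU setup is attempted.
if [ -z "${NVIDIA_VISIBLE_DEVICES-}" ] || [ "${NVIDIA_VISIBLE_DEVICES}" = "void" ]; then
    exit 0
fi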

If you do not want GPU behavior, do not set these Env vars.

errordeveloper commented:

> There is a "preflight check": https://github.com/NVIDIA/enroot/blob/master/conf/hooks/98-nvidia.sh#L33
> If the Env var is unset or void, it silently stops.

Yes, I have seen this; however, the rest of the code blindly assumes that nvidia-container-cli will be able to start up, which it cannot when the GPU driver is missing and libnvidia-ml.so.1 cannot be loaded.

What I'm suggesting is a more concrete test that tries to run nvidia-container-cli --version or some other simple command, and/or checks whether libnvidia-ml.so.1 exists, etc.
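
For illustration, such a check near the top of 98-nvidia.sh could look roughly like this (the messages and exact commands are made up, not a concrete proposal):

# Hypothetical extra preflight check: skip GPU setup gracefully when the
# driver stack is not actually present on this node.
if ! command -v nvidia-container-cli > /dev/null 2>&1; then
    printf "[WARN] nvidia-container-cli not found, skipping GPU setup\n" >&2
    exit 0
fi
if ! ldconfig -p | grep -q "libnvidia-ml\.so\.1"; then
    printf "[WARN] libnvidia-ml.so.1 not found, skipping GPU setup\n" >&2
    exit 0
fi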

I am speaking of additional verification; see how 99-mellanox.sh is a lot more thorough and does not assume all env vars are correct:

# Lookup all the devices and their respective driver.
for uevent in /sys/bus/pci/drivers/mlx?_core/*/infiniband_verbs/*/uevent; do
    case "${uevent}" in
    *mlx4*) drivers+=("mlx4") ;;
    *mlx5*) drivers+=("mlx5") ;;
    *) continue ;;
    esac
    devices+=("$(. "${uevent}"; echo "/dev/${DEVNAME}")")
done

# Lookup all the interfaces.
for uevent in /sys/bus/pci/drivers/mlx?_core/*/infiniband/*/uevent; do
    ifaces+=("$(. "${uevent}"; echo "${NAME}")")
done

# Lookup all the management devices.
for uevent in /sys/bus/pci/drivers/mlx?_core/*/infiniband_mad/*/uevent; do
    case "${uevent}" in
    *issm*) issms+=("$(. "${uevent}"; echo "/dev/${DEVNAME}")") ;;
    *umad*) umads+=("$(. "${uevent}"; echo "/dev/${DEVNAME}")") ;;
    *) continue ;;
    esac
done

One big issue with env vars in general is that users cannot always control them and may not be aware of which environment variables they have set. So any application that relies on environment variables for configuration should ideally take additional steps to verify them.


krono commented Oct 22, 2024

Or you could remove 98-nvidia.sh on the machines w/o GPUs…

errordeveloper commented:

> Or you could remove 98-nvidia.sh on the machines w/o GPUs…

Yes, that's the workaround I've arrived at; I'm just hoping the maintainers might be willing to find a better solution.


krono commented Oct 22, 2024

Ultimately, it's up to you which hooks you deploy; if these "hooks" were an integral part of enroot, the devs would probably not have externalised them into hooks.
But again, that's just my opinion :)


3XX0 commented Oct 22, 2024

Or add the following on your CPU nodes:

# /etc/enroot/environ.d/50-cpu.env
NVIDIA_VISIBLE_DEVICES="void"

You can also write your own hook which sets this conditionally if you're using GRES.
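
A very rough sketch of such a hook; the file name, the Slurm variable, and the use of ENROOT_ENVIRON to push the value into the container environment are all assumptions to verify against your enroot/pyxis setup, not something confirmed here:

#!/bin/bash
# Hypothetical /etc/enroot/hooks.d/10-cpu-default.sh (name made up).
# If the job did not request GPUs through GRES, force "void" so that
# 98-nvidia.sh becomes a no-op. SLURM_JOB_GPUS is only one of the
# variables Slurm may export for GRES; check what your site provides.
set -eu

if [ -z "${SLURM_JOB_GPUS-}" ]; then
    echo "NVIDIA_VISIBLE_DEVICES=void" >> "${ENROOT_ENVIRON}"
fi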

errordeveloper commented:

> Or add the following on your CPU nodes:
>
> # /etc/enroot/environ.d/50-cpu.env
> NVIDIA_VISIBLE_DEVICES="void"

I am pretty sure we have tried this and it didn't work. Are you sure this would be loaded before the hooks?

> You can also write your own hook which sets this conditionally if you're using GRES.

Any pointers? I don't see much documentation around the hooks, and from reading '98-nvidia.sh' and '99-mellanox.sh' I am not sure I can tell how they are meant to work. The first one does an 'exec', while the second is supposed to run after it, but how can that happen if there was an 'exec'?


flx42 commented Oct 23, 2024

> I am pretty sure we have tried this and it didn't work. Are you sure this would be loaded before the hooks?

That's what we use on our clusters; it should work fine.
