Verify CUDA environment variables #213

Open
errordeveloper opened this issue Oct 22, 2024 · 10 comments


errordeveloper commented Oct 22, 2024

Currently enroot trusts CUDA environment variables and calls nvidia-container-cli without checking whether the drivers are installed and whether the required shared libraries, e.g. libnvidia-ml.so.1, are present.
This is problematic on a Slurm cluster with a mix of CPU and GPU nodes. Slurm copies environment variables from the head node and there is little control over those, so a user with CUDA environment variables set in their shell on a login node cannot run jobs on the CPU nodes: enroot errors out because it fails to find libnvidia-ml.so.1.

$ NVIDIA_VISIBLE_DEVICES=all srun --container-image=centos grep PRETTY /etc/os-release
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     [WARN] Kernel module nvidia_uvm is not loaded. Make sure the NVIDIA device driver is installed and loaded.
slurmstepd: error: pyxis:     nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
errordeveloper changed the title from "Verify CUDA environment variable" to "Verify CUDA environment variables" on Oct 22, 2024

krono commented Oct 22, 2024

We had that recently on our cluster, but I would suggest that this is not something enroot needs to fix.

If your environment says "use GPUs", and Slurm propagates that fact, it is not a failure on enroot's part to act as if there are GPUs (my opinion).

I suggested to my users to

export NVIDIA_VISIBLE_DEVICES=void

and

srun … --container-env=NVIDIA_VISIBLE_DEVICES …

when using pyxis.
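
A sketch of how the two pieces might fit together (the image and command are the ones from the report above; exactly how pyxis resolves --container-env precedence is an assumption to verify, not something stated in this thread):

# In the login-node shell profile: make "no GPUs" the default for enroot's NVIDIA hook.
export NVIDIA_VISIBLE_DEVICES=void

# When launching through pyxis, list the variable explicitly so the host's value
# is the one the 98-nvidia.sh hook ends up seeing inside the job.
srun --container-env=NVIDIA_VISIBLE_DEVICES \
     --container-image=centos grep PRETTY /etc/os-release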


errordeveloper commented Oct 22, 2024 via email


krono commented Oct 22, 2024

There is a "preflight check": https://github.com/NVIDIA/enroot/blob/master/conf/hooks/98-nvidia.sh#L33
If the Env var is unset or void, it silently stops.
This is sane and useful behavior.
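
Paraphrased, that check behaves roughly like this (a sketch of the behaviour described above, not the exact source):

# Sketch: if NVIDIA_VISIBLE_DEVICES is unset, empty, or "void",
# the hook exits immediately and no GPU setup is attempted.
if [ -z "${NVIDIA_VISIBLE_DEVICES-}" ] || [ "${NVIDIA_VISIBLE_DEVICES}" = "void" ]; then
    exit 0
fi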

If you do not want GPU behavior, do not set these Env vars.

errordeveloper commented:

> There is a "preflight check": https://github.com/NVIDIA/enroot/blob/master/conf/hooks/98-nvidia.sh#L33
> If the Env var is unset or void, it silently stops.

Yes, I have seen this; however, the rest of the code blindly assumes that nvidia-container-cli will be able to start up, which it cannot when the GPU driver is missing and libnvidia-ml.so.1 cannot be loaded.

What I'm suggesting is a more concrete test that tries to run nvidia-container-cli --version or some other simple command, and/or checks whether libnvidia-ml.so.1 exists, etc.
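
For illustration, such a check near the top of 98-nvidia.sh could look roughly like this (the messages and exact commands are made up, not a concrete proposal):

# Hypothetical extra preflight check: skip GPU setup gracefully when the
# driver stack is not actually present on this node.
if ! command -v nvidia-container-cli > /dev/null 2>&1; then
    printf "[WARN] nvidia-container-cli not found, skipping GPU setup\n" >&2
    exit 0
fi
if ! ldconfig -p | grep -q "libnvidia-ml\.so\.1"; then
    printf "[WARN] libnvidia-ml.so.1 not found, skipping GPU setup\n" >&2
    exit 0
fi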

I am speaking of additional verification; see how 99-mellanox.sh is a lot more thorough and does not assume all env vars are correct:

# Lookup all the devices and their respective driver.
for uevent in /sys/bus/pci/drivers/mlx?_core/*/infiniband_verbs/*/uevent; do
    case "${uevent}" in
    *mlx4*) drivers+=("mlx4") ;;
    *mlx5*) drivers+=("mlx5") ;;
    *) continue ;;
    esac
    devices+=("$(. "${uevent}"; echo "/dev/${DEVNAME}")")
done

# Lookup all the interfaces.
for uevent in /sys/bus/pci/drivers/mlx?_core/*/infiniband/*/uevent; do
    ifaces+=("$(. "${uevent}"; echo "${NAME}")")
done

# Lookup all the management devices.
for uevent in /sys/bus/pci/drivers/mlx?_core/*/infiniband_mad/*/uevent; do
    case "${uevent}" in
    *issm*) issms+=("$(. "${uevent}"; echo "/dev/${DEVNAME}")") ;;
    *umad*) umads+=("$(. "${uevent}"; echo "/dev/${DEVNAME}")") ;;
    *) continue ;;
    esac
done

One big issue with env vars in general is that users cannot always control them and may not be aware of which environment variables they have set. So any application that relies on environment variables for configuration should ideally take additional steps to verify them.


krono commented Oct 22, 2024

Or you could remove 98-nvidia.sh on the machines w/o GPUs…

errordeveloper commented:

> Or you could remove 98-nvidia.sh on the machines w/o GPUs…

Yes, that's the workaround I've arrived at; I'm just hoping the maintainers might be willing to find a better solution.


krono commented Oct 22, 2024

Ultimately, it's up to you which hooks you deploy; if these "hooks" were an integral part of enroot, the devs would probably not have externalised them into hooks.
But again, that's just my opinion :)


3XX0 commented Oct 22, 2024

Or add the following on your CPU nodes:

# /etc/enroot/environ.d/50-cpu.env
NVIDIA_VISIBLE_DEVICES="void"

You can also write your own hook which sets this conditionally if you're using GRES.
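
A very rough sketch of such a hook; the file name, the Slurm variable, and the use of ENROOT_ENVIRON to push the value into the container environment are all assumptions to verify against your enroot/pyxis setup, not something confirmed here:

#!/bin/bash
# Hypothetical /etc/enroot/hooks.d/10-cpu-default.sh (name made up).
# If the job did not request GPUs through GRES, force "void" so that
# 98-nvidia.sh becomes a no-op. SLURM_JOB_GPUS is only one of the
# variables Slurm may export for GRES; check what your site provides.
set -eu

if [ -z "${SLURM_JOB_GPUS-}" ]; then
    echo "NVIDIA_VISIBLE_DEVICES=void" >> "${ENROOT_ENVIRON}"
fi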

errordeveloper commented:

> Or add the following on your CPU nodes:
>
> # /etc/enroot/environ.d/50-cpu.env
> NVIDIA_VISIBLE_DEVICES="void"

I am pretty sure we have tried this and it didn't work. Are you sure this would be loaded before the hooks?

> You can also write your own hook which sets this conditionally if you're using GRES.

Any pointers? I don't see much documentation around the hooks, and from reading '98-nvidia.sh' and '99-mellanox.sh' I am not sure I can tell how they are meant to work. The first one does an 'exec', while the second is supposed to run after it, but how can that happen if there was an 'exec'?


flx42 commented Oct 23, 2024

> I am pretty sure we have tried this and it didn't work. Are you sure this would be loaded before the hooks?

That's what we use on our clusters; it should work fine.
