
cuQuantum appliance. Error detecting nvidia GPU #169

Open
LB-Navarro opened this issue Dec 24, 2024 · 5 comments

@LB-Navarro

Hi all,
I'm trying to run the cuQuantum Appliance container (cuquantum-appliance:24.03-x86_64) on a virtual machine:
kernel: 4.18.0-305.25.1.el8_4.x86_64
OS: Red Hat Enterprise Linux 8.10 (Ootpa)
podman: podman version 4.9.4-rhel

On the virtual host machine, the nvidia-smi output looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-8C       On   | 00000000:00:06.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Some hints that something is not running properly:

  1. The entrypoint.sh prompts that no nvidia devices are detected. This happens because the following shell command fails to match any file (see the sketch after this list):
    find /dev -name nvidia -type f
    The same command also fails on the host machine.
  2. The nvidia-smi command works inside the container, but any attempt to run the examples triggers device errors:
python simon.py
Secret string = [0 0 1]
CUDA error: operation not supported device_management.h 64
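
For context, a minimal sketch of why such a check can come up empty: the NVIDIA nodes under /dev are character devices, so a find restricted to regular files (-type f) matches nothing, while -type c does. The stat call is only an illustrative way to confirm the file type; the actual entrypoint.sh logic is not reproduced here.

    # Regular-file search: matches nothing, because /dev/nvidia* are device nodes
    find /dev -name "nvidia*" -type f

    # Character-device search: lists the GPU device nodes if they are exposed
    find /dev -name "nvidia*" -type c

    # Confirm the file type of a single node (GNU coreutils stat)
    stat -c '%F %n' /dev/nvidia0   # expected: character special file /dev/nvidia0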

Any hints on what the issue is?
Thanks
luis

@ymagchi
Collaborator

ymagchi commented Dec 24, 2024

Hi @LB-Navarro,

I would like to check your environment:

  1. Could you please check if the device files exist inside the container with the following command?
    find /dev -name "nvidia*" -type c

  2. Do you use CDI for Podman? I think this issue is not specific to the cuQuantum Appliance container, and some Podman configs may need to be revised. (For the CDI setup, please refer to https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-podman; for related issues, see "Podman cannot use CUDA under rootless container" containers/podman#9926 and "Running nvidia-container-runtime with podman is blowing up" nvidia-container-runtime#85.) A sketch of the usual CDI setup steps follows this list.
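
For reference, a minimal sketch of the CDI setup from the linked guide, assuming the NVIDIA Container Toolkit is installed and using the guide's default output path; adjust the test image to taste:

    # Generate the CDI specification from the installed driver
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    # Confirm the devices that CDI-aware runtimes can see
    nvidia-ctk cdi list

    # Smoke test: run a throwaway container against all GPUs via CDI
    sudo podman run --rm --security-opt=label=disable \
        --device=nvidia.com/gpu=all \
        ubuntu nvidia-smi -L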

@LB-Navarro
Author

Hi there,
these are the devices exposed in the container:

(container) $ find /dev -name "nvidia*" -type c
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-uvm-tools
/dev/nvidia-uvm
/dev/nvidia-modeset

On your second point: yes, I think I'm using CDI for Podman. I followed the instructions in the NVIDIA Container Toolkit documentation (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html), and I run the container with:
sudo podman run -it --security-opt=label=disable --device=nvidia.com/gpu=all ourmirror_repo/cuquantum-appliance:24.03-x86_64
Some of the links you referred to are for older versions and/or assume running the container in rootless mode. The weird part is that nvidia-smi, which is usually a good proxy for things working, gives exactly the same output on the host and in the container, yet the cuQuantum examples fail to run properly.

Let me know if you need any further info.

@ymagchi
Collaborator

ymagchi commented Dec 30, 2024

Thank you for the information. Could you please try running with the --privileged option, as sketched below, to see if the error results from other security restrictions?
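
A minimal sketch of such a run, reusing the command from the earlier comment (ourmirror_repo is the reporter's internal mirror, not an official registry):

    sudo podman run -it --privileged --security-opt=label=disable \
        --device=nvidia.com/gpu=all \
        ourmirror_repo/cuquantum-appliance:24.03-x86_64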

@LB-Navarro
Author

Hi,
It actually made the nvidia-cap devices visible:

find /dev -name "nvidia*" -type c
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-uvm-tools
/dev/nvidia-uvm
/dev/nvidia-modeset
/dev/nvidia-caps/nvidia-cap2
/dev/nvidia-caps/nvidia-cap1
  • the container still launches with the "no nvidia devices detected" message
  • nvidia-smi output is still identical to the host machine's
  • the examples still do not run properly

@ymagchi
Collaborator

ymagchi commented Jan 1, 2025

Thank you for checking that.

Could you please check if CDI is properly configured? (A combined check is sketched after the list below.)

  • podman info | grep -i cgroup shows that cgroup v2 is used.
  • The /etc/nvidia-container-runtime/config.toml file defines no-cgroups = false.
  • sudo nvidia-ctk cdi list shows that nvidia.com/gpu=all has been recognized.
  • echo $NVIDIA_DRIVER_CAPABILITIES inside the container shows compute,utility.
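
A minimal sketch that combines those checks from the host. The grep is only a convenience for locating the no-cgroups setting, and the last command assumes the image's entrypoint passes the given command through; otherwise run the echo from an interactive shell inside the container:

    # cgroup version/manager reported by Podman (expect cgroup v2)
    podman info | grep -i cgroup

    # no-cgroups setting in the NVIDIA container runtime config
    grep -n "no-cgroups" /etc/nvidia-container-runtime/config.toml

    # CDI devices known to the toolkit (expect nvidia.com/gpu=all among them)
    sudo nvidia-ctk cdi list

    # Driver capabilities visible inside the container (expect compute,utility)
    sudo podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all \
        ourmirror_repo/cuquantum-appliance:24.03-x86_64 \
        bash -c 'echo $NVIDIA_DRIVER_CAPABILITIES'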

If the above checks show the expected outputs, could you please share the error log by enabling the debug = ... lines in the config.toml file?
