
cuQuantum appliance. Error detecting nvidia GPU #169

Open
LB-Navarro opened this issue Dec 24, 2024 · 5 comments

@LB-Navarro

Hi all,
I'm trying to run the cuQuantum Appliance container (cuquantum-appliance:24.03-x86_64) on a virtual machine:
kernel: 4.18.0-305.25.1.el8_4.x86_64
OS: Red Hat Enterprise Linux 8.10 (Ootpa)
podman: podman version 4.9.4-rhel

On the virtual host machine, the nvidia-smi output looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.05    Driver Version: 525.85.05    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-8C       On   | 00000000:00:06.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB /  8192MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Some hints that something is not running properly:

  1. The entrypoint.sh prompts that no nvidia devices are detected. This happens because the following shell command fails to match any file (see the sketch after this list):
    find /dev -name nvidia -type f
    The same command also fails on the host machine.
  2. The nvidia-smi command works inside the container, but any attempt to run the examples triggers device errors:
python simon.py
Secret string = [0 0 1]
CUDA error: operation not supported device_management.h 64
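
For context, a minimal sketch of why such a check can come up empty: the NVIDIA nodes under /dev are character devices, so a find restricted to regular files (-type f) matches nothing, while -type c does. The stat call is only an illustrative way to confirm the file type; the actual entrypoint.sh logic is not reproduced here.

    # Regular-file search: matches nothing, because /dev/nvidia* are device nodes
    find /dev -name "nvidia*" -type f

    # Character-device search: lists the GPU device nodes if they are exposed
    find /dev -name "nvidia*" -type c

    # Confirm the file type of a single node (GNU coreutils stat)
    stat -c '%F %n' /dev/nvidia0   # expected: character special file /dev/nvidia0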

Any hints on what the issue is?
Thanks
luis

@ymagchi
Collaborator

ymagchi commented Dec 24, 2024

Hi @LB-Navarro,

I would like to check your environment:

  1. Could you please check if the device files exist inside the container with the following command?
    find /dev -name "nvidia*" -type c

  2. Do you use CDI for Podman? I think this issue is not specific to the cuQuantum Appliance container, and some Podman configs may need to be revised. (For the CDI setup, please refer to https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-podman; for related issues, see "Podman cannot use CUDA under rootless container" containers/podman#9926 and "Running nvidia-container-runtime with podman is blowing up" nvidia-container-runtime#85.) A sketch of the usual CDI setup steps follows this list.
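
For reference, a minimal sketch of the CDI setup from the linked guide, assuming the NVIDIA Container Toolkit is installed and using the guide's default output path; adjust the test image to taste:

    # Generate the CDI specification from the installed driver
    sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

    # Confirm the devices that CDI-aware runtimes can see
    nvidia-ctk cdi list

    # Smoke test: run a throwaway container against all GPUs via CDI
    sudo podman run --rm --security-opt=label=disable \
        --device=nvidia.com/gpu=all \
        ubuntu nvidia-smi -L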

@LB-Navarro
Author

Hi there,
these are the devices exposed in the container:

(container) $ find /dev -name "nvidia*" -type c
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-uvm-tools
/dev/nvidia-uvm
/dev/nvidia-modeset

On your second point: yes, I think I'm using CDI for Podman. I followed the instructions in the NVIDIA Container Toolkit documentation (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html), and I run the container with:
sudo podman run -it --security-opt=label=disable --device=nvidia.com/gpu=all ourmirror_repo/cuquantum-appliance:24.03-x86_64
Some of the links you referred to are for older versions and/or assume running the container in rootless mode. The weird part is that nvidia-smi, which is usually a good proxy for things working, gives exactly the same output on the host and in the container, yet the cuQuantum examples fail to run properly.

Let me know if you need any further info.

@ymagchi
Collaborator

ymagchi commented Dec 30, 2024

Thank you for the information. Could you please try running with the --privileged option, as sketched below, to see if the error results from other security restrictions?
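
A minimal sketch of such a run, reusing the command from the earlier comment (ourmirror_repo is the reporter's internal mirror, not an official registry):

    sudo podman run -it --privileged --security-opt=label=disable \
        --device=nvidia.com/gpu=all \
        ourmirror_repo/cuquantum-appliance:24.03-x86_64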

@LB-Navarro
Author

Hi,
It actually made the nvidia-cap devices visible:

find /dev -name "nvidia*" -type c
/dev/nvidia0
/dev/nvidiactl
/dev/nvidia-uvm-tools
/dev/nvidia-uvm
/dev/nvidia-modeset
/dev/nvidia-caps/nvidia-cap2
/dev/nvidia-caps/nvidia-cap1
  • the container still launches with the "no nvidia devices detected" message
  • nvidia-smi output is still identical to the host machine's
  • the examples still do not run properly

@ymagchi
Collaborator

ymagchi commented Jan 1, 2025

Thank you for checking that.

Could you please check if CDI is properly configured? (A combined check is sketched after the list below.)

  • podman info | grep -i cgroup shows that cgroup v2 is used.
  • The /etc/nvidia-container-runtime/config.toml file defines no-cgroups = false.
  • sudo nvidia-ctk cdi list shows that nvidia.com/gpu=all has been recognized.
  • echo $NVIDIA_DRIVER_CAPABILITIES inside the container shows compute,utility.
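
A minimal sketch that combines those checks from the host. The grep is only a convenience for locating the no-cgroups setting, and the last command assumes the image's entrypoint passes the given command through; otherwise run the echo from an interactive shell inside the container:

    # cgroup version/manager reported by Podman (expect cgroup v2)
    podman info | grep -i cgroup

    # no-cgroups setting in the NVIDIA container runtime config
    grep -n "no-cgroups" /etc/nvidia-container-runtime/config.toml

    # CDI devices known to the toolkit (expect nvidia.com/gpu=all among them)
    sudo nvidia-ctk cdi list

    # Driver capabilities visible inside the container (expect compute,utility)
    sudo podman run --rm --security-opt=label=disable --device=nvidia.com/gpu=all \
        ourmirror_repo/cuquantum-appliance:24.03-x86_64 \
        bash -c 'echo $NVIDIA_DRIVER_CAPABILITIES'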

If the above checks show the expected outputs, could you please share the error log by enabling the debug = ... lines in the config.toml file?
