
Investigate nvidia_drm preventing GPU reset when applying MIG profile #387

cmd-ntrf opened this issue Oct 1, 2024 · 1 comment

cmd-ntrf commented Oct 1, 2024

mig-parted apply returns the following error under some circumstances:

time="2024-09-30T19:49:46Z" level=error msg:

    The following GPUs could not be reset:
      GPU 00000000:00:06.0: In use by another client

    1 device is currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using this device and all compute applications running in the system.

There are no other services or processes that use the GPU, yet running sudo modprobe -r nvidia_drm allows mig-parted to succeed afterwards. Given that DRM stands for Direct Rendering Manager, I am not sure we need this kernel module.
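A minimal sketch of the workaround described above: unload nvidia_drm before applying the MIG profile, guarding each step so the script is a no-op on machines without the module or the tool. The config path is hypothetical; adjust it to your deployment.

```shell
# Unload nvidia_drm only if it is currently loaded; it is what keeps the
# GPU busy and prevents the reset (assumes root privileges).
if lsmod 2>/dev/null | grep -q '^nvidia_drm'; then
    modprobe -r nvidia_drm
fi

# Apply the MIG profile now that the GPU can be reset.
# (/etc/nvidia-mig-manager/config.yaml is an assumed path.)
if command -v nvidia-mig-parted >/dev/null 2>&1; then
    nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config.yaml
fi

workaround_done=yes
```

Reloading nvidia_drm afterwards (modprobe nvidia_drm) restores DRM support if graphics workloads still need it.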

@cmd-ntrf cmd-ntrf added the bug Something isn't working label Oct 1, 2024
@cmd-ntrf cmd-ntrf self-assigned this Oct 1, 2024
@bartoldeman commented
As far as I know DRM and /dev/dri/cardX devices are used by EGL, and hence by VirtualGL if you want to run something on a compute node that renders via the GPU.
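Before removing the module, one way to check whether anything actually holds the DRM devices open is to query each /dev/dri node with fuser (from psmisc); no output means no users. A sketch, with the device glob guarded in case no DRM devices exist:

```shell
# List processes holding each DRM card device open; prints nothing
# when the devices are unused (or absent).
for dev in /dev/dri/card*; do
    [ -e "$dev" ] || continue
    fuser -v "$dev" 2>/dev/null || true
done
checked=yes
```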
