[Feature]: Better headers in --showpids #166

Open
al42and opened this issue May 2, 2024 · 7 comments
Comments

al42and commented May 2, 2024

Suggestion Description

rocm-smi --showpids reports the number of GPUs used by the process.
However, the presentation makes it easy to assume that it shows which GPUs are used.

Users of our application are getting confused, thinking that all the processes run on the same GPU:

$ rocm-smi --showpids


========================= ROCm System Management Interface =========================
================================== KFD Processes ===================================
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
KFD process information
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
55573   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55571   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55574   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55572   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
====================================================================================
=============================== End of ROCm SMI Log ================================

Compare this with how nvidia-smi reports similar information:

$ nvidia-smi 
[......]
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    211667      C   gmx                               320MiB |
|    1   N/A  N/A    211667      C   gmx                               148MiB |
+-----------------------------------------------------------------------------+

It would be better if rocm-smi --showpids output was more clear that it reported the number of GPUs used, not their indices.

The help output is also unclear about the differences between the two options:

  --showpids                                                       Show current running KFD PIDs
  --showpidgpus [SHOWPIDGPUS [SHOWPIDGPUS ...]]                    Show GPUs used by specified KFD PIDs (all if no arg
                                                                   given)

With an old kernel, when rocm-smi cannot get the information, it is even more confusing: instead of N/A, it reports 0, which can be interpreted either as GPU #0 or as no GPUs being used. Neither interpretation is correct!

$ rocm-smi  --showpids


======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
Not supported on the given system
Not supported on the given system
Not supported on the given system
Not supported on the given system
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
129835  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129836  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129834  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129837  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
================================================================================
============================= End of ROCm SMI Log ==============================
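
Editorial sketch: one way to avoid misreading the `GPU(s)` column in scripts is to parse it into a field whose name says it is a count. The parser below works only on the text layout shown in the two outputs above. The names `KfdProcess` and `parse_showpids` are hypothetical, and it assumes process names contain no whitespace.

```python
import re
from typing import List, NamedTuple


class KfdProcess(NamedTuple):
    pid: int
    name: str
    gpu_count: int  # number of GPUs in use, NOT a GPU index


# Matches data rows such as "55573   gmx_mpi         1       UNKNOWN ...".
# Assumption: the process name contains no whitespace.
_ROW = re.compile(r"^(\d+)\s+(\S+)\s+(\d+)\s")


def parse_showpids(text: str) -> List[KfdProcess]:
    """Extract PID, process name and GPU count from --showpids text output."""
    rows = []
    for line in text.splitlines():
        match = _ROW.match(line)
        if match:  # banner, header and error lines do not match and are skipped
            pid, name, count = match.groups()
            rows.append(KfdProcess(int(pid), name, int(count)))
    return rows


# Two rows copied from the output quoted in this issue.
SAMPLE = """\
KFD process information
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
55573   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN
55571   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN
"""

for proc in parse_showpids(SAMPLE):
    print(f"PID {proc.pid} ({proc.name}) uses {proc.gpu_count} GPU(s)")
```

Note that a parsed count of 0 may mean "could not be determined" rather than "no GPUs", as the old-kernel output above shows.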

Operating System

SLES 15

GPU

MI250X

ROCm Component

rocm_smi_lib

@littlecutebird

Use both flags: rocm-smi --showpids --showpidgpus

@al42and
Author

al42and commented Dec 10, 2024

> Use both flags: rocm-smi --showpids --showpidgpus

Sure, that works if the user already knows that there are two flags. My point is that the way --showpids behaves on its own is misleading and causes confusion.

@sohaibnd

Hi @al42and, sorry for the late response.

> It would be better if rocm-smi --showpids output was more clear that it reported the number of GPUs used, not their indices.

The column name can be changed from "GPU(s)" to "# of GPUs" to be more clear. However, note that rocm-smi will be deprecated in the future in favor of the new amd-smi tool, so you may want to transition to using amd-smi in your application. In amd-smi, you can get information about the processes using your GPUs with the amd-smi process command (use amd-smi process --help for more options):

[Screenshot: output of the amd-smi process command]

> The help output is also unclear about the differences between the two options:

Could you elaborate on what's not clear?

> With an old kernel, when rocm-smi cannot get the information, it is even more confusing: instead of N/A, it reports 0, which can be interpreted either as GPU #0 or as no GPUs being used. Neither interpretation is correct!

What do you mean by old kernel here? Is it a process that has finished running?
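
Editorial sketch: the two presentation changes discussed so far (a "# of GPUs" header, and N/A instead of 0 when the count is unavailable) could look roughly like this. It is not rocm-smi's actual code. `format_gpu_count` and `render_table` are hypothetical names, and representing "unavailable" as `None` is an assumption.

```python
from typing import Iterable, Optional, Tuple

# Header wording proposed in the comment above.
HEADER = ("PID", "PROCESS NAME", "# of GPUs")
ROW_FORMAT = "{:<8}{:<16}{}"


def format_gpu_count(count: Optional[int]) -> str:
    """Render a GPU count; None (assumed here) means it could not be queried."""
    return "N/A" if count is None else str(count)


def render_table(rows: Iterable[Tuple[int, str, Optional[int]]]) -> str:
    """Render (pid, name, gpu_count) rows under an unambiguous header."""
    lines = [ROW_FORMAT.format(*HEADER)]
    for pid, name, count in rows:
        lines.append(ROW_FORMAT.format(pid, name, format_gpu_count(count)))
    return "\n".join(lines)


# One process with a known count, one where the count is unavailable.
print(render_table([(55573, "gmx_mpi", 1), (129835, "gmx_mpi", None)]))
```

Keeping "unknown" distinct from a real zero means a reader can no longer mistake the value for either GPU #0 or "no GPUs".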

@al42and
Author

al42and commented Dec 23, 2024

> rocm-smi will be deprecated in the future in favor of the new amd-smi tool

Nice! It's confusing to have multiple tools with different behavior, so deprecating one of the two makes sense. Will rocm-smi be gone in ROCm 7.x?

> Could you elaborate on what's not clear?

One option says "Show current running KFD PIDs"; the other says "Show GPUs used by specified KFD PIDs (all if no arg given)". I don't see how it should be clear that the first one does not show which GPUs are used but still shows the number of GPUs used. The existence of the second option hints at that, but requiring the user to read that much into nuances is, IMO, not nice.

> What do you mean by old kernel here?

Linux kernel. Some old version we had on our Cray machine at the time of filing this issue.
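
Editorial sketch: the help-text concern above could be addressed by having each option's help string say what it does and does not show. The example uses Python's standard `argparse`, whose `nargs="*"` produces the same `[X [X ...]]` usage pattern as the help output quoted in this issue. The reworded help strings, the `PID` metavar, and the `smi-help-sketch` program name are hypothetical, not rocm-smi's actual text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with two options whose help texts reference each other."""
    parser = argparse.ArgumentParser(prog="smi-help-sketch")
    parser.add_argument(
        "--showpids",
        action="store_true",
        help="Show current running KFD PIDs and the number of GPUs each uses "
             "(see --showpidgpus for which GPUs)",
    )
    parser.add_argument(
        "--showpidgpus",
        nargs="*",
        metavar="PID",
        help="Show which GPUs are used by the specified KFD PIDs "
             "(all if no arg given)",
    )
    return parser


if __name__ == "__main__":
    print(build_parser().format_help())
```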

@sohaibnd

> Nice! It's confusing to have multiple tools with different behavior, so deprecating one of the two makes sense. Will rocm-smi be gone in ROCm 7.x?

I would expect it to be deprecated by that point but I can't say for sure.

> One option says "Show current running KFD PIDs"; the other says "Show GPUs used by specified KFD PIDs (all if no arg given)". I don't see how it should be clear that the first one does not show which GPUs are used but still shows the number of GPUs used. The existence of the second option hints at that, but requiring the user to read that much into nuances is, IMO, not nice.

If we're being pedantic, the user shouldn't assume --showpids will include anything other than the current running KFD PIDs as mentioned in the help output. The other columns (like # of GPUs or process name) are supplementary output. If a user is looking for which GPUs are being used by a process, the help output also makes it clear that the appropriate option is --showpidgpus.
However, I do agree that the existence of both options is confusing. amd-smi is better in that regard: its commands and options are better organized.

> Linux kernel. Some old version we had on our Cray machine at the time of filing this issue.

Is this issue still present in that old kernel version?

@al42and
Author

al42and commented Dec 24, 2024

> If we're being pedantic, the user shouldn't assume --showpids will include anything other than the current running KFD PIDs as mentioned in the help output.

Should the users assume that hipDriverGetVersion returns 4? Too bad, it returns something different. And, until recently, HIP documentation had a few cuda* symbols left around, e.g., hipIpcOpenEventHandle apparently took cudaIpcGetEventHandle as its input; should users also assume that is the case? :)

Things are improving, and that's the main thing. But, so far, ROCm has been actively discouraging users from reading too much into the exact wording of its documentation :)

> However, I do agree that the existence of both options is confusing. amd-smi is better in that regard: its commands and options are better organized.

👍

> Is this issue still present in that old kernel version?

That old kernel version is no longer present on our machine, so I cannot say.

@sohaibnd

> Should the users assume that hipDriverGetVersion returns 4? Too bad, it returns something different.

That is a mistake in the documentation. Thanks for pointing it out, I will get that fixed.

> Things are improving, and that's the main thing. But, so far, ROCm has been actively discouraging users from reading too much into the exact wording of its documentation :)

I share your concern. We are actively trying to improve our documentation so if you come across any other mistakes or confusing wording in the docs, let me know or create another ticket and we will have it fixed!

> That old kernel version is no longer present on our machine, so I cannot say.

Alright, if you come across it again or remember the kernel version and ROCm version the issue was reproduced on, let me know and I can look into the issue further.
