[Feature]: Better headers in --showpids #166

al42and · 2024-05-02T13:29:25Z

Suggestion Description

rocm-smi --showpids reports the number of GPUs used by the process.
However, the presentation makes it easy to assume that it shows which GPUs are used.

Users of our application get confused and think that all the processes run on the same GPU:

$ rocm-smi --showpids


========================= ROCm System Management Interface =========================
================================== KFD Processes ===================================
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
get_compute_process_info_by_pid, Not supported on the given system
KFD process information
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
55573   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55571   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55574   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
55572   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN     
====================================================================================
=============================== End of ROCm SMI Log ================================

Compare this with how nvidia-smi reports similar information:

$ nvidia-smi 
[......]
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    211667      C   gmx                               320MiB |
|    1   N/A  N/A    211667      C   gmx                               148MiB |
+-----------------------------------------------------------------------------+

It would be better if rocm-smi --showpids output was more clear that it reported the number of GPUs used, not their indices.
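
To illustrate the distinction, here is a minimal Python sketch that parses the table rows shown above and labels the third column as a count. The function name `parse_showpids` and the `num_gpus` key are hypothetical, and the parsing assumes the whitespace-separated layout seen in this issue (with no spaces in process names), not any documented output format.

```python
# Two rows copied from the --showpids output quoted in this issue.
SAMPLE = """\
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
55573   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN
55571   gmx_mpi         1       UNKNOWN         UNKNOWN         UNKNOWN
"""


def parse_showpids(text):
    """Parse KFD process rows (hypothetical helper, layout assumed from this issue)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric PID and have a numeric third column;
        # headers, banners, and error lines are skipped.
        if len(fields) < 3 or not fields[0].isdigit() or not fields[2].isdigit():
            continue
        rows.append({
            "pid": int(fields[0]),
            "name": fields[1],
            # How many GPUs the process uses, NOT a GPU index.
            "num_gpus": int(fields[2]),
        })
    return rows


for row in parse_showpids(SAMPLE):
    print(f"PID {row['pid']} ({row['name']}) uses {row['num_gpus']} GPU(s)")
```

Naming the field `num_gpus` in any downstream tooling makes it harder to mistake the value for a device index.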

The help output is also unclear about the differences between the two options:

  --showpids                                                       Show current running KFD PIDs
  --showpidgpus [SHOWPIDGPUS [SHOWPIDGPUS ...]]                    Show GPUs used by specified KFD PIDs (all if no arg
                                                                   given)

With an old kernel, when rocm-smi cannot get the information, it is even more confusing: instead of N/A, it reports 0, which can be interpreted either as GPU #0 or as no GPUs being used. Neither is correct!

$ rocm-smi  --showpids


======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
Not supported on the given system
Not supported on the given system
Not supported on the given system
Not supported on the given system
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
129835  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129836  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129834  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
129837  gmx_mpi         0       UNKNOWN         UNKNOWN         UNKNOWN     
================================================================================
============================= End of ROCm SMI Log ==============================
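
The ambiguity could be avoided by rendering a failed query as N/A instead of 0. The following is a minimal sketch of that idea, assuming the tool can tell a failed query apart from a genuine zero. `format_gpu_count` is a hypothetical helper and is not part of rocm-smi.

```python
from typing import Optional


def format_gpu_count(count: Optional[int]) -> str:
    """Render a GPU count for display; None means the query failed."""
    if count is None:
        # Unknown is not the same as zero, so do not print 0 here.
        return "N/A"
    return str(count)


print(format_gpu_count(None))
print(format_gpu_count(2))
```

Keeping "unknown" as a separate value until the point of display means a real count of 0 and a failed lookup never collapse into the same output.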

Operating System

SLES 15

GPU

MI250X

ROCm Component

rocm_smi_lib

littlecutebird · 2024-12-02T09:30:55Z

Use both flags: rocm-smi --showpids --showpidgpus

al42and · 2024-12-10T16:02:33Z

Use both flags: rocm-smi --showpids --showpidgpus

Sure, that works if the user already knows that there are two flags. My point is that the way --showpids behaves on its own is misleading and causes confusion.

sohaibnd · 2024-12-23T20:59:51Z

Hi @al42and, sorry for the late response.

It would be better if rocm-smi --showpids output was more clear that it reported the number of GPUs used, not their indices.

The column name can be changed from "GPU(s)" to "# of GPUs" to be more clear. However, note that rocm-smi will be deprecated in the future in favor of the new amd-smi tool, so you may want to transition to using amd-smi in your application. In amd-smi, you can get information about the processes using your GPUs with the amd-smi process command (use amd-smi process --help for more options):

The help output is also unclear about the differences between the two options:

Could you elaborate on what's not clear?

With an old kernel, when rocm-smi cannot get the information, it is even more confusing: instead of N/A, it reports 0, which can be interpreted either as GPU #0 or as no GPUs being used. Neither is correct!

What do you mean by old kernel here? Is it a process that has finished running?

al42and · 2024-12-23T23:02:30Z

rocm-smi will be deprecated in the future in favor of the new amd-smi tool

Nice! It's confusing to have multiple tools with different behavior, so deprecating one of the two makes sense. Will rocm-smi be gone in ROCm 7.x?

Could you elaborate on what's not clear?

One option says "Show current running KFD PIDs"; the other says "Show GPUs used by specified KFD PIDs (all if no arg given)". I don't see how it should be clear that the first one does not show which GPUs are used but still shows the number of GPUs used. The existence of the second option hints at that, but requiring the user to read that much into nuances is, IMO, not nice.
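
As a rough illustration of what more explicit help text could look like, here is a sketch using Python's standard `argparse` module. The program name `smi-sketch` and the reworded help strings are hypothetical; this is not how rocm-smi defines its options, only one possible wording.

```python
import argparse

# Hypothetical rewording: say up front that --showpids reports a count.
SHOWPIDS_HELP = "Show running KFD PIDs and the number of GPUs each one uses"
SHOWPIDGPUS_HELP = (
    "Show which GPUs are used by the specified KFD PIDs (all if no arg given)"
)

parser = argparse.ArgumentParser(prog="smi-sketch")
parser.add_argument("--showpids", action="store_true", help=SHOWPIDS_HELP)
parser.add_argument("--showpidgpus", nargs="*", metavar="PID", help=SHOWPIDGPUS_HELP)

# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = parser.parse_args(["--showpids"])
print(args.showpids)
print(parser.format_help())
```

With wording along these lines, the difference between "how many GPUs" and "which GPUs" is stated in the help text instead of being left to inference.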

What do you mean by old kernel here?

Linux kernel. Some old version we had on our Cray machine at the time of filing this issue.

sohaibnd · 2024-12-23T23:43:21Z

Nice! It's confusing to have multiple tools with different behavior, so deprecating one of the two makes sense. Will rocm-smi be gone in ROCm 7.x?

I would expect it to be deprecated by that point but I can't say for sure.

One option says "Show current running KFD PIDs"; the other says "Show GPUs used by specified KFD PIDs (all if no arg given)". I don't see how it should be clear that the first one does not show which GPUs are used but still shows the number of GPUs used. The existence of the second option hints at that, but requiring the user to read that much into nuances is, IMO, not nice.

If we're being pedantic, the user shouldn't assume --showpids will include anything other than the current running KFD PIDs as mentioned in the help output. The other columns (like # of GPUs or process name) are supplementary output. If a user is looking for which GPUs are being used by a process, the help output also makes it clear that the appropriate option is --showpidgpus.
However, I do agree that the existence of both options is confusing. amd-smi is better in that regard, the commands and options are better organized.

Linux kernel. Some old version we had on our Cray machine at the time of filing this issue.

Is this issue still present in that old kernel version?

al42and · 2024-12-24T15:57:55Z

If we're being pedantic, the user shouldn't assume --showpids will include anything other than the current running KFD PIDs as mentioned in the help output.

Should the users assume that hipDriverGetVersion returns 4? Too bad, it returns something different. And, until recently, HIP documentation had a few cuda* symbols left around, e.g., hipIpcOpenEventHandle apparently took cudaIpcGetEventHandle as its input; should users also assume that is the case? :)

Things are improving, and that's the main thing. But, so far, ROCm has been actively discouraging users from reading too much into exact wording of its documentation :)

However, I do agree that the existence of both options is confusing. amd-smi is better in that regard, the commands and options are better organized.

👍

Is this issue still present in that old kernel version?

That old kernel version is no longer present on our machine, so cannot say.

sohaibnd · 2024-12-24T20:30:12Z

Should the users assume that hipDriverGetVersion returns 4? Too bad, it returns something different.

That is a mistake in the documentation. Thanks for pointing it out, I will get that fixed.

Things are improving, and that's the main thing. But, so far, ROCm has been actively discouraging users from reading too much into exact wording of its documentation :)

I share your concern. We are actively trying to improve our documentation so if you come across any other mistakes or confusing wording in the docs, let me know or create another ticket and we will have it fixed!

That old kernel version is no longer present on our machine, so cannot say.

Alright, if you come across it again or remember the kernel version and ROCm version the issue was reproduced on, let me know and I can look into the issue further.

ppanchad-amd added the enhancement label Oct 17, 2024

ppanchad-amd added the Under Investigation label Dec 18, 2024
