[Issue]: Incorrect Energy Consumption Reported by amdsmi_get_energy_count() Method #38
Comments
@parthraut Can you give me the output of the following command?
$ amd-smi version
I believe that #22 sees the same thing. We had an issue with updating our internal tables within the same instance of amd-smi. If that is the case, the 6.1.X branch has this fix.
To chime in, this happens when the GPU is idle -- the energy counter does not change at all. (The issue reported by #22 was mostly resolved by the update to ROCm 6.1.2. Especially, the |
Hi @marifamd, I was wondering about the status of this issue. Will this be fixed in the next ROCm version?
I just discovered that with the same ROCm and amd-smi versions, energy counters work correctly on MI210 and MI250 GPUs, whereas the exact same amdsmi script doesn't work on MI100. So I think this is not an issue with amd-smi or rocm-smi. Would this then be a ROCm problem? Or are there more layers in between? @marifamd
@jaywonchung You are correct, thanks for the debug support. @parthraut We will look into this with our driver team and reach back out. I'm not seeing these issues on MI200, MI210, or MI300X. This is how the expected output should look:
Yep, our output on MI210 is as expected:
Now this is no longer a blocker for us. Please kindly keep us posted on the MI100 driver issue. Thanks.
Could you try rebooting your MI100? I would like to check if this also fixes your issue. I am seeing the counters update on my system after a reboot. Before rebooting, I observed the same behavior as you. We're talking with the driver team to see if there is a limitation for MI100 energy counters, and will keep you updated as we learn more.
I see. We actually do not own the nodes; we're using those generously provided by the AMD HPC Fund. I've posted a request there: AMDResearch/hpcfund#25.
@charis-poag-amd We were just now able to try it out on a freshly rebooted node, but unfortunately it seems like the issue persists.
Hi,
Both CLIs ( This experiment was performed on an Instinct MI210 and we are using ROCm version 6.1.3 |
Regarding the discrepancy between AMD SMI PY and AMD SMI CLI -- Did you multiply the energy counter and counter resolution when using the raw Python bindings? I figured the multiplication has to be done from the following: amdsmi/amdsmi_cli/amdsmi_commands.py Lines 1878 to 1897 in f4506cf
THIS IS IT! I had to multiply the counter by the resolution, and now the result matches the energy reported by both the AMD-SMI CLI and the ROCM-SMI CLI, passing multiple sanity checks. This does make me question what |
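The multiplication described in this comment can be sketched as follows. This is a minimal sketch, not the CLI's actual code: it assumes the dictionary returned by amdsmi.amdsmi_get_energy_count() holds the raw tick count under "power" (the key name discussed in this thread) and the per-tick resolution under "counter_resolution", and it assumes the resolution is expressed in microjoules, which should be checked against the documentation of the installed version. The helper names counter_to_joules and read_energy_joules are hypothetical.

```python
def counter_to_joules(energy_dict, raw_key="power"):
    """Convert a raw energy counter reading to joules.

    Assumption: energy_dict[raw_key] is a tick count and
    energy_dict["counter_resolution"] is microjoules per tick.
    """
    ticks = energy_dict[raw_key]
    microjoules_per_tick = energy_dict["counter_resolution"]
    return ticks * microjoules_per_tick / 1_000_000


def read_energy_joules(device_index=0):
    """Read the accumulated energy of one GPU through the amdsmi bindings."""
    # Imported lazily so counter_to_joules can be used without ROCm installed.
    import amdsmi

    amdsmi.amdsmi_init()
    try:
        handle = amdsmi.amdsmi_get_processor_handles()[device_index]
        return counter_to_joules(amdsmi.amdsmi_get_energy_count(handle))
    finally:
        amdsmi.amdsmi_shut_down()
```

Calling read_energy_joules() twice with a known delay in between and subtracting the results gives an energy delta that can be compared with the value reported by the CLI.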
@AdithyaRaman @jaywonchung Thank you for bringing this to our attention. We've determined that energy_dict['power'] is indeed misleading, so we will update the value to say energy_accumulator instead.
Thanks @gabrpham, I think that will improve the API. After the change, it would be nice if there's a way to maintain backwards compatibility w.r.t. the AMDSMI version with some simple logic. For example:
if "energy_counter" in energy_dict:  # New API
    energy = energy_dict["energy_counter"]
elif "power" in energy_dict and "counter_resolution" in energy_dict:  # Old API
    energy = energy_dict["power"] * energy_dict["counter_resolution"]
else:
    raise ValueError
@jaywonchung It would be more like:
We were incorrectly labeling the "energy_counter" as "power". This change will be in the C library and the Python API. We'll update the full breadth in the changelog. However, the values are not changing.
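Since only the key name changes and the values stay the same, a caller that has to work across versions could look the raw counter up under either name and still multiply by the resolution in both cases. The sketch below assumes the new key is "energy_accumulator", as stated earlier in this thread, and that the old key is "power"; the function name raw_energy_counter is hypothetical and is not part of amdsmi.

```python
def raw_energy_counter(energy_dict):
    """Return the raw energy tick count regardless of the key name in use.

    Assumption: newer releases use "energy_accumulator" and older ones
    use "power" for the same value.
    """
    for key in ("energy_accumulator", "power"):
        if key in energy_dict:
            return energy_dict[key]
    raise KeyError("no energy counter key found in energy_dict")


# The resolution still has to be applied under both key names.
old_style = {"power": 1000, "counter_resolution": 2.0}
new_style = {"energy_accumulator": 1000, "counter_resolution": 2.0}
old_scaled = raw_energy_counter(old_style) * old_style["counter_resolution"]
new_scaled = raw_energy_counter(new_style) * new_style["counter_resolution"]
```

Preferring the new key first means the lookup keeps working if a future release drops the old name entirely.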
…ccumulator` Issue linked here: #38 Signed-off-by: gabrpham <[email protected]> Change-Id: I622236eb3f0144aefeb6c82d2713b4822bfeeb11
@marifamd @charis-poag-amd Hi, I was just wondering if we have some updates regarding this issue. Thanks.
@jaywonchung We are currently testing a FW update for the MI100 internally. After we confirm that it's working, we will get you an ETA on when it's publicly available.
That's awesome, thanks a lot! I'm assuming FW means firmware; is it something that can be patched with a ROCm driver update, or is it something that has to be deployed separately?
FW is firmware. It would come in the "amdgpu-dkms-firmware" package, which is paired with the "amdgpu-dkms" package, both of which are considered the "ROCm kernel" package. It gets installed with --usecase=rocm, --usecase=graphics and --usecase=dkms.
Problem Description
Issue with amdsmi.amdsmi_get_energy_count() Method
Description
When using the amdsmi.amdsmi_get_energy_count() method, the change in total energy consumption reported in Joules is much less than what it should be. This is evident when using the AMDSMI CLI tool to query the total energy consumption.
Observed Behavior
When running amd-smi metric -pE, the output is as follows:
GPU: 0
POWER:
SOCKET_POWER: 35 W
GFX_VOLTAGE: N/A mV
SOC_VOLTAGE: N/A mV
MEM_VOLTAGE: N/A mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
ENERGY:
TOTAL_ENERGY_CONSUMPTION: 16.43 J
...
After waiting for one second and retrying:
GPU: 0
POWER:
SOCKET_POWER: 35 W
GFX_VOLTAGE: N/A mV
SOC_VOLTAGE: N/A mV
MEM_VOLTAGE: N/A mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
ENERGY:
TOTAL_ENERGY_CONSUMPTION: 16.43 J
...
Expected Behavior
This does not make sense. The formula E = P * t means that the total energy consumption should have increased by about 35 J after one second (35 W * 1 s), but the reported value does not change at all.
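The sanity check behind this expectation can be written down as a short sketch. It applies E = P * t to the socket power and compares the result with the observed change in the energy reading. The function names and the 50% tolerance are hypothetical choices for illustration, and the check assumes the power draw stays roughly constant over the interval.

```python
def expected_energy_delta(power_watts, seconds):
    """E = P * t, assuming roughly constant power over the interval."""
    return power_watts * seconds


def counter_looks_stalled(before_j, after_j, power_watts, seconds, tolerance=0.5):
    """Flag a reading whose increase is far below what E = P * t predicts.

    tolerance is an arbitrary illustrative threshold: the counter is
    flagged when it grew by less than (1 - tolerance) of the expectation.
    """
    expected = expected_energy_delta(power_watts, seconds)
    observed = after_j - before_j
    return observed < expected * (1 - tolerance)


# Values from the report above: 35 W socket power, readings one second apart.
stalled = counter_looks_stalled(16.43, 16.43, power_watts=35, seconds=1)
```

With the two readings from the report, the observed delta is 0 J against an expected 35 J, so the check flags the counter.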
Operating System
NAME="Rocky Linux", VERSION="9.1 (Blue Onyx)"
CPU
AMD EPYC 7V13 64-Core Processor
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.1.0
ROCm Component
amdsmi
Steps to Reproduce
This shell script can help replicate the issue. It runs amd-smi metric and waits 5 seconds:
With my output being:
Initial energy consumed by GPU 0: 19.748 J
Final energy consumed by GPU 0: 19.748 J
Energy consumed by GPU 0 in the last five seconds: 0 J
Expected energy consumption in last 5 seconds: 160 J
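The script itself is not preserved in this copy of the issue. As a stand-in, here is a hedged Python sketch of the same kind of measurement, not the reporter's script. It assumes that amd-smi metric -pE prints one "TOTAL_ENERGY_CONSUMPTION: <value> J" line per GPU, as in the output quoted above; the names parse_total_energy and measure_energy_deltas are hypothetical.

```python
import re
import subprocess
import time

# Matches lines such as "TOTAL_ENERGY_CONSUMPTION: 16.43 J".
ENERGY_RE = re.compile(r"TOTAL_ENERGY_CONSUMPTION:\s*([0-9]+(?:\.[0-9]+)?)\s*J")


def parse_total_energy(cli_output):
    """Return the reported total energy values in joules, in output order."""
    return [float(value) for value in ENERGY_RE.findall(cli_output)]


def measure_energy_deltas(seconds=5.0):
    """Sample the CLI twice, `seconds` apart, and return per-GPU deltas."""

    def sample():
        result = subprocess.run(
            ["amd-smi", "metric", "-pE"],
            capture_output=True,
            text=True,
            check=True,
        )
        return parse_total_energy(result.stdout)

    before = sample()
    time.sleep(seconds)
    after = sample()
    return [end - start for start, end in zip(before, after)]
```

On an affected MI100, measure_energy_deltas() would be expected to return values at or near zero, matching the output shown above.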
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
We are using the AMD HPC cluster.