Hi,
I know this is a strange request, but we've been hitting a bug only on the MI100 GPUs when we try to read their cumulative energy counter with amd-smi. One of the suggestions from the amdsmi team was to reboot the machine: ROCm/amdsmi#38 (comment)
Is there any way we can test and see if the reboot works?
In case it helps, I just verified that the counter problem still exists on one of the MI1008x nodes. The job allocation ID was 61463.
Thank you.
t006-009 has been rebooted, so you can test this out. You should be able to request that host specifically by adding the following to your Slurm request: