-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feat] amdsmi
bindings integration
#132
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked at a high level. Thanks for your work!
I think 0.5 seconds is just right, assuming it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work! Just a couple nits :)
zeus/device/gpu/amd.py
Outdated
"""Check if the GPU supports retrieving total energy consumption. Returns a future object of the result.""" | ||
wait_time = 0.5 # seconds | ||
threshold = 0.8 # 80% threshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to first check whether self._supportsGetTotalEnergyConsumption is not None
and return the cached value for future invocations of this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with the current structure it's never called more than once. But that would be better, I'll add it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is part of the public API of GPU
and GPUs
, meaning people can always call it in their code. Thanks!
Co-authored-by: Jae-Won Chung <[email protected]>
@parthraut Is this ready for review? Did stuff work properly on the AMD HPC cluster? |
Co-authored-by: Jae-Won Chung <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @parthraut! I've made a few specific suggestions, so please take a look and apply as you see fit. After these, I think it LGTM.
zeus/device/gpu/amd.py
Outdated
# set _supportsInstantPowerUsage for all GPUs | ||
for gpu in self._gpus: | ||
if gpu.getInstantPowerUsage() == "N/A": | ||
gpu._supportsInstantPowerUsage = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused. There's an int()
cast inside getInstantPowerUsage
. Wouldn't this raise an error? In any case, this API design isn't clean. A function with a return value of int
should always return an int
regardless of the circumstances. Otherwise this defeats our purpose of static type checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# set _supportsInstantPowerUsage for all GPUs | |
for gpu in self._gpus: | |
if gpu.getInstantPowerUsage() == "N/A": | |
gpu._supportsInstantPowerUsage = False | |
# set _supportsInstantPowerUsage for all GPUs | |
for gpu in self._gpus: | |
gpu._supportsInstantPowerUsage = isinstance( | |
amdsmi.amdsmi_get_power_info(gpu.handle)["current_socket_power"], | |
int, | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked the output of amdsmi.amdsmi_get_power_info
, and the value of current/average socket power is already int
. So, the int()
casting before returning is redundant. Please remove it.
Co-authored-by: Jae-Won Chung <[email protected]>
Co-authored-by: Jae-Won Chung <[email protected]>
What is the status of this PR? |
amdsmi
bindings integrationamdsmi
bindings integration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks a lot!
A few questions:
zeus/zeus/device/gpu/amd.py:247:13 - error: Operator "*" not supported for types "c_uint32" and "Literal[1000]" (reportOperatorIssue) zeus/zeus/device/gpu/amd.py:258:9 - error: Method "supportsGetTotalEnergyConsumption" overrides class "GPU" in an incompatible manner Positional parameter count mismatch; base method has 1, but override has 2 (reportIncompatibleMethodOverride) zeus/zeus/device/gpu/amd.py:280:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:282:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:293:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:344:19 - error: Cannot access attribute "value" for class "AmdSmiException" Attribute "value" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:346:37 - error: Cannot access attribute "msg" for class "AmdSmiException"
I can fix these, but do you think this is the right approach? I wanted to make sure before tweaking base class signatures.