Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feat] amdsmi bindings integration #132

Merged
merged 22 commits into from
Nov 15, 2024
Merged

[Feat] amdsmi bindings integration #132

merged 22 commits into from
Nov 15, 2024

Conversation

parthraut
Copy link
Collaborator

@parthraut parthraut commented Oct 17, 2024

  • Added amdsmi to project dependency
  • Implemented a non-blocking contructor of AMDGPU using concurrent.futures.ThreadPoolExecutor (AMDGPUs constructor takes 0.5s (polling time) regardless of number of GPUs)
  • Right now, it succeeds if the measured value is in 1% of expected value, and waits 0.5s. These can be changed.

A few questions:

  1. What do you think about the threshold (1%) and wait time (0.5s)?
  2. This is failing pyright check:
    zeus/zeus/device/gpu/amd.py:247:13 - error: Operator "*" not supported for types "c_uint32" and "Literal[1000]" (reportOperatorIssue) zeus/zeus/device/gpu/amd.py:258:9 - error: Method "supportsGetTotalEnergyConsumption" overrides class "GPU" in an incompatible manner Positional parameter count mismatch; base method has 1, but override has 2 (reportIncompatibleMethodOverride) zeus/zeus/device/gpu/amd.py:280:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:282:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:293:26 - error: Cannot assign to attribute "_supportsGetTotalEnergyConsumption" for class "AMDGPU*" Attribute "_supportsGetTotalEnergyConsumption" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:344:19 - error: Cannot access attribute "value" for class "AmdSmiException" Attribute "value" is unknown (reportAttributeAccessIssue) zeus/zeus/device/gpu/amd.py:346:37 - error: Cannot access attribute "msg" for class "AmdSmiException"
    I can fix these, but do you think this is the right approach? I wanted to make sure before tweaking base class signatures.

Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked at a high level. Thanks for your work!

I think 0.5 seconds is just right, assuming it works.

zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work! Just a couple nits :)

zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
Comment on lines 257 to 259
"""Check if the GPU supports retrieving total energy consumption. Returns a future object of the result."""
wait_time = 0.5 # seconds
threshold = 0.8 # 80% threshold
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to first check whether self._supportsGetTotalEnergyConsumption is not None and return the cached value for future invocations of this method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the current structure it's never called more than once. But that would be better, I'll add it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is part of the public API of GPU and GPUs, meaning people can always call it in their code. Thanks!

zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
@jaywonchung
Copy link
Member

jaywonchung commented Oct 27, 2024

@parthraut Is this ready for review? Did stuff work properly on the AMD HPC cluster?

zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @parthraut! I've made a few specific suggestions, so please take a look and apply as you see fit. After these, I think it LGTM.

zeus/device/gpu/nvidia.py Outdated Show resolved Hide resolved
Comment on lines 349 to 352
# set _supportsInstantPowerUsage for all GPUs
for gpu in self._gpus:
if gpu.getInstantPowerUsage() == "N/A":
gpu._supportsInstantPowerUsage = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused. There's an int() cast inside getInstantPowerUsage. Wouldn't this raise an error? In any case, this API design isn't clean. A function with a return value of int should always return an int regardless of the circumstances. Otherwise this defeats our purpose of static type checking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# set _supportsInstantPowerUsage for all GPUs
for gpu in self._gpus:
if gpu.getInstantPowerUsage() == "N/A":
gpu._supportsInstantPowerUsage = False
# set _supportsInstantPowerUsage for all GPUs
for gpu in self._gpus:
gpu._supportsInstantPowerUsage = isinstance(
amdsmi.amdsmi_get_power_info(gpu.handle)["current_socket_power"],
int,
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the output of amdsmi.amdsmi_get_power_info, and the value of current/average socket power is already int. So, the int() casting before returning is redundant. Please remove it.

zeus/device/gpu/amd.py Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
docs/measure/index.md Outdated Show resolved Hide resolved
zeus/device/gpu/amd.py Show resolved Hide resolved
@jaywonchung jaywonchung changed the title AMD integration amdsmi bindings integration Oct 31, 2024
docs/measure/index.md Outdated Show resolved Hide resolved
@jaywonchung
Copy link
Member

What is the status of this PR?

@jaywonchung jaywonchung changed the title amdsmi bindings integration [Feat] amdsmi bindings integration Nov 13, 2024
Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks a lot!

@jaywonchung jaywonchung merged commit 0539e7e into master Nov 15, 2024
3 checks passed
@jaywonchung jaywonchung deleted the amd-integration branch November 15, 2024 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants