Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor #150

Open
wants to merge 19 commits into
base: master
Choose a base branch
from

Conversation

michahn01
Copy link

This PR addresses this issue: #146

Overview of changes:

  • Add class ZeusdRAPLCPU to zeus/device/cpu/rapl.py; ZeusdRAPLCPU communicates with Zeusd to interface with RAPL indirectly, removing the need for root privileges
  • Modify RAPLCPUs in zeus/device/cpu/rapl.py to utilize ZeusdRAPLCPU rather than RAPLCPU if Zeusd is found on the expected socket
  • Added an endpoint in zeusd/src/routes/cpu.rs for checking if DRAM energy monitoring is available
  • Made some small changes to zeusd/src/devices/cpu/mod.rs to support the above endpoint.
  • Added a test in zeusd/tests/cpu.rs to verify the endpoint works properly.

@jaywonchung
Copy link
Member

@michahn01 Thanks for your work! CI is failing; please look into the CI logs and get them to pass. You can also check the exact command being run for each workflow file in .github/workflows/.

@jaywonchung jaywonchung linked an issue Jan 26, 2025 that may be closed by this pull request
@michahn01
Copy link
Author

@michahn01 Thanks for your work! CI is failing; please look into the CI logs and get them to pass. You can also check the exact command being run for each workflow file in .github/workflows/.

Got it! The issues were fixed, I believe CI should pass now.

Copy link
Member

@jaywonchung jaywonchung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work! You nailed most parts on your first contribution, too! I think it's good to go after some minor changes.

@wbjin Could you take a look at this PR as well?

zeus/device/cpu/rapl.py Show resolved Hide resolved
def supportsGetDramEnergyConsumption(self) -> bool:
"""Returns True if the specified CPU powerzone supports retrieving the subpackage energy consumption."""
resp = self._client.get(
self._url_prefix + "/supportsDramEnergy",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All other routes use camel case for the path. Please make this camel case as well.

Comment on lines 304 to 307
cpu_uj = data.get("cpu_energy_uj")
dram_uj = data.get("dram_energy_uj")
cpu_mj = None if cpu_uj is None else cpu_uj / 1000
dram_mj = None if dram_uj is None else dram_uj / 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wbjin Quick question -- Here we've constructed the request with cpu and dram both set to True. In that case, can the response cpu_energy_uj ever be None? Because if there was some error from the daemon-side, the error would be propagated back to the Python client, so the status code won't be 200. On the other hand, dram may actually be None because the underlying RAPL device may not support DRAM measurement. Just wanted to confirm this with you.

Assuming that the above is true, cpu_energy_uj somehow ending up as a None means something went wrong. Zeus should error out.

Suggested change
cpu_uj = data.get("cpu_energy_uj")
dram_uj = data.get("dram_energy_uj")
cpu_mj = None if cpu_uj is None else cpu_uj / 1000
dram_mj = None if dram_uj is None else dram_uj / 1000
cpu_mj = data["cpu_energy_uj"] / 1000
dram_uj = data.get("dram_energy_uj")
dram_mj = dram_uj / 1000 if dram_uj is not None else None

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing that out -- this was my mistake. I don't think cpu_energy_uj can ever be None here if zeusd is working correctly. But @wbjin please confirm if this sounds right!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might go even further here. In the constructor, we can call supportsGetDramEnergyConsumption (via the daemon) once and cache the result inside the class. Subsequent calls to supportsGetDramEnergyConsumption will return the cached value. Also, in getTotalEnergyConsumption, we would be able to know whether None is a valid response value for the dram field.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, if cpu and dram are set to be both True in the request, cpu_energy_uj shouldn't be None while dram_energy_uj can be None if dram isn't supported.

@@ -35,6 +35,19 @@ pub struct RaplResponse {
pub dram_energy_uj: Option<u64>,
}

#[derive(Serialize, Deserialize, Debug)]
pub struct DramResponse {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to DramAvailabilityResponse.

Comment on lines 43 to 50
/// Unified CPU response type
#[derive(Serialize, Deserialize, Debug)]
#[serde(untagged)]
pub enum CpuResponse {
Rapl(RaplResponse),
Dram(DramResponse),
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very clean. Good job!

Comment on lines +52 to +57
#[tracing::instrument(
skip(cpu_id, _device_tasks),
fields(
cpu_id = %cpu_id,
)
)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like cpu_id shouldn't be in the skip list. Applies to other route handlers as well. @wbjin WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would be useful to be able to see the cpu_id for the request. I think the same applies for /src/routes/gpu.rs gpu routes as well, it skips the gpu_id

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Let me handle gpu_ids in a separate PR.

import zeus.device.cpu.common as cpu_common
from zeus.device.cpu.common import CpuDramMeasurement
from zeus.device.exception import ZeusBaseCPUError
from zeus.device.exception import ZeusdError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge this with the line above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor
3 participants