-
Notifications
You must be signed in to change notification settings - Fork 29
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor #150
base: master
Are you sure you want to change the base?
Incorporate Zeusd for CPU and DRAM monitoring in ZeusMonitor #150
Conversation
@michahn01 Thanks for your work! CI is failing; please look into the CI logs and get them to pass. You can also check the exact command being run for each workflow file in |
Got it! The issues were fixed, I believe CI should pass now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work! You nailed most parts on your first contribution, too! I think it's good to go after some minor changes.
@wbjin Could you take a look at this PR as well?
zeus/device/cpu/rapl.py
Outdated
def supportsGetDramEnergyConsumption(self) -> bool: | ||
"""Returns True if the specified CPU powerzone supports retrieving the subpackage energy consumption.""" | ||
resp = self._client.get( | ||
self._url_prefix + "/supportsDramEnergy", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All other routes use camel case for the path. Please make this camel case as well.
zeus/device/cpu/rapl.py
Outdated
cpu_uj = data.get("cpu_energy_uj") | ||
dram_uj = data.get("dram_energy_uj") | ||
cpu_mj = None if cpu_uj is None else cpu_uj / 1000 | ||
dram_mj = None if dram_uj is None else dram_uj / 1000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wbjin Quick question -- Here we've constructed the request with cpu
and dram
both set to True
. In that case, can the response cpu_energy_uj
ever be None
? Because if there was some error from the daemon-side, the error would be propagated back to the Python client, so the status code won't be 200. On the other hand, dram
may actually be None
because the underlying RAPL device may not support DRAM measurement. Just wanted to confirm this with you.
Assuming that the above is true, cpu_energy_uj
somehow ending up as a None
means something went wrong. Zeus should error out.
cpu_uj = data.get("cpu_energy_uj") | |
dram_uj = data.get("dram_energy_uj") | |
cpu_mj = None if cpu_uj is None else cpu_uj / 1000 | |
dram_mj = None if dram_uj is None else dram_uj / 1000 | |
cpu_mj = data["cpu_energy_uj"] / 1000 | |
dram_uj = data.get("dram_energy_uj") | |
dram_mj = dram_uj / 1000 if dram_uj is not None else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for pointing that out -- this was my mistake. I don't think cpu_energy_uj
can ever be None here if zeusd is working correctly. But @wbjin please confirm if this sounds right!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might go even further here. In the constructor, we can call supportsGetDramEnergyConsumption
(via the daemon) once and cache the result inside the class. Subsequent calls to supportsGetDramEnergyConsumption
will return the cached value. Also, in getTotalEnergyConsumption
, we would be able to know whether None
is a valid response value for the dram
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, if cpu
and dram
are set to be both True
in the request, cpu_energy_uj
shouldn't be None
while dram_energy_uj
can be None
if dram isn't supported.
zeusd/src/devices/cpu/mod.rs
Outdated
@@ -35,6 +35,19 @@ pub struct RaplResponse { | |||
pub dram_energy_uj: Option<u64>, | |||
} | |||
|
|||
#[derive(Serialize, Deserialize, Debug)] | |||
pub struct DramResponse { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's rename this to DramAvailabilityResponse
.
zeusd/src/devices/cpu/mod.rs
Outdated
/// Unified CPU response type | ||
#[derive(Serialize, Deserialize, Debug)] | ||
#[serde(untagged)] | ||
pub enum CpuResponse { | ||
Rapl(RaplResponse), | ||
Dram(DramResponse), | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very clean. Good job!
#[tracing::instrument( | ||
skip(cpu_id, _device_tasks), | ||
fields( | ||
cpu_id = %cpu_id, | ||
) | ||
)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like cpu_id
shouldn't be in the skip
list. Applies to other route handlers as well. @wbjin WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it would be useful to be able to see the cpu_id
for the request. I think the same applies for /src/routes/gpu.rs
gpu routes as well, it skips the gpu_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Let me handle gpu_id
s in a separate PR.
zeus/device/cpu/rapl.py
Outdated
import zeus.device.cpu.common as cpu_common | ||
from zeus.device.cpu.common import CpuDramMeasurement | ||
from zeus.device.exception import ZeusBaseCPUError | ||
from zeus.device.exception import ZeusdError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please merge this with the line above.
This PR addresses this issue: #146
Overview of changes:
ZeusdRAPLCPU
tozeus/device/cpu/rapl.py
;ZeusdRAPLCPU
communicates with Zeusd to interface with RAPL indirectly, removing the need for root privilegesRAPLCPUs
inzeus/device/cpu/rapl.py
to utilizeZeusdRAPLCPU
rather thanRAPLCPU
if Zeusd is found on the expected socketzeusd/src/routes/cpu.rs
for checking if DRAM energy monitoring is availablezeusd/src/devices/cpu/mod.rs
to support the above endpoint.zeusd/tests/cpu.rs
to verify the endpoint works properly.