Add GPU metrics to Pod metrics for Job metadata #91

Fizzbb · 2022-02-02T20:45:45Z

The previously saved job metadata only includes aggregated CPU-related utilization. Add GPU utilization and GPU memory usage to the data.
The procedure is as follows.
From each profiler daemon:

query GPU utilization through NVML with nvmlDeviceGetUtilizationRates and find the GPUs with use.gpu > 0
query the Prometheus metrics DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_GPU_UTIL by GPU id and instance
for the GPUs with utilization > 0, get the ids of the processes running on them with nvmlDeviceGetComputeRunningProcesses
find the pod name for each process with "nsenter --target --uts hostname", using the process id as the target
save a dict that maps each pod to its GPU utilization and memory used since the pod started
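
The Prometheus and aggregation steps above could be sketched roughly as follows. This is only an illustration, not the profiler's actual code: the helper names (`build_query`, `parse_instant_vector`, `pod_gpu_metrics`), the output key names, and the use of `avg_over_time` to cover the time since pod start are all assumptions, as is the presence of `gpu` and `instance` labels on the DCGM metrics. The NVML and nsenter steps are left out because they need a GPU node; their result is passed in as a plain mapping from (instance, gpu) to pod names.

```python
from typing import Dict, List, Optional, Tuple

GPU_UTIL = "DCGM_FI_DEV_GPU_UTIL"
FB_USED = "DCGM_FI_DEV_FB_USED"

GpuKey = Tuple[str, str]  # (instance, gpu id), both taken from metric labels


def build_query(metric: str, gpu_id: str, instance: str,
                window_s: Optional[int] = None) -> str:
    """Build a PromQL selector for one GPU on one instance.

    If window_s is given (e.g. seconds since pod start), the selector is
    wrapped in avg_over_time. That choice is an assumption of this sketch.
    """
    selector = f'{metric}{{gpu="{gpu_id}",instance="{instance}"}}'
    if window_s is None:
        return selector
    return f"avg_over_time({selector}[{window_s}s])"


def parse_instant_vector(response: dict) -> Dict[GpuKey, float]:
    """Turn a Prometheus instant-query JSON response into {(instance, gpu): value}."""
    values: Dict[GpuKey, float] = {}
    for sample in response["data"]["result"]:
        labels = sample["metric"]
        _timestamp, raw_value = sample["value"]  # Prometheus returns the value as a string
        values[(labels["instance"], labels["gpu"])] = float(raw_value)
    return values


def pod_gpu_metrics(util: Dict[GpuKey, float],
                    fb_used: Dict[GpuKey, float],
                    gpu_to_pods: Dict[GpuKey, List[str]]) -> Dict[str, List[dict]]:
    """Build {pod name: [per-GPU records]}, skipping GPUs with utilization <= 0."""
    pods: Dict[str, List[dict]] = {}
    for key, pod_names in gpu_to_pods.items():
        gpu_util = util.get(key, 0.0)
        if gpu_util <= 0:
            continue  # only GPUs that are in use are recorded
        for pod in pod_names:
            pods.setdefault(pod, []).append(
                {"gpu": key[1], "gpu_util": gpu_util, "fb_used": fb_used.get(key)}
            )
    return pods
```

In use, each daemon would send the strings from `build_query` to the Prometheus HTTP API, feed the JSON responses through `parse_instant_vector`, and combine the two results with the pod mapping it obtained from NVML and nsenter.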

Fizzbb assigned zliu374 Feb 2, 2022
