The previously saved job metadata only includes aggregated CPU-related utilization. Add GPU utilization and GPU memory usage to the data.
The procedure is as follows.
From each profiler daemon:
query GPU utilization through NVML with nvmlDeviceGetUtilizationRates and find the GPUs whose reported gpu utilization is > 0
query the Prometheus metrics DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_GPU_UTIL by GPU id and instance
for the GPUs with utilization > 0, get the IDs of the processes running on them with nvmlDeviceGetComputeRunningProcesses
find the pod name of each process with "nsenter --target <pid> --uts hostname"
save the results to a dict that maps each pod to its GPU utilization and memory used since the pod started
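
The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the profiler's actual code: it assumes the pynvml bindings are available on the node, that the pod's hostname equals its pod name, and that the DCGM metrics carry a `gpu` label. The helper names (`busy_gpu_pids`, `pod_name_for_pid`, `dcgm_selector`, `record_sample`, `summarize`) are hypothetical.

```python
import subprocess
from collections import defaultdict


def busy_gpu_pids():
    """Return {gpu_index: [pid, ...]} for GPUs whose NVML utilization is > 0."""
    import pynvml  # assumption: NVML Python bindings are installed on the node

    pynvml.nvmlInit()
    try:
        result = {}
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            if pynvml.nvmlDeviceGetUtilizationRates(handle).gpu > 0:
                procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
                result[index] = [p.pid for p in procs]
        return result
    finally:
        pynvml.nvmlShutdown()


def pod_name_for_pid(pid):
    """Read the hostname inside the process's UTS namespace.

    Assumption: the pod's hostname is its pod name. Usually needs root.
    """
    out = subprocess.run(
        ["nsenter", "--target", str(pid), "--uts", "hostname"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def dcgm_selector(metric, gpu, instance):
    """Build a PromQL selector; the `gpu` label name is an assumption."""
    return f'{metric}{{gpu="{gpu}", instance="{instance}"}}'


def new_stats():
    """Create the per-pod dict: pod name -> lists of samples since pod start."""
    return defaultdict(lambda: {"gpu_util": [], "mem_used": []})


def record_sample(stats, pod, gpu_util, mem_used):
    """Append one (utilization, memory used) sample for a pod."""
    stats[pod]["gpu_util"].append(gpu_util)
    stats[pod]["mem_used"].append(mem_used)


def summarize(stats):
    """Reduce each pod's samples to average utilization and peak memory."""
    return {
        pod: {
            "gpu_util_avg": sum(s["gpu_util"]) / len(s["gpu_util"]),
            "mem_used_max": max(s["mem_used"]),
        }
        for pod, s in stats.items()
    }
```

A daemon loop would call `busy_gpu_pids()`, resolve each PID with `pod_name_for_pid()`, fetch the two DCGM metrics from Prometheus using selectors like the one `dcgm_selector()` builds, and pass the values to `record_sample()`. Memory values are stored in whatever unit DCGM reports.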