Support periodically checking device health and notifying kubelet of device changes via the ListAndWatch RPC call
What is the use case for this feature / enhancement?
Devices may become unhealthy, e.g. a resource consumed by a workload may become corrupted while in use. We should report this to kubelet so that the device is not requested for future workloads.
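A minimal sketch of what such a periodic check could look like, offered as an illustration rather than the plugin's actual design. `Device` is a local stand-in for the API's device message (ID and health only), the health strings match the `Healthy` / `Unhealthy` values used by the v1beta1 device plugin API, and `refreshHealth`, `watchHealth`, and the `probe` callback are hypothetical names. In a real plugin, the `ListAndWatch` handler would read from the `updates` channel and send the new device list on its gRPC stream.

```go
package main

import (
	"fmt"
	"time"
)

// Health strings as used by the v1beta1 device plugin API.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a local stand-in for the API's device message (assumption:
// only the ID and Health fields matter for this sketch).
type Device struct {
	ID     string
	Health string
}

// refreshHealth re-probes every device and reports whether any health
// state changed. probe is a hypothetical callback that decides health.
func refreshHealth(devs []*Device, probe func(id string) bool) bool {
	changed := false
	for _, d := range devs {
		h := Unhealthy
		if probe(d.ID) {
			h = Healthy
		}
		if d.Health != h {
			d.Health = h
			changed = true
		}
	}
	return changed
}

// watchHealth probes on every tick and, only when something changed,
// pushes a snapshot of the device list on updates. It returns when stop
// is closed.
func watchHealth(devs []*Device, probe func(string) bool,
	interval time.Duration, updates chan<- []Device, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if !refreshHealth(devs, probe) {
				continue
			}
			snap := make([]Device, len(devs))
			for i, d := range devs {
				snap[i] = *d
			}
			select {
			case updates <- snap:
			case <-stop:
				return
			}
		}
	}
}

func main() {
	// Hypothetical PCI address, initially reported as healthy.
	devs := []*Device{{ID: "0000:03:00.1", Health: Healthy}}
	updates := make(chan []Device)
	stop := make(chan struct{})

	// The probe always fails here, simulating a device that went bad.
	go watchHealth(devs, func(string) bool { return false },
		10*time.Millisecond, updates, stop)

	fmt.Println(<-updates) // the snapshot ListAndWatch would forward
	close(stop)
}
```

Sending only on change keeps the stream quiet while nothing happens; the polling interval and the probe itself are the parts that would need an actual design decision.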
It seems that there is a handler for updateSignal, but I assume the signal has to be issued by kubelet (?). Do you want the DP to 'proactively' scan the devices' health status and then pass that info to kubelet on its own?
My other question is about the plans to make the DP track the devices (as in issue 276): should the DP then also track the health status of the 'consumed' devices? I wonder whether that is even achievable, as the devices are moved to the container's namespace.
Isn't it a bug as it is mentioned as a supported feature?:
Maybe a documentation bug :) I don't remember having this logic in the DP.
@ipatrykx I think we should first define what a healthy device is.
A good start, IMO: a device is considered healthy if all relevant resources for that device are present in the system.
I am unsure how to check this for allocated devices.
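As a rough illustration of the "all relevant resources are present" definition, here is a sketch that treats a device as healthy only if every required path exists. The function name `isDeviceHealthy`, the PCI address, and the choice of sysfs entries are assumptions made for the example; which resources actually count as relevant is exactly what would need to be agreed on, and this does not address allocated devices.

```go
package main

import (
	"fmt"
	"os"
)

// isDeviceHealthy reports whether every path in required can be stat'ed.
// Any error (missing path, permission problem) counts as unhealthy.
func isDeviceHealthy(required []string) bool {
	for _, p := range required {
		if _, err := os.Stat(p); err != nil {
			return false
		}
	}
	return true
}

func main() {
	// Hypothetical device address; the list of "relevant" entries
	// (device directory and bound driver) is an assumption.
	base := "/sys/bus/pci/devices/0000:03:00.1"
	required := []string{base, base + "/driver"}
	fmt.Println(isDeviceHealthy(required))
}
```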
https://github.com/kubernetes/kubernetes/blob/234d7311822aecb8c5f4115107007b8420d9316b/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.proto#L58