Support device health-check #362

adrianchiris · 2021-07-13T15:06:35Z

What would you like to be added?

Support periodically checking device health and notifying kubelet of device changes via the ListAndWatch RPC call.

What is the use case for this feature / enhancement?

Devices may become unhealthy, e.g. a resource was consumed by a workload and became corrupted while in use. We should report this to kubelet so that the device is not requested for future workloads.

https://github.com/kubernetes/kubernetes/blob/234d7311822aecb8c5f4115107007b8420d9316b/staging/src/k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1/api.proto#L58
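A minimal sketch of what the requested behaviour could look like, not the plugin's actual implementation. `Device` and the `Healthy`/`Unhealthy` strings are local stand-ins for the corresponding items in the kubelet device plugin API linked above; `HealthProbe`, `refreshHealth`, `watchHealth` and the `send` callback are hypothetical names. In a real plugin, `send` would presumably wrap the ListAndWatch stream.

```go
package main

import (
	"context"
	"time"
)

// Health values; stand-ins for the device plugin API constants.
const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// Device is a local stand-in for the device plugin API's Device message.
type Device struct {
	ID     string
	Health string
}

// HealthProbe is a hypothetical callback that reports whether a device is healthy.
type HealthProbe func(id string) bool

// refreshHealth re-probes every device, updates Health in place,
// and reports whether any device changed state.
func refreshHealth(devs []*Device, probe HealthProbe) bool {
	changed := false
	for _, d := range devs {
		h := Unhealthy
		if probe(d.ID) {
			h = Healthy
		}
		if d.Health != h {
			d.Health = h
			changed = true
		}
	}
	return changed
}

// watchHealth polls at the given interval and calls send with the full
// device list whenever at least one device changed health. It returns
// when ctx is cancelled or send fails.
func watchHealth(ctx context.Context, interval time.Duration, devs []*Device,
	probe HealthProbe, send func([]*Device) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if refreshHealth(devs, probe) {
				if err := send(devs); err != nil {
					return err
				}
			}
		}
	}
}
```

Sending only on a state change keeps kubelet updates infrequent. What the probe should actually check is left open here.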


TothFerenc · 2021-07-13T15:48:30Z

Isn't this a bug, since it is listed as a supported feature? https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin#features

ipatrykx · 2021-07-21T11:04:12Z

It seems that there is a handler for updateSignal, but I assume that has to be issued by kubelet (?). Do you want to make the DP 'proactively' scan the devices' health status and then pass that info to kubelet on its own?

The other question for me is about the plans to make the DP track devices (as in issue 276): should the DP then also track the health status of the 'consumed' devices? I am wondering whether that is even achievable, as the devices are moved to the container's namespace.

adrianchiris · 2021-07-27T12:45:01Z

> Isn't it a bug as it is mentioned as a supported feature?

Maybe a documentation bug :) I don't remember having this logic in the DP.

@ipatrykx I think we should first define what a healthy device is.

A good start IMO: a device is considered healthy if all relevant resources for that device are present in the system.
I am unsure how to check this for allocated devices.

adrianchiris added the enhancement label Jul 13, 2021

adrianchiris mentioned this issue Jul 13, 2021

PF/VF health monitoring #363

Closed

ian-howell mentioned this issue Mar 29, 2023

Add selectors nicNames #434

Closed
