Default namespace is `multi-nic-cni-operator`. If the namespace has not been created, run:

```bash
kubectl create ns multi-nic-cni-operator
```
1. Set a security policy to allow TCP communication on port `11001` on the host primary network.
2. Clone this repository and move to the `health-check` directory:

    ```bash
    git clone https://github.com/foundation-model-stack/multi-nic-cni.git
    cd multi-nic-cni/health-check
    chmod +x ./checker/script.sh
    ```
3. Add the `openshift.io/cluster-monitoring` label to the `multi-nic-cni-operator` namespace:

    ```bash
    kubectl label ns multi-nic-cni-operator openshift.io/cluster-monitoring=true
    ```
4. Run the deployment.

    4.1. (optional) Specify a node (agent) selector by modifying `agentSelector` in `./checker/configmap.yaml`.

    4.2. Run the deploy script.

    - For a cluster with only a single multinicnetwork:

        ```bash
        make deploy
        ```

    - For a cluster with multiple multinicnetworks:

        ```bash
        # deploy health-check agents (used for all multinicnetworks)
        make deploy-agent
        # deploy checker (one deployment per multinicnetwork)
        export NETWORK_NAME=<target multinicnetwork name>
        ./checker/script.sh deploy ${NETWORK_NAME}
        ```
    Expected output:

    ```
    serviceaccount/multi-nic-cni-health-check-agent-sa created
    clusterrole.rbac.authorization.k8s.io/privileged-cr created
    clusterrolebinding.rbac.authorization.k8s.io/multi-nic-cni-privileged-cr-binding created
    daemonset.apps/multi-nic-cni-health-agent created
    serviceaccount/multi-nic-cni-health-checker-sa created
    clusterrole.rbac.authorization.k8s.io/multi-nic-cni-health-check created
    clusterrolebinding.rbac.authorization.k8s.io/multi-nic-cni-health-check-cr-binding created
    service/multi-nic-cni-health-check created
    servicemonitor.monitoring.coreos.com/multi-nic-cni-health-check created
    role.rbac.authorization.k8s.io/multi-nic-cni-health-check-prometheus created
    rolebinding.rbac.authorization.k8s.io/multi-nic-cni-health-check-prometheus created
    clusterrolebinding.rbac.authorization.k8s.io/insecure-prometheus-k8s created
    Set network name multi-nic-network
    deployment.apps/multi-nic-cni-health-checker created
    ```
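When the cluster has many multinicnetworks, the per-network deploy step above can be wrapped in a loop. A hedged sketch (the helper name `deploy_checkers` is ours; the actual deploy call is commented out so the snippet is safe to dry-run, and listing the CRs as `kubectl get multinicnetwork` is an assumption about the resource's name):

```shell
# Deploy one checker per multinicnetwork name passed as an argument.
deploy_checkers() {
  for net in "$@"; do
    echo "deploying checker for ${net}"
    # ./checker/script.sh deploy "${net}"   # uncomment when run inside health-check/
  done
}

# Live usage (assumes the CR is listable as 'multinicnetwork'):
# deploy_checkers $(kubectl get multinicnetwork -o jsonpath='{.items[*].metadata.name}')
deploy_checkers example-net-a example-net-b
```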
5. Check whether the health checker and health-check agents are running:

    ```bash
    kubectl get po -n multi-nic-cni-operator
    ```

    ```
    NAME                              READY   STATUS    RESTARTS   AGE
    multi-nic-cni-health-agent-xx     1/1     Running   0
    multi-nic-cni-health-agent-yy     1/1     Running   0
    multi-nic-cni-health-checker-zz   1/1     Running   0
    ```
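Instead of inspecting the pod list by eye, the readiness check can be scripted with `kubectl rollout status`. A minimal sketch (the function name is ours; the daemonset and deployment names are taken from the deploy output above, and the namespace defaults to the install namespace):

```shell
# Default install namespace; override NAMESPACE if you deployed elsewhere.
NAMESPACE="${NAMESPACE:-multi-nic-cni-operator}"

# Block until the health-check agents and checker have rolled out.
wait_for_health_check() {
  kubectl rollout status daemonset/multi-nic-cni-health-agent -n "$NAMESPACE" --timeout=120s &&
  kubectl rollout status deployment/multi-nic-cni-health-checker -n "$NAMESPACE" --timeout=120s
}

# wait_for_health_check   # run against a live cluster context
```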
6. Test the status service with port forwarding:

    ```bash
    # forward the port in one terminal
    checker=$(kubectl get po -n multi-nic-cni-operator | grep multi-nic-cni-health-checker | awk '{ print $1 }')
    kubectl port-forward ${checker} -n multi-nic-cni-operator 8080:8080
    # request the status check in another terminal;
    # this request activates the health-check signal at request time
    curl localhost:8080/status
    ```

    Example response:

    ```json
    {"HealthyHosts":["hostA", "hostB", "hostC"],"FailedInfo":[{"HostName":"hostD","Connectivity":{"192.168.0.0/18":false,"192.168.64.0/18":false},"Allocability":0,"StatusCode":...,"Status":...,"Message":...}],"CheckTime":"...","Checker":"checkerX"}
    ```
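For scripting against the aggregate status, a minimal sketch that pulls the failed host names out of the response (the sample JSON below is abbreviated from the example above and embedded so the snippet runs standalone; against a live checker, capture `curl -s localhost:8080/status` instead):

```shell
# Abbreviated sample of the /status response (shape assumed from the example above).
# Live usage: status=$(curl -s localhost:8080/status)
status='{"HealthyHosts":["hostA","hostB","hostC"],"FailedInfo":[{"HostName":"hostD","Connectivity":{"192.168.0.0/18":false},"Allocability":0}],"CheckTime":"2024-01-01T00:00:00Z","Checker":"checkerX"}'

# List hosts reported in FailedInfo; "HostName" only appears there,
# so a simple grep/sed pass is enough (jq is cleaner if installed).
failed=$(printf '%s' "$status" | grep -o '"HostName":"[^"]*"' | sed 's/"HostName":"\(.*\)"/\1/')
echo "failed hosts: ${failed}"
```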
    To get the status of a specific host:

    ```bash
    # Example of a healthy host
    > curl "localhost:8080/status?host=hostA"
    > {"Info":{"HostName":"hostA","Connectivity":{"192.168.0.0/18":true,"192.168.64.0/18":true},"Allocability":2,"StatusCode":200,"Status":"Success","Message":""},"CheckTime":"...","Checker":"checkerX"}
    # Example of an unhealthy host
    > curl "localhost:8080/status?host=hostD"
    > {"Info":{"HostName":"hostD","Connectivity":{"192.168.0.0/18":false,"192.168.64.0/18":false},"Allocability":0,"StatusCode":...,"Status":...,"Message":...},"CheckTime":"...","Checker":"checkerX"}
    ```
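In automation, the per-host endpoint can serve as a simple health gate on `StatusCode`. A sketch with the healthy-host sample embedded (swap in `curl -s "localhost:8080/status?host=hostA"` against a live checker):

```shell
# Embedded healthy-host sample (shape from the example above).
# Live usage: resp=$(curl -s "localhost:8080/status?host=hostA")
resp='{"Info":{"HostName":"hostA","Connectivity":{"192.168.0.0/18":true},"Allocability":2,"StatusCode":200,"Status":"Success","Message":""},"CheckTime":"...","Checker":"checkerX"}'

# Pull StatusCode out of the JSON and gate on it.
code=$(printf '%s' "$resp" | grep -o '"StatusCode":[0-9]*' | sed 's/.*://')
if [ "$code" = "200" ]; then
  echo "hostA healthy"
else
  echo "hostA unhealthy (StatusCode=${code})" >&2
fi
```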
7. Reload the Prometheus configuration.

    For an OpenShift cluster with the Prometheus operator deployed in `openshift-monitoring`:

    ```bash
    # forward the port in one terminal
    kubectl port-forward prometheus-k8s-0 -n openshift-monitoring 9090:9090
    # request the reload in another terminal
    curl -X POST localhost:9090/-/reload
    ```

    Remark: the manifests are prepared for an OpenShift cluster where Prometheus is deployed via the operator in the `openshift-monitoring` namespace. To export to a Prometheus instance in another namespace, modify `./health-check/rbac.yaml` accordingly.

Then, you can check the two exported metrics via Prometheus queries on `localhost:9090`:
- `multi_nic_cni_connectivity`: exports 1 if the secondary network interface is reachable from the checker pod via the given network address; otherwise, exports 0.

    ```
    # total healthy connections observed on each host by a specific checker
    sum(multi_nic_cni_connectivity) by (host, checker)
    # list failed connectivity
    multi_nic_cni_connectivity != 1
    ```
- `multi_nic_cni_allocability`: exports the number of secondary network interfaces that can be successfully created and deleted by the CNI.

    ```
    # healthy allocability observed on each host among all checkers
    avg(multi_nic_cni_allocability) by (host)
    # list failed allocability
    avg(multi_nic_cni_allocability) by (host) != no. of secondary interfaces
    ```
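The same queries can also be issued from the shell through the Prometheus HTTP API once the port-forward from step 7 is running. A minimal wrapper (the helper name `prom_query` is ours, not part of the project; the live call is commented out so the snippet dry-runs):

```shell
# Query the Prometheus HTTP API; --data-urlencode handles the
# spaces and operators in a PromQL expression.
prom_query() {
  curl -sG 'localhost:9090/api/v1/query' --data-urlencode "query=$1"
}

# Live usage (requires the port-forward to prometheus-k8s-0):
# prom_query 'sum(multi_nic_cni_connectivity) by (host, checker)'
```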