test: add retry for anp/banp connection test #273

Closed
wants to merge 1 commit

Conversation

hongzhen-ma

@hongzhen-ma hongzhen-ma commented Nov 15, 2024

Add a retry for the ANP/BANP connection tests. The policies need some time to take effect, but the tests run the check only once, which can lead to unnecessary failures.

part of #108
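
(For illustration only: a minimal sketch of what such a retry could look like, using the wait helpers from k8s.io/apimachinery. The probeWithRetry name and signature are assumptions for this example, not the actual change in this PR.)

```go
package conformance

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// probeWithRetry re-runs a single connection probe until it reports the
// expected result or the timeout expires, instead of checking only once.
// The probe callback is a placeholder for the suite's connectivity check.
func probeWithRetry(ctx context.Context, timeout time.Duration, probe func(ctx context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 1*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			ok, err := probe(ctx)
			if err != nil || !ok {
				// Not the expected result yet; the ANP/BANP may still be being
				// programmed, so keep polling until the overall timeout.
				return false, nil
			}
			return true, nil
		})
}
```

Treating a failed probe as "not yet" rather than an immediate failure is what gives the policy time to be programmed; the overall timeout still bounds how long a genuine failure can hide.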

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongzhen-ma
Once this PR has been reviewed and has the lgtm label, please assign dyanngg for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @hongzhen-ma!

It looks like this is your first PR to kubernetes-sigs/network-policy-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/network-policy-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 15, 2024
@k8s-ci-robot
Contributor

Hi @hongzhen-ma. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 15, 2024

netlify bot commented Nov 15, 2024

Deploy Preview for kubernetes-sigs-network-policy-api ready!

Name Link
🔨 Latest commit 0a1140e
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-network-policy-api/deploys/673ae68bb31ac6000859ba1c
😎 Deploy Preview https://deploy-preview-273--kubernetes-sigs-network-policy-api.netlify.app

@hongzhen-ma hongzhen-ma marked this pull request as draft November 15, 2024 07:27
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 15, 2024
@hongzhen-ma hongzhen-ma marked this pull request as ready for review November 18, 2024 07:05
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2024
@aojea
Contributor

aojea commented Nov 18, 2024

Is this needed? I haven't seen this test flake in CI, and adding retries for network policies means the pods will not be protected during that time, which is bad for users.

@hongzhen-ma
Author

hongzhen-ma commented Nov 18, 2024

Is this needed? I haven't seen this test flake in CI, and adding retries for network policies means the pods will not be protected during that time, which is bad for users.

Is this needed?
--- We did encounter this problem in the Kube-OVN e2e tests, so we want to add a retry; after adding the retry, the checks pass in our e2e run.
Of course, we will also look into the problems in our own implementation and try to solve them.

Another point: the retry only re-checks the connection result after an ANP or BANP has been applied; it does not modify the policy rules during the retry check, so it does not affect the restrictions that the policy rules impose on Pods.

@aojea
Contributor

aojea commented Nov 18, 2024

Of course, we will also look into the problems in our own implementation and try to solve them.

Modifying tests in a way that may hide problems lowers the quality bar; the tests are there to assert on behavior, not just to pass.

@hongzhen-ma
Author

Of course, we will also look into the problems in our own implementation and try to solve them.

Modifying tests in a way that may hide problems lowers the quality bar; the tests are there to assert on behavior, not just to pass.

Thanks for the reply.
We will troubleshoot the implementation issues first.

@fasaxc

fasaxc commented Nov 19, 2024

FWIW, I've seen some flakes on Calico too. I haven't analysed them in detail yet, but there are classes of updates that can race with connectivity checks. Calico's model is as follows:

  • All connectivity is blocked by default.
  • Calico learns about the pod, programs the current known policy for that pod atomically.
  • When policy or selector matches change, Calico updates the dataplane incrementally. We make updates in a way that avoids flapping to open.

Some examples of races with that model:

  • If you have an allow rule and switch it to deny, an immediate connectivity check might still see allow, but after a few ms it should go to deny and stay that way.
  • Similarly, if you have an allow rule and you add a higher-precedence deny.
  • If you have a "deny to selector" rule and you add a new pod that matches the selector, it takes a short time to add the pod to the deny list in the dataplane (which is why deny-to-selector rules are sub-optimal!). After adding a pod, you'll need to retry the check to avoid races.

I expect that kube-network-policy has the lowest latency in these areas because it only needs to get the update to user-space, not program the kernel too.
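
(Illustration of the first race above, switching allow to deny: a hedged sketch that tolerates a short programming window but then requires the deny to hold. probeAllowed, the 10s bound, and the 2s settle window are assumptions, not Calico's or the conformance suite's actual helpers.)

```go
package conformance

import (
	"testing"
	"time"
)

// assertEventuallyDeniedAndStable tolerates a short programming window after a
// rule flips from allow to deny, then requires that deny holds (no flap back).
func assertEventuallyDeniedAndStable(t *testing.T, probeAllowed func() bool) {
	t.Helper()
	deadline := time.Now().Add(10 * time.Second) // assumed programming-latency budget
	for probeAllowed() {
		if time.Now().After(deadline) {
			t.Fatal("still allowed after 10s; expected deny")
		}
		time.Sleep(200 * time.Millisecond)
	}
	// Once deny is observed, it should not flap back to allow.
	settle := time.Now().Add(2 * time.Second) // assumed settle window
	for time.Now().Before(settle) {
		if probeAllowed() {
			t.Fatal("connectivity flapped back to allow after deny was observed")
		}
		time.Sleep(200 * time.Millisecond)
	}
}
```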

@aojea
Contributor

aojea commented Nov 19, 2024

We discussed this in Slack, and just to clarify (apologies if it came across wrong): I'm not advocating for zero latency, just for a better understanding of the problem before bumping timeouts. It is most probably my own bias, but it happens to me a lot that the immediate fix for a flake is adding a sleep or a retry, and a considerable number of times there is a bug that we end up ignoring.

In k/k the allowance is 10s, IIUIC: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/netpol/kubemanager.go#L328-L350

I think the ANP tests use 3 seconds:

```go
func DefaultTimeoutConfig() TimeoutConfig {
	return TimeoutConfig{
		CreateTimeout:         60 * time.Second,
		DeleteTimeout:         20 * time.Second,
		GetTimeout:            20 * time.Second,
		ManifestFetchTimeout:  10 * time.Second,
		NamespacesMustBeReady: 300 * time.Second,
		RequestTimeout:        3 * time.Second,
	}
}
```

I'm +1 on adapting the tests to allow 10s for the programming latency, since that is the same as what we have in k/k, but I really think we should not go beyond that. Sounds fair?

/reopen
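
(A sketch of one way to read "allow 10s for the programming latency" without changing the per-request semantics: keep the 3-second RequestTimeout from DefaultTimeoutConfig above and retry within a 10-second budget. The eventuallyReachable helper and polling interval are assumptions for this example.)

```go
package conformance

import (
	"context"
	"time"
)

const (
	programmingLatencyBudget = 10 * time.Second // overall allowance, matching the k/k netpol tests
	perRequestTimeout        = 3 * time.Second  // existing RequestTimeout, unchanged
)

// eventuallyReachable retries the connectivity check within the 10s budget
// while keeping each individual request bounded to 3s. check is a placeholder
// for the suite's probe.
func eventuallyReachable(ctx context.Context, check func(ctx context.Context) bool) bool {
	deadline := time.Now().Add(programmingLatencyBudget)
	for {
		reqCtx, cancel := context.WithTimeout(ctx, perRequestTimeout)
		ok := check(reqCtx)
		cancel()
		if ok || time.Now().After(deadline) {
			return ok
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```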

@k8s-ci-robot k8s-ci-robot reopened this Nov 19, 2024
@k8s-ci-robot
Contributor

@aojea: Reopened this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hongzhen-ma
Author

In k/k the allowance is 10s, IIUIC: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/netpol/kubemanager.go#L328-L350

Thanks for the information.
I will first look at the implementation of the netpol test cases, and then come back to see how to modify this test.

@fasaxc

fasaxc commented Nov 20, 2024

@aojea yep, 100% agree. Calico's FV tests also use a 10s timeout for connectivity/policy convergence.

I think the best strategy I've found for these kinds of tests is to measure the convergence time (with a very long abort timeout) and then assert that the time taken was less than a "reasonable" time.

Then you get one of three results instead of just pass/fail:

  • Pass: converged and in the expected time
  • Fail: converged but it took 30s instead of 10s
  • Fail: did not converge, even after <long timeout>.

Being able to distinguish "converged but took longer than we want" from "did not converge after an unreasonably long time" makes the debugging a lot easier.
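
(A hedged sketch of that measure-then-assert pattern; the helper name, thresholds, and polling interval here are assumptions, not the conformance suite's API.)

```go
package conformance

import (
	"testing"
	"time"
)

const (
	reasonableConvergence = 10 * time.Second // what we assert against (assumption)
	abortTimeout          = 2 * time.Minute  // long stop so "never converged" is distinguishable from "slow"
)

// assertConverges measures how long the condition takes to become true and
// reports one of three outcomes: converged in time, converged but too slowly,
// or never converged before the abort timeout.
func assertConverges(t *testing.T, converged func() bool) {
	t.Helper()
	start := time.Now()
	for !converged() {
		if time.Since(start) > abortTimeout {
			t.Fatalf("did not converge even after %v", abortTimeout)
		}
		time.Sleep(250 * time.Millisecond)
	}
	elapsed := time.Since(start)
	if elapsed > reasonableConvergence {
		t.Fatalf("converged, but took %v (expected under %v)", elapsed, reasonableConvergence)
	}
	t.Logf("converged in %v", elapsed)
}
```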

@aojea
Contributor

aojea commented Nov 20, 2024

Then you get one of three results instead of just pass/fail:

  • Pass: converged and in the expected time
  • Fail: converged but it took 30s instead of 10s
  • Fail: did not converge, even after <long timeout>.

I love that
