test: add retry for anp/banp connection test #273

Closed
wants to merge 1 commit

Conversation

hongzhen-ma

@hongzhen-ma hongzhen-ma commented Nov 15, 2024

Add a retry for the ANP/BANP connection tests. The policies need some time to take effect, but the tests run the check only once, which can lead to unnecessary failures.

part of #108
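
(For illustration only: a minimal sketch of what such a retry could look like, using the wait helpers from k8s.io/apimachinery. The probeWithRetry name and signature are assumptions for this example, not the actual change in this PR.)

```go
package conformance

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// probeWithRetry re-runs a single connection probe until it reports the
// expected result or the timeout expires, instead of checking only once.
// The probe callback is a placeholder for the suite's connectivity check.
func probeWithRetry(ctx context.Context, timeout time.Duration, probe func(ctx context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 1*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			ok, err := probe(ctx)
			if err != nil || !ok {
				// Not the expected result yet; the ANP/BANP may still be being
				// programmed, so keep polling until the overall timeout.
				return false, nil
			}
			return true, nil
		})
}
```

Treating a failed probe as "not yet" rather than an immediate failure is what gives the policy time to be programmed; the overall timeout still bounds how long a genuine failure can hide.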

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hongzhen-ma
Once this PR has been reviewed and has the lgtm label, please assign dyanngg for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @hongzhen-ma!

It looks like this is your first PR to kubernetes-sigs/network-policy-api 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/network-policy-api has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 15, 2024
@k8s-ci-robot
Contributor

Hi @hongzhen-ma. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 15, 2024

netlify bot commented Nov 15, 2024

Deploy Preview for kubernetes-sigs-network-policy-api ready!

Name Link
🔨 Latest commit 0a1140e
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-network-policy-api/deploys/673ae68bb31ac6000859ba1c
😎 Deploy Preview https://deploy-preview-273--kubernetes-sigs-network-policy-api.netlify.app

@hongzhen-ma hongzhen-ma marked this pull request as draft November 15, 2024 07:27
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 15, 2024
@hongzhen-ma hongzhen-ma marked this pull request as ready for review November 18, 2024 07:05
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2024
@aojea
Contributor

aojea commented Nov 18, 2024

Is this needed? I haven't seen this test flake in CI, and adding retries for network policies means the pods will not be protected during that time, which is bad for users.

@hongzhen-ma
Author

hongzhen-ma commented Nov 18, 2024

Is this needed? I haven't seen this test flake in CI, and adding retries for network policies means the pods will not be protected during that time, which is bad for users.

Is this needed?
--- We did encounter this problem in the Kube-OVN e2e tests, so we want to add a retry; after adding the retry, the checks pass in our e2e run.
Of course, we will also look into the problems in our own implementation and try to solve them.

Another point: the retry only re-checks the connection result after an ANP or BANP has been applied; it does not modify the policy rules during the retry check, so it does not affect the restrictions that the policy rules impose on Pods.

@aojea
Contributor

aojea commented Nov 18, 2024

Of course, we will also look into the problems in our own implementation and try to solve them.

Modifying tests in a way that may hide problems lowers the quality bar; the tests are there to assert on behavior, not just to pass.

@hongzhen-ma
Author

Of course, we will also look into the problems in our own implementation and try to solve them.

Modifying tests in a way that may hide problems lowers the quality bar; the tests are there to assert on behavior, not just to pass.

Thanks for the reply.
We will troubleshoot the implementation issues first.

@fasaxc

fasaxc commented Nov 19, 2024

FWIW, I've seen some flakes on Calico too. I haven't analysed them in detail yet, but there are classes of updates that can race with connectivity checks. Calico's model is as follows:

  • All connectivity is blocked by default.
  • Calico learns about the pod, programs the current known policy for that pod atomically.
  • When policy or selector matches change, Calico updates the dataplane incrementally. We make updates in a way that avoids flapping to open.

Some examples of races with that model:

  • If you have an allow rule and switch it to deny, an immediate connectivity check might still see allow, but after a few ms it should go to deny and stay that way.
  • Similarly, if you have an allow rule and you add a higher-precedence deny.
  • If you have a "deny to selector" rule and you add a new pod that matches the selector, it takes a short time to add the pod to the deny list in the dataplane (which is why deny-to-selector rules are sub-optimal!). After adding a pod, you'll need to retry the check to avoid races.

I expect that kube-network-policy has the lowest latency in these areas because it only needs to get the update to user-space, not program the kernel too.
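
(Illustration of the first race above, switching allow to deny: a hedged sketch that tolerates a short programming window but then requires the deny to hold. probeAllowed, the 10s bound, and the 2s settle window are assumptions, not Calico's or the conformance suite's actual helpers.)

```go
package conformance

import (
	"testing"
	"time"
)

// assertEventuallyDeniedAndStable tolerates a short programming window after a
// rule flips from allow to deny, then requires that deny holds (no flap back).
func assertEventuallyDeniedAndStable(t *testing.T, probeAllowed func() bool) {
	t.Helper()
	deadline := time.Now().Add(10 * time.Second) // assumed programming-latency budget
	for probeAllowed() {
		if time.Now().After(deadline) {
			t.Fatal("still allowed after 10s; expected deny")
		}
		time.Sleep(200 * time.Millisecond)
	}
	// Once deny is observed, it should not flap back to allow.
	settle := time.Now().Add(2 * time.Second) // assumed settle window
	for time.Now().Before(settle) {
		if probeAllowed() {
			t.Fatal("connectivity flapped back to allow after deny was observed")
		}
		time.Sleep(200 * time.Millisecond)
	}
}
```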

@aojea
Contributor

aojea commented Nov 19, 2024

We discussed this in Slack, and just to clarify (apologies if it came across wrong): I'm not advocating for zero latency, just for a better understanding of the problem before bumping timeouts. It is most probably my own bias, but it happens to me a lot that the immediate fix for a flake is adding a sleep or a retry, and a considerable number of times there is a bug that we end up ignoring.

In k/k the allowance is 10s, IIUIC: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/netpol/kubemanager.go#L328-L350

I think the ANP tests use 3 seconds:

```go
func DefaultTimeoutConfig() TimeoutConfig {
	return TimeoutConfig{
		CreateTimeout:         60 * time.Second,
		DeleteTimeout:         20 * time.Second,
		GetTimeout:            20 * time.Second,
		ManifestFetchTimeout:  10 * time.Second,
		NamespacesMustBeReady: 300 * time.Second,
		RequestTimeout:        3 * time.Second,
	}
}
```

I'm +1 on adapting the tests to allow 10s for the programming latency, since that is the same as what we have in k/k, but I really think we should not go beyond that. Sounds fair?

/reopen
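
(A sketch of one way to read "allow 10s for the programming latency" without changing the per-request semantics: keep the 3-second RequestTimeout from DefaultTimeoutConfig above and retry within a 10-second budget. The eventuallyReachable helper and polling interval are assumptions for this example.)

```go
package conformance

import (
	"context"
	"time"
)

const (
	programmingLatencyBudget = 10 * time.Second // overall allowance, matching the k/k netpol tests
	perRequestTimeout        = 3 * time.Second  // existing RequestTimeout, unchanged
)

// eventuallyReachable retries the connectivity check within the 10s budget
// while keeping each individual request bounded to 3s. check is a placeholder
// for the suite's probe.
func eventuallyReachable(ctx context.Context, check func(ctx context.Context) bool) bool {
	deadline := time.Now().Add(programmingLatencyBudget)
	for {
		reqCtx, cancel := context.WithTimeout(ctx, perRequestTimeout)
		ok := check(reqCtx)
		cancel()
		if ok || time.Now().After(deadline) {
			return ok
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```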

@k8s-ci-robot k8s-ci-robot reopened this Nov 19, 2024
@k8s-ci-robot
Contributor

@aojea: Reopened this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hongzhen-ma
Author

In k/k the allowance is 10s, IIUIC: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/network/netpol/kubemanager.go#L328-L350

Thanks for the information.
I will first look at the implementation of the netpol test cases, and then come back to see how to modify this test.

@fasaxc

fasaxc commented Nov 20, 2024

@aojea yep, 100% agree. Calico's FV tests also use a 10s timeout for connectivity/policy convergence.

I think the best strategy I've found for these kinds of tests is to measure the convergence time (with a very long abort timeout) and then assert that the time taken was less than a "reasonable" time.

Then you get one of three results instead of just pass/fail:

  • Pass: converged and in the expected time
  • Fail: converged but it took 30s instead of 10s
  • Fail: did not converge, even after <long timeout>.

Being able to distinguish "converged but took longer than we want" from "did not converge after an unreasonably long time" makes the debugging a lot easier.
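
(A hedged sketch of that measure-then-assert pattern; the helper name, thresholds, and polling interval here are assumptions, not the conformance suite's API.)

```go
package conformance

import (
	"testing"
	"time"
)

const (
	reasonableConvergence = 10 * time.Second // what we assert against (assumption)
	abortTimeout          = 2 * time.Minute  // long stop so "never converged" is distinguishable from "slow"
)

// assertConverges measures how long the condition takes to become true and
// reports one of three outcomes: converged in time, converged but too slowly,
// or never converged before the abort timeout.
func assertConverges(t *testing.T, converged func() bool) {
	t.Helper()
	start := time.Now()
	for !converged() {
		if time.Since(start) > abortTimeout {
			t.Fatalf("did not converge even after %v", abortTimeout)
		}
		time.Sleep(250 * time.Millisecond)
	}
	elapsed := time.Since(start)
	if elapsed > reasonableConvergence {
		t.Fatalf("converged, but took %v (expected under %v)", elapsed, reasonableConvergence)
	}
	t.Logf("converged in %v", elapsed)
}
```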

@aojea
Contributor

aojea commented Nov 20, 2024

Then you get one of three results instead of just pass/fail:

  • Pass: converged and in the expected time
  • Fail: converged but it took 30s instead of 10s
  • Fail: did not converge, even after <long timeout>.

I love that
