Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[occm] Improve route controller reconciling to ensure the cluster's nodes can access each other #2484

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jeffyjf
Copy link
Contributor

@jeffyjf jeffyjf commented Nov 28, 2023

What this PR does / why we need it:

In order to the nodes of one cluster can access each other, route controller need to check and set three things:

  1. The openstack router's route rules, so that the packets can be forwarded to correct nodes.
  2. The node port's AllowAddressPair, to ensure the node permit the packets that access the node's pods/services pass through.
  3. The openstack security group's rules, so that the nodes that bind the security group permit the packets from other nodes enter into.

The current codes just check router's route rule when controller call ListRoutes and just set Route and AllowAddressPair when controller call CreateRoute. This PR complements all of the other works.

Which issue this PR fixes(if applicable):
fixes #2482

Special notes for reviewers:

This PR lack of unit tests, because of I found occm lack of a mechanism to mock openstack client. Further more the whole occm lack of lots of unit tests due to the reason. I plan to dive deeper into gophercloud in the next several days try to study whether it is possible to mock it. If it is possible I'll commit anohter PR to add the unit tests.

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 28, 2023
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Nov 28, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @jeffyjf. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Nov 28, 2023
@dulek
Copy link
Contributor

dulek commented Nov 28, 2023

Why did this ended up in Routes controller? The issue description feels like this is a general issue in cases CAPO is missing SG rules to allow interconnectivity of Node.

pkg/openstack/routes.go Outdated Show resolved Hide resolved
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Nov 29, 2023

Why did this ended up in Routes controller? The issue description feels like this is a general issue in cases CAPO is missing SG rules to allow interconnectivity of Node.

According to the offical document:

The route controller is responsible for configuring routes in the cloud appropriately so that containers on different nodes in your Kubernetes cluster can communicate with each other.

I think that this is route controller's duty. I user don't activate route controller, the interconnectivity of pods should be ensured by other mechanisms.

@jichenjc
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 29, 2023
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from 6b498a6 to f9ab6ef Compare December 6, 2023 03:35
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 6, 2023
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch 3 times, most recently from 3c6af74 to d639248 Compare December 13, 2023 01:48
@jeffyjf jeffyjf changed the title [occm][WIP] Add SG to node ensure the different node's pods can access each other [occm] Add SG to node ensure the different node's pods can access each other Dec 13, 2023
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Dec 13, 2023

@dulek @jichenjc @kayrus @zetaab asking for review

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 13, 2023
@kayrus
Copy link
Contributor

kayrus commented Dec 13, 2023

The PR needs a rebase. However the #2499 is on the way, and I adopted your getSecurityGroupRules change there as well.

@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from d639248 to eaaea60 Compare December 14, 2023 01:12
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Dec 14, 2023

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 14, 2023
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from eaaea60 to 35f8dca Compare December 14, 2023 03:00
@jeffyjf jeffyjf changed the title [occm] Add SG to node ensure the different node's pods can access each other [occm] Improve route controller reconciling to ensure the cluster's nodes can access each other Dec 14, 2023
@jeffyjf
Copy link
Contributor Author

jeffyjf commented Dec 14, 2023

/remove hold

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 14, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 14, 2024
@dulek
Copy link
Contributor

dulek commented May 23, 2024

/remove-lifecycle rotten
/lgtm

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 23, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 23, 2024
Copy link
Contributor

@kayrus kayrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR. Could you take a look at comments I just left?

pkg/openstack/openstack.go Outdated Show resolved Hide resolved
pkg/openstack/routes.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 3, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle rotten
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 31, 2024
@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 15, 2024
Copy link
Contributor

@kayrus kayrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeffyjf could you please rebase + see code review comments?

pkg/openstack/routes.go Outdated Show resolved Hide resolved
Comment on lines +147 to +149
if r.allowedAddressPairs {
// check whether the related AllowAddressPair is existing
for _, addrPair := range port.AllowedAddressPairs {
if addrPair.IPAddress == route.DestinationCIDR {
return true
}
}
return false
}
return true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if r.allowedAddressPairs {
// check whether the related AllowAddressPair is existing
for _, addrPair := range port.AllowedAddressPairs {
if addrPair.IPAddress == route.DestinationCIDR {
return true
}
}
return false
}
return true
// check whether the related AllowAddressPair is existing
if r.allowedAddressPairs && !slices.Contains(port.AllowedAddressPairs, route.DestinationCIDR) {
return false
}
return true

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port.AllowedAddressPairs is a struct object while route.DestinationCIDR is a string object. So, slices.Contains cannot directly be applied.

pkg/openstack/routes.go Outdated Show resolved Hide resolved
r.Lock()
defer r.Unlock()
mc := metrics.NewMetricContext("router", "get")
router, err := routers.Get(r.network, r.os.routeOpts.RouterID).Extract()
Copy link
Contributor

@kayrus kayrus Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're adding a complexity here. before, the router API was called only once when atomic API was not available. Now this API is called even atomic method is available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I think this is necessary. The check step is essential for reconciling. This can ensure CPO can repair a wider range of route issues. For example, if a user accidentally and manually deletes a route entity, it can be repaired simply by restarting CPO.

@jeffyjf
Copy link
Contributor Author

jeffyjf commented Nov 18, 2024

@jeffyjf could you please rebase + see code review comments?

@kayrus Thanks for your review. I will do these as soon as possible.

@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from ee22e72 to 8349f32 Compare November 18, 2024 02:24
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 18, 2024
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 18, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from dulek. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from 8349f32 to 75e7331 Compare November 18, 2024 03:08
…onnectivity

In oreder to ensure the connectivity between different nodes, route controller need
to do the following things:

1. Check and set openstack router's route rules, so that the packets can be
   forwarded to correct nodes.
2. Check and set the node port's AllowAddressPair, to permit the packets from
   the pods pass through the node's port and leave the node.
3. Check and set openstack security group's rules, so that the nodes that bind the
   security group permit the packets from other nodes enter into.

But, the previous occm codes just check router's route rules, just set router's
route and AllowAddressPair, this patch completed the other steps.
@jeffyjf jeffyjf force-pushed the route-controller-security-group branch from 75e7331 to 13d7b04 Compare November 18, 2024 03:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[occm] The pods cannot access each other when across nodes
6 participants