-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Redid handling of link down to ensure tear down of old path #583
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Ktmi I read in the description your still working on it and with TODOs in the code. Overall, it's headed in the right direction. Yes, first and foremost we definitely need a local way of reproducing this either deterministically or with a high likelihood to assess with concrete data how it's performing in terms of correctness and failover convergence control plane and data plane switchover flows performance, and e2e test maybe we don't need fully exact one but one that simulates a closely double failure, and other cases like failing both current_path and failover_path when they're not fully disjoint.
Other than that, when you need a final review let me know. I did a partial review with some things that were a bit obvious, other parts I won't assume their status or what happened since I don't know if you'll still address based on what we discussed.
main.py
Outdated
): | ||
evcs_normal.append(evc) | ||
)): | ||
redeploy.append(evc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if evc.is_affected_by_link(link)
is True
and evc.is_failover_path_affected_by_link(link)
is True
failover_path needs to be cleaned too. Otherwise this can end up keeping a failover_path that isn't UP. Can you review this conditional and test it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's what the old logic was saying to do here. But I do see your point. Looking at the original code, it wouldn't ever tear down the failover path. Just tear down, then setup the current path and try to deploy a failover path if it doesn't already exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was another adjacent bug in the logic. It's the item 1.2. It would leave failover_path
not UP (with a link down in the path), and then a subsequent link_down in the current_path
would switchover to this failover_path
that isn't UP
Looking deeper into the code here, the old system wasn't ever making the vlans available when removing the old_path. |
Good finding @Ktmi. Looking on git history up until 2023.2.7, it was using intf_3 below had a failover_path over it, and intf_4 only the current_path. First, I simulated a link down on failover_path, so it got removed but s_vlan 1 wasn't made available. intf_4 did as expected.
On 2023.2.X versions though we still had issues with |
For redeploys, its looking like performing the redeploy process in bulk would be too difficult, however, I can get the undeploying of the EVC done, preparing it for a proper redeploy. |
Just finished writing some e2e tests for the consistency behaviour. You can find them at kytos-ng/kytos-end-to-end-tests#339. The current version of mef_eline fails all three of those tests. This PR only fails the last one, so I'll be hunting down the cause of that bug. |
Here are the E2E results with the new tests:
|
9cdd0d9
to
ef0596e
Compare
@viniarck This should be ready for a final review. Two things of note: one of the tests still needs a rewrite, as we added additionally functionality to link down handler that existing tests don't implement mocks for. Additionally, using the http endpoint for sending the flow mods didn't work out. E2E tests started to fail with that, so I reverted that change. |
- Ensure that entire operation is atomic, or as close to as reasonably possible - Ensure that the state of the EVC is properly progressed in link down scenarios - Ensure that vlans are cleared for removed paths.
ef0596e
to
1f46e55
Compare
Performed a rebase to make it easier to cherry pick commits. |
LGTM, I did some testing to see if the EVC locks are hold and seems to be working as intended. Great reworking of old_path management. |
Closes #517, #575
Summary
Replaces the procedure for handling of link down events to ensure that the old_path is torn down before fully committing changes to the database.
Local Tests
This still needs some tweaking, but the general shape of the solution is here. Will have to develop a test harness which can replicate the old issue reliably.
E2E Tests
Here's the latest E2E test report: