Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redid handling of link down to ensure tear down of old path #583

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

Ktmi
Copy link

@Ktmi Ktmi commented Dec 4, 2024

Closes #517, #575

Summary

Replaces the procedure for handling of link down events to ensure that the old_path is torn down before fully committing changes to the database.

Local Tests

This still needs some tweaking, but the general shape of the solution is here. Will have to develop a test harness which can replicate the old issue reliably.

E2E Tests

Here's the latest E2E test report:

kytos-1  | Starting enhanced syslogd: rsyslogd.
kytos-1  | /etc/openvswitch/conf.db does not exist ... (warning).
kytos-1  | Creating empty database /etc/openvswitch/conf.db.
kytos-1  | Starting ovsdb-server.
kytos-1  | rsyslogd: error during config processing: omfile: chown for file '/var/log/syslog' failed: Operation not permitted [v8.2302.0 try https://www.rsyslog.com/e/2207 ]
kytos-1  | Configuring Open vSwitch system IDs.
kytos-1  | Starting ovs-vswitchd.
kytos-1  | Enabling remote OVSDB managers.
kytos-1  | There is no NAPPS_PATH specified. Default will be used.
kytos-1  | + '[' -z '' ']'
kytos-1  | + '[' -z '' ']'
kytos-1  | + echo 'There is no NAPPS_PATH specified. Default will be used.'
kytos-1  | + NAPPS_PATH=
kytos-1  | + sed -i 's/STATS_INTERVAL = 60/STATS_INTERVAL = 7/g' /var/lib/kytos/napps/kytos/of_core/settings.py
kytos-1  | + sed -i 's/CONSISTENCY_MIN_VERDICT_INTERVAL =.*/CONSISTENCY_MIN_VERDICT_INTERVAL = 60/g' /var/lib/kytos/napps/kytos/flow_manager/settings.py
kytos-1  | + sed -i 's/LINK_UP_TIMER = 10/LINK_UP_TIMER = 1/g' /var/lib/kytos/napps/kytos/topology/settings.py
kytos-1  | + sed -i 's/DEPLOY_EVCS_INTERVAL = 60/DEPLOY_EVCS_INTERVAL = 5/g' /var/lib/kytos/napps/kytos/mef_eline/settings.py
kytos-1  | + sed -i 's/LLDP_LOOP_ACTIONS = \["log"\]/LLDP_LOOP_ACTIONS = \["disable","log"\]/' /var/lib/kytos/napps/kytos/of_lldp/settings.py
kytos-1  | + sed -i 's/LLDP_IGNORED_LOOPS = {}/LLDP_IGNORED_LOOPS = {"00:00:00:00:00:00:00:01": \[\[4, 5\]\]}/' /var/lib/kytos/napps/kytos/of_lldp/settings.py
kytos-1  | + sed -i 's/CONSISTENCY_COOKIE_IGNORED_RANGE =.*/CONSISTENCY_COOKIE_IGNORED_RANGE = [(0xdd00000000000000, 0xdd00000000000009)]/g' /var/lib/kytos/napps/kytos/flow_manager/settings.py
kytos-1  | + sed -i 's/LIVENESS_DEAD_MULTIPLIER =.*/LIVENESS_DEAD_MULTIPLIER = 3/g' /var/lib/kytos/napps/kytos/of_lldp/settings.py
kytos-1  | + kytosd --help
kytos-1  | + sed -i s/WARNING/INFO/g /etc/kytos/logging.ini
kytos-1  | + test -z ''
kytos-1  | + TESTS=tests/
kytos-1  | + test -z ''
kytos-1  | + RERUNS=2
kytos-1  | + python3 scripts/wait_for_mongo.py
kytos-1  | Trying to run hello command on MongoDB...
kytos-1  | Trying to run 'hello' command on MongoDB...
kytos-1  | Trying to run 'hello' command on MongoDB...
kytos-1  | Ran 'hello' command on MongoDB successfully. It's ready!
kytos-1  | + python3 -m pytest tests/ --reruns 2 -r fEr
kytos-1  | ============================= test session starts ==============================
kytos-1  | platform linux -- Python 3.11.2, pytest-8.1.1, pluggy-1.5.0
kytos-1  | rootdir: /tests
kytos-1  | plugins: rerunfailures-13.0, timeout-2.2.0, anyio-4.3.0
kytos-1  | collected 279 items
kytos-1  | 
kytos-1  | tests/test_e2e_01_kytos_startup.py ..                                    [  0%]
kytos-1  | tests/test_e2e_05_topology.py ..................                         [  7%]
kytos-1  | tests/test_e2e_06_topology.py ....                                       [  8%]
kytos-1  | tests/test_e2e_10_mef_eline.py ..........ss.....x.........RRF........... [ 22%]
kytos-1  | .RRFRRF                                                                  [ 23%]
kytos-1  | tests/test_e2e_11_mef_eline.py ......                                    [ 25%]
kytos-1  | tests/test_e2e_12_mef_eline.py .....Xx.                                  [ 28%]
kytos-1  | tests/test_e2e_13_mef_eline.py ....Xs.s.....Xs.s.XXxX.xxxx..X........... [ 43%]
kytos-1  | .                                                                        [ 43%]
kytos-1  | tests/test_e2e_14_mef_eline.py xRRFRRFRRF                                [ 45%]
kytos-1  | tests/test_e2e_15_mef_eline.py .....                                     [ 46%]
kytos-1  | tests/test_e2e_16_mef_eline.py ..                                        [ 47%]
kytos-1  | tests/test_e2e_17_mef_eline.py ...                                       [ 48%]
kytos-1  | tests/test_e2e_20_flow_manager.py ..........................             [ 58%]
kytos-1  | tests/test_e2e_21_flow_manager.py ...                                    [ 59%]
kytos-1  | tests/test_e2e_22_flow_manager.py ...............                        [ 64%]
kytos-1  | tests/test_e2e_23_flow_manager.py ..............                         [ 69%]
kytos-1  | tests/test_e2e_30_of_lldp.py .R...                                       [ 70%]
kytos-1  | tests/test_e2e_31_of_lldp.py ...                                         [ 72%]
kytos-1  | tests/test_e2e_32_of_lldp.py ...                                         [ 73%]
kytos-1  | tests/test_e2e_40_sdntrace.py ................                           [ 78%]
kytos-1  | tests/test_e2e_41_kytos_auth.py ........                                 [ 81%]
kytos-1  | tests/test_e2e_42_sdntrace.py ..                                         [ 82%]
kytos-1  | tests/test_e2e_50_maintenance.py ............................            [ 92%]
kytos-1  | tests/test_e2e_60_of_multi_table.py .....                                [ 94%]
kytos-1  | tests/test_e2e_70_kytos_stats.py ........                                [ 97%]
kytos-1  | tests/test_e2e_80_pathfinder.py ss......                                 [100%]
kytos-1  | 
kytos-1  | =================================== FAILURES ===================================
...
kytos-1  | =========================== rerun test summary info ============================
kytos-1  | RERUN test_e2e_10_mef_eline.py::TestE2EMefEline::test_150_current_path_value_given_dynamic_backup_path_and_empty_primary_conditions
kytos-1  | RERUN test_e2e_10_mef_eline.py::TestE2EMefEline::test_150_current_path_value_given_dynamic_backup_path_and_empty_primary_conditions
kytos-1  | RERUN test_e2e_10_mef_eline.py::TestE2EMefEline::test_300_inter_evc_dynamic_uni_status
kytos-1  | RERUN test_e2e_10_mef_eline.py::TestE2EMefEline::test_300_inter_evc_dynamic_uni_status
kytos-1  | RERUN test_e2e_10_mef_eline.py::TestE2EMefEline::test_301_inter_evc_static_uni_status
kytos-1  | RERUN test_e2e_10_mef_eline.py::TestE2EMefEline::test_301_inter_evc_static_uni_status
kytos-1  | RERUN test_e2e_14_mef_eline.py::TestE2EMefEline::test_010_redeploy_avoid_vlan
kytos-1  | RERUN test_e2e_14_mef_eline.py::TestE2EMefEline::test_010_redeploy_avoid_vlan
kytos-1  | RERUN test_e2e_14_mef_eline.py::TestE2EMefEline::test_015_redeploy_avoid_primary_path
kytos-1  | RERUN test_e2e_14_mef_eline.py::TestE2EMefEline::test_015_redeploy_avoid_primary_path
kytos-1  | RERUN test_e2e_14_mef_eline.py::TestE2EMefEline::test_020_redeploy_avoid_vlan_static_path
kytos-1  | RERUN test_e2e_14_mef_eline.py::TestE2EMefEline::test_020_redeploy_avoid_vlan_static_path
kytos-1  | RERUN test_e2e_30_of_lldp.py::TestE2EOfLLDP::test_010_disable_of_lldp
kytos-1  | =========================== short test summary info ============================
kytos-1  | FAILED tests/test_e2e_10_mef_eline.py::TestE2EMefEline::test_150_current_path_value_given_dynamic_backup_path_and_empty_primary_conditions
kytos-1  | FAILED tests/test_e2e_10_mef_eline.py::TestE2EMefEline::test_300_inter_evc_dynamic_uni_status
kytos-1  | FAILED tests/test_e2e_10_mef_eline.py::TestE2EMefEline::test_301_inter_evc_static_uni_status
kytos-1  | FAILED tests/test_e2e_14_mef_eline.py::TestE2EMefEline::test_010_redeploy_avoid_vlan
kytos-1  | FAILED tests/test_e2e_14_mef_eline.py::TestE2EMefEline::test_015_redeploy_avoid_primary_path
kytos-1  | FAILED tests/test_e2e_14_mef_eline.py::TestE2EMefEline::test_020_redeploy_avoid_vlan_static_path
kytos-1  | = 6 failed, 250 passed, 8 skipped, 8 xfailed, 7 xpassed, 1369 warnings, 13 rerun in 11466.55s (3:11:06) =

�[Kkytos-1 exited with code 1

@Ktmi Ktmi requested a review from a team as a code owner December 4, 2024 16:48
Copy link
Member

@viniarck viniarck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ktmi I read in the description your still working on it and with TODOs in the code. Overall, it's headed in the right direction. Yes, first and foremost we definitely need a local way of reproducing this either deterministically or with a high likelihood to assess with concrete data how it's performing in terms of correctness and failover convergence control plane and data plane switchover flows performance, and e2e test maybe we don't need fully exact one but one that simulates a closely double failure, and other cases like failing both current_path and failover_path when they're not fully disjoint.

Other than that, when you need a final review let me know. I did a partial review with some things that were a bit obvious, other parts I won't assume their status or what happened since I don't know if you'll still address based on what we discussed.

main.py Outdated Show resolved Hide resolved
main.py Outdated Show resolved Hide resolved
@Ktmi
Copy link
Author

Ktmi commented Dec 5, 2024

Looking deeper into the code here, the old system wasn't ever making the vlans available when removing the old_path.

@Alopalao Alopalao self-requested a review December 5, 2024 21:01
@viniarck
Copy link
Member

viniarck commented Dec 6, 2024

Looking deeper into the code here, the old system wasn't ever making the vlans available when removing the old_path.

Good finding @Ktmi. Looking on git history up until 2023.2.7, it was using evc.remove_path_flows(evc.old_path), which would send flow mods via a request + make the s vlans available. On 2024.1.+ this regressed when refactored to use events for flow deletions, and indeed making the vlans available were left behind. I've also confirmed on master version this newer issue here:

intf_3 below had a failover_path over it, and intf_4 only the current_path. First, I simulated a link down on failover_path, so it got removed but s_vlan 1 wasn't made available. intf_4 did as expected.

kytos $> intf_3.available_tags
Out[5]: {'vlan': [[2, 3798], [3800, 4095]]}

kytos $> intf_4.available_tags
Out[7]: {'vlan': [[1, 3798], [3800, 4095]]}

On 2023.2.X versions though we still had issues with old_path and how it was accessed, so in certain cases it could leak too, but in newer versions it regressed more.

@Ktmi
Copy link
Author

Ktmi commented Dec 9, 2024

For redeploys, its looking like performing the redeploy process in bulk would be too difficult, however, I can get the undeploying of the EVC done, preparing it for a proper redeploy.

@Ktmi
Copy link
Author

Ktmi commented Dec 13, 2024

Just finished writing some e2e tests for the consistency behaviour. You can find them at kytos-ng/kytos-end-to-end-tests#339. The current version of mef_eline fails all three of those tests. This PR only fails the last one, so I'll be hunting down the cause of that bug.

@Ktmi
Copy link
Author

Ktmi commented Dec 13, 2024

Here are the E2E results with the new tests:

kytos-1  | Starting enhanced syslogd: rsyslogd.
kytos-1  | /etc/openvswitch/conf.db does not exist ... (warning).
kytos-1  | Creating empty database /etc/openvswitch/conf.db.
kytos-1  | Starting ovsdb-server.
kytos-1  | rsyslogd: error during config processing: omfile: chown for file '/var/log/syslog' failed: Operation not permitted [v8.2302.0 try https://www.rsyslog.com/e/2207 ]
kytos-1  | Configuring Open vSwitch system IDs.
kytos-1  | Starting ovs-vswitchd.
kytos-1  | Enabling remote OVSDB managers.
kytos-1  | + '[' -z '' ']'
kytos-1  | + '[' -z '' ']'
kytos-1  | + echo 'There is no NAPPS_PATH specified. Default will be used.'
kytos-1  | + NAPPS_PATH=
kytos-1  | There is no NAPPS_PATH specified. Default will be used.
kytos-1  | + sed -i 's/STATS_INTERVAL = 60/STATS_INTERVAL = 7/g' /var/lib/kytos/napps/kytos/of_core/settings.py
kytos-1  | + sed -i 's/CONSISTENCY_MIN_VERDICT_INTERVAL =.*/CONSISTENCY_MIN_VERDICT_INTERVAL = 60/g' /var/lib/kytos/napps/kytos/flow_manager/settings.py
kytos-1  | + sed -i 's/LINK_UP_TIMER = 10/LINK_UP_TIMER = 1/g' /var/lib/kytos/napps/kytos/topology/settings.py
kytos-1  | + sed -i 's/DEPLOY_EVCS_INTERVAL = 60/DEPLOY_EVCS_INTERVAL = 5/g' /var/lib/kytos/napps/kytos/mef_eline/settings.py
kytos-1  | + sed -i 's/LLDP_LOOP_ACTIONS = \["log"\]/LLDP_LOOP_ACTIONS = \["disable","log"\]/' /var/lib/kytos/napps/kytos/of_lldp/settings.py
kytos-1  | + sed -i 's/LLDP_IGNORED_LOOPS = {}/LLDP_IGNORED_LOOPS = {"00:00:00:00:00:00:00:01": \[\[4, 5\]\]}/' /var/lib/kytos/napps/kytos/of_lldp/settings.py
kytos-1  | + sed -i 's/CONSISTENCY_COOKIE_IGNORED_RANGE =.*/CONSISTENCY_COOKIE_IGNORED_RANGE = [(0xdd00000000000000, 0xdd00000000000009)]/g' /var/lib/kytos/napps/kytos/flow_manager/settings.py
kytos-1  | + sed -i 's/LIVENESS_DEAD_MULTIPLIER =.*/LIVENESS_DEAD_MULTIPLIER = 3/g' /var/lib/kytos/napps/kytos/of_lldp/settings.py
kytos-1  | + kytosd --help
kytos-1  | + sed -i s/WARNING/INFO/g /etc/kytos/logging.ini
kytos-1  | + test -z ''
kytos-1  | + TESTS=tests/
kytos-1  | + test -z ''
kytos-1  | + RERUNS=2
kytos-1  | + python3 scripts/wait_for_mongo.py
kytos-1  | Trying to run hello command on MongoDB...
kytos-1  | Trying to run 'hello' command on MongoDB...
kytos-1  | Trying to run 'hello' command on MongoDB...
kytos-1  | Ran 'hello' command on MongoDB successfully. It's ready!
kytos-1  | + python3 -m pytest tests/test_e2e_17_mef_eline.py
kytos-1  | ============================= test session starts ==============================
kytos-1  | platform linux -- Python 3.11.2, pytest-8.1.1, pluggy-1.5.0
kytos-1  | rootdir: /tests
kytos-1  | plugins: rerunfailures-13.0, timeout-2.2.0, anyio-4.3.0
kytos-1  | collected 3 items
kytos-1  | 
kytos-1  | tests/test_e2e_17_mef_eline.py ..F                                       [100%]
kytos-1  | 
kytos-1  | =================================== FAILURES ===================================
kytos-1  | ______________________ TestE2EMefEline.test_003_link_down ______________________
kytos-1  | 
kytos-1  | self = <tests.test_e2e_17_mef_eline.TestE2EMefEline object at 0x7f70779f8290>
kytos-1  | 
kytos-1  |     def test_003_link_down(self):
kytos-1  |         """Test link down behaviour on failover_path."""
kytos-1  |     
kytos-1  |         self.net.net.configLinkStatus("s1", "s6", "down")
kytos-1  |         self.net.net.configLinkStatus("s3", "s6", "down")
kytos-1  |         self.net.net.configLinkStatus("s5", "s6", "down")
kytos-1  |     
kytos-1  |         payload = {
kytos-1  |             "name": "Link Down Test",
kytos-1  |             "uni_a": {"interface_id": "00:00:00:00:00:00:00:01:1", "tag": {"tag_type": "vlan", "value": 100}},
kytos-1  |             "uni_z": {"interface_id": "00:00:00:00:00:00:00:05:1", "tag": {"tag_type": "vlan", "value": 100}},
kytos-1  |             "enabled": True,
kytos-1  |             "primary_constraints": {
kytos-1  |                 "mandatory_metrics": {
kytos-1  |                     "not_ownership": ["forbidden_link"],
kytos-1  |                 },
kytos-1  |             },
kytos-1  |             "dynamic_backup_path": True,
kytos-1  |         }
kytos-1  |         api_url = KYTOS_API + "/mef_eline/v2/evc/"
kytos-1  |         response = requests.post(api_url, json=payload)
kytos-1  |     
kytos-1  |         assert response.status_code == 201, response.text
kytos-1  |     
kytos-1  |         data = response.json()
kytos-1  |         evc_id =  data["circuit_id"]
kytos-1  |     
kytos-1  |         time.sleep(10)
kytos-1  |     
kytos-1  |         api_url = KYTOS_API + "/mef_eline/v2/evc/"
kytos-1  |         response = requests.get(api_url + evc_id)
kytos-1  |         data = response.json()
kytos-1  |     
kytos-1  |         assert data["current_path"]
kytos-1  |         assert data["failover_path"]
kytos-1  |     
kytos-1  |         # Collect service vlans
kytos-1  |     
kytos-1  |         vlan_allocations = defaultdict[str, list[int]](list)
kytos-1  |     
kytos-1  |         for link in data["failover_path"]:
kytos-1  |             s_vlan = link["metadata"]["s_vlan"]
kytos-1  |             for endpoint in (link["endpoint_a"], link["endpoint_b"]):
kytos-1  |                 vlan_allocations[endpoint["id"]].append(s_vlan)
kytos-1  |     
kytos-1  |         # Close a link that the failover path depends on
kytos-1  |     
kytos-1  |         link = data["failover_path"][1]
kytos-1  |         if link["id"] == LinkID("00:00:00:00:00:00:00:02:3", "00:00:00:00:00:00:00:03:2"):
kytos-1  |             self.net.net.configLinkStatus("s2", "s3", "down")
kytos-1  |         else:
kytos-1  |             self.net.net.configLinkStatus("s2", "s6", "down")
kytos-1  |     
kytos-1  |         time.sleep(10)
kytos-1  |     
kytos-1  |         # EVC should be enabled but not active
kytos-1  |     
kytos-1  |         api_url = KYTOS_API + "/mef_eline/v2/evc/"
kytos-1  |         response = requests.get(api_url + evc_id)
kytos-1  |         data = response.json()
kytos-1  |     
kytos-1  |         assert data["enabled"]
kytos-1  |         assert data["active"]
kytos-1  |     
kytos-1  |         assert data["current_path"]
kytos-1  |         assert not data["failover_path"]
kytos-1  |     
kytos-1  |         # Check that all the s_vlans have been freed
kytos-1  |     
kytos-1  |         api_url = f"{KYTOS_API}/topology/v3/interfaces/tag_ranges"
kytos-1  |     
kytos-1  |         response = requests.get(api_url)
kytos-1  |     
kytos-1  |         assert response.ok, response.text
kytos-1  |     
kytos-1  |         data = response.json()
kytos-1  |     
kytos-1  |         for interface, reserved_tags in vlan_allocations.items():
kytos-1  |             available_tags = data[interface]["available_tags"]
kytos-1  |             for reserved_tag in reserved_tags:
kytos-1  | >               assert any(
kytos-1  |                     reserved_tag["value"] >= range_start and reserved_tag["value"] <= range_end
kytos-1  |                     for (range_start, range_end) in available_tags[reserved_tag["tag_type"]]
kytos-1  |                 ), f"Vlan tag {reserved_tag} on interface {interface}, not released. Available tags: {available_tags}"
kytos-1  | E               AssertionError: Vlan tag {'tag_type': 'vlan', 'value': 2} on interface 00:00:00:00:00:00:00:01:2, not released. Available tags: {'vlan': [[3, 3798], [3800, 4095]]}
kytos-1  | E               assert False
kytos-1  | E                +  where False = any(<generator object TestE2EMefEline.test_003_link_down.<locals>.<genexpr> at 0x7f70779297e0>)
kytos-1  | 
kytos-1  | tests/test_e2e_17_mef_eline.py:394: AssertionError
kytos-1  | =============================== warnings summary ===============================
kytos-1  | test_e2e_17_mef_eline.py: 37 warnings
kytos-1  |   /usr/lib/python3/dist-packages/mininet/node.py:1121: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
kytos-1  |     return ( StrictVersion( cls.OVSVersion ) <
kytos-1  | 
kytos-1  | test_e2e_17_mef_eline.py: 37 warnings
kytos-1  |   /usr/lib/python3/dist-packages/mininet/node.py:1122: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
kytos-1  |     StrictVersion( '1.10' ) )
kytos-1  | 
kytos-1  | -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
kytos-1  | ------------------------------- start/stop times -------------------------------
kytos-1  | test_e2e_17_mef_eline.py::TestE2EMefEline::test_003_link_down: 2024-12-13,17:08:45.783771 - 2024-12-13,17:09:05.906983
kytos-1  | =========================== short test summary info ============================
kytos-1  | FAILED tests/test_e2e_17_mef_eline.py::TestE2EMefEline::test_003_link_down - ...
kytos-1  | ============= 1 failed, 2 passed, 74 warnings in 155.08s (0:02:35) =============

@Ktmi Ktmi force-pushed the fix/cleanup_old_path branch 2 times, most recently from 9cdd0d9 to ef0596e Compare December 18, 2024 20:35
@Ktmi
Copy link
Author

Ktmi commented Dec 18, 2024

@viniarck This should be ready for a final review. Two things of note: one of the tests still needs a rewrite, as we added additionally functionality to link down handler that existing tests don't implement mocks for. Additionally, using the http endpoint for sending the flow mods didn't work out. E2E tests started to fail with that, so I reverted that change.

Ktmi added 2 commits December 19, 2024 11:35
 - Ensure that entire operation is atomic, or as close to as reasonably possible
 - Ensure that the state of the EVC is properly progressed in link down scenarios
 - Ensure that vlans are cleared for removed paths.
@Ktmi Ktmi force-pushed the fix/cleanup_old_path branch from ef0596e to 1f46e55 Compare December 19, 2024 16:41
@Ktmi
Copy link
Author

Ktmi commented Dec 19, 2024

Performed a rebase to make it easier to cherry pick commits.

@Alopalao
Copy link

LGTM, I did some testing to see if the EVC locks are hold and seems to be working as intended. Great reworking of old_path management.

@italovalcy
Copy link

Hi David, I tested this PR using some of the scenarios we saw recently in production and all worked as expected. Very nice done! LGTM!

@Ktmi Ktmi mentioned this pull request Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants