
[Feature] Add path failover approach to improve link down convergence #175

Merged (14 commits into master on Aug 12, 2022)

Conversation

@italovalcy commented on Aug 8, 2022

Fix #156

This PR sits on top of PR #171

Description of the change

This pull request implements the concept of a failover path for fully dynamic EVCs (those in which dynamic_backup_path = True and not primary_path and not backup_path), as per blueprint EP029.

The main goal of this PR is to improve MEF E-Line convergence for Link Down events: when a Link Down event affects an EVC, MEF E-Line should be able to set up an alternative path as fast as possible, ensuring the protection of the L2VPN services.

To do so, the following changes were made:

  • We introduced the concept of a “disjoint path from the current path” (PR Adding disjoint path as per EP029 #171) - the failover path will be based on the maximum disjoint path from the current path (which increases the protection level of an EVC)
  • We designed a methodology to set up the failover path and leave it pre-installed during normal operation of an EVC
  • We refactored the Link Down event handler so that EVCs that have a failover path can benefit from faster convergence
  • We refactored the Link Up event handler so that the failover_path can eventually be reconfigured accordingly (a simplified sketch of this handling follows below)
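
To make the idea concrete, below is a minimal, self-contained sketch of the Link Down handling described above. It is illustrative only: the EVC class and the topology helpers are assumptions and do not mirror the actual mef_eline implementation.

# Illustrative sketch of the failover idea; the EVC class and the topology
# helpers are hypothetical and do not mirror the actual mef_eline code.
from dataclasses import dataclass, field

@dataclass
class EVC:
    name: str
    current_path: list = field(default_factory=list)   # list of link ids
    failover_path: list = field(default_factory=list)  # pre-installed, maximally disjoint

def handle_link_down(link, evcs, topology):
    """Converge EVCs affected by a link down event."""
    for evc in evcs:
        if link not in evc.current_path:
            continue
        if evc.failover_path and link not in evc.failover_path:
            # Fast convergence: the failover flows are already installed in the
            # switches, so only the redirection at the ingress/egress needs to
            # be pushed before traffic is restored.
            evc.current_path, evc.failover_path = evc.failover_path, []
        else:
            # Traditional convergence: compute and deploy a new dynamic path.
            evc.current_path = topology.best_path(evc)
        # In the background, pre-install a fresh failover path that is
        # maximally disjoint from the new current path.
        evc.failover_path = topology.max_disjoint_path(evc.current_path)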

To guarantee the correctness of the proposed changes, many tests were executed. Following is a summary of the most important tests:

  1. MEF E-Line Unit and Integration tests

New unit tests were created, and existing tests were refactored to cover the introduced changes. After multiple executions, all tests passed, and no surprises were observed.

Code coverage increased from 88% to 95%. Below is the full coverage report:

Before this PR:

Name                      Stmts   Miss  Cover
---------------------------------------------
__init__.py                   0      0   100%
controllers/__init__.py      31      0   100%
db/__init__.py                0      0   100%
db/models.py                 54      0   100%
exceptions.py                 5      0   100%
main.py                     496     66    87%
models/__init__.py            3      0   100%
models/evc.py               460     75    84%
models/path.py              125      7    94%
scheduler.py                 69      5    93%
settings.py                   8      0   100%
utils.py                     55      0   100%
---------------------------------------------
TOTAL                      1306    153    88%

After this PR:

Name                      Stmts   Miss  Cover
---------------------------------------------
__init__.py                   0      0   100%
controllers/__init__.py      31      0   100%
db/__init__.py                0      0   100%
db/models.py                 55      0   100%
exceptions.py                 5      0   100%
main.py                     552     42    92%
models/__init__.py            3      0   100%
models/evc.py               548     15    97%
models/path.py              125      7    94%
scheduler.py                 69      5    93%
settings.py                  10      0   100%
utils.py                     55      0   100%
---------------------------------------------
TOTAL                      1453     69    95%

The unit tests were executed multiple times (full log available here):

====== 158 passed, 3402 warnings in 84.90s (0:01:24) ======
====== 158 passed, 3402 warnings in 80.67s (0:01:20) ======
====== 158 passed, 3402 warnings in 86.53s (0:01:26) ======
====== 158 passed, 3402 warnings in 80.34s (0:01:20) ======
====== 158 passed, 3402 warnings in 80.00s (0:01:19) ======
====== 158 passed, 3402 warnings in 79.01s (0:01:19) ======
====== 158 passed, 3402 warnings in 78.81s (0:01:18) ======
====== 158 passed, 3402 warnings in 82.85s (0:01:22) ======
====== 158 passed, 3402 warnings in 78.71s (0:01:18) ======
====== 158 passed, 3402 warnings in 82.27s (0:01:22) ======

  2. End-to-End tests

Some adjustments were necessary to the end-to-end tests, especially because the switches along the path now have extra flows for the failover path. The changes to the end-to-end tests are documented in PR kytos-ng/kytos-end-to-end-tests#144.

Below is the result of the end-to-end test execution with the changes proposed by both pull requests (GitLab job #30320):

tests/test_e2e_01_kytos_startup.py ..                                    [  1%]
tests/test_e2e_05_topology.py ..................                         [ 10%]
tests/test_e2e_10_mef_eline.py .........X.......x....x.....              [ 24%]
tests/test_e2e_11_mef_eline.py .....                                     [ 26%]
tests/test_e2e_12_mef_eline.py .....xx.                                  [ 30%]
tests/test_e2e_13_mef_eline.py .....xxXx......xxXx.XXxX.xxxx..x......... [ 51%]
...                                                                      [ 52%]
tests/test_e2e_14_mef_eline.py x                                         [ 53%]
tests/test_e2e_15_maintenance.py ........................                [ 65%]
tests/test_e2e_20_flow_manager.py ................                       [ 73%]
tests/test_e2e_21_flow_manager.py ..                                     [ 74%]
tests/test_e2e_22_flow_manager.py ...............                        [ 81%]
tests/test_e2e_23_flow_manager.py X......................                [ 93%]
tests/test_e2e_30_of_lldp.py ....                                        [ 95%]
tests/test_e2e_31_of_lldp.py ...                                         [ 96%]
tests/test_e2e_32_of_lldp.py ...                                         [ 98%]
tests/test_e2e_40_sdntrace.py X.x                                        [100%]

==== 173 passed, 18 xfailed, 8 xpassed, 732 warnings in 10604.21s (2:56:44) ====

  3. Exploratory tests considering different types of EVCs

Scenario: a Mininet execution environment with the Kytos Docker container (kytos docker-compose services), based on AmlightTopo, simulating multiple Link Down events with the following types of EVCs:

3.a) 20 fully dynamic EVCs, which will all be affected by one Link Down event in the current path (the best path available during the first deployment). All EVCs in this group can be configured with a maximum disjoint path (fully disjoint path)

3.b) 20 fully dynamic EVCs, which will all be affected by two Link Down events. These EVCs have a single point of failure: one of the links that will be affected by the Link Down event. In other words, the failover path won't be fully disjoint and will also be affected by the Link Down (at that moment, the EVC has no possible path and will remain inactive)

3.c) 20 fully dynamic EVCs, which will only have the failover_path affected by the link down (forcing them to recompute the failover path and set up the flows accordingly)

3.d) 20 EVCs with static paths (primary and backup), which will all be affected by the link down event on the primary path (forcing them to converge to the backup path, this time using MEF E-Line's traditional convergence methodology)

Testing methodology (for each test execution):

  • Create the containers and the MongoDB database, and instantiate the Mininet topology
  • Change the switches' inactivity probe (60s)
  • Enable the Links, Interfaces, and Switches on Kytos
  • Create the 80 EVCs (4 groups of 20 EVCs; see the request sketch after this list)
  • Start iperf servers and clients for each of the EVCs with a duration of 180s (~100pps)
  • Wait 20 seconds
  • Simulate a link down between SoL2 and SCL-SW02 and between MIA1 and MIA8
  • Wait another 160 seconds so the experiment can finish
  • Save the experiment results and clean everything up
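
For reference, creating one of the fully dynamic EVCs used in these batches boils down to a request against the mef_eline REST API along the lines of the sketch below. The Kytos address and interface IDs are placeholders, not the ones used in the experiments.

# Sketch of how a fully dynamic EVC from these batches could be created via
# the mef_eline REST API; the Kytos address and interface IDs are placeholders.
import requests

MEF_ELINE_API = "http://127.0.0.1:8181/api/kytos/mef_eline/v2/evc/"

def create_dynamic_evc(name, uni_a, uni_z, vlan):
    payload = {
        "name": name,
        "uni_a": {"interface_id": uni_a, "tag": {"tag_type": 1, "value": vlan}},
        "uni_z": {"interface_id": uni_z, "tag": {"tag_type": 1, "value": vlan}},
        # Fully dynamic: no primary_path/backup_path, so the EVC is eligible
        # for the failover_path introduced in this PR.
        "dynamic_backup_path": True,
        "enabled": True,
    }
    response = requests.post(MEF_ELINE_API, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()

# Example: one EVC from group 3.a (VLAN range 1001-1020).
# create_dynamic_evc("evc-1001", "00:00:00:00:00:00:00:01:1",
#                    "00:00:00:00:00:00:00:05:1", 1001)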

The test itself evaluates the packet loss and disconnection time from the end user's perspective. To do so, we send traffic from one endpoint to the other on each EVC and measure the packet loss throughout the experiment. Below are the results:

--- results-batch-40-400ms-2022 (disconnection time)
VLAN range 1001 1020 (scenario 3.a)
min, 50, 90, 95, 99, max, mean, 95%-CI = 1.13 3.26 4.97 5.53 5.93 5.94 3.30 0.17
VLAN range 1101 1120 (scenario 3.b)
min, 50, 90, 95, 99, max, mean, 95%-CI = 153.24 153.81 154.27 154.40 154.47 154.62 153.83 0.04
VLAN range 1201 1220 (scenario 3.c)
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00
VLAN range 1301 1320 (scenario 3.d)
min, 50, 90, 95, 99, max, mean, 95%-CI = 24.63 71.83 121.06 121.75 134.19 137.64 78.48 4.26

A few words about the above results: first, scenario 3.c resulted in no packet loss (as expected, since only the failover path was affected by the link down); second, scenario 3.b is fully affected because those EVCs have a single point of failure (the UNI switch has a single uplink connection). The most expressive results come from comparing scenario 3.a (EVCs with the proposed failover convergence) with scenario 3.d (EVCs with traditional convergence): the average disconnection time (considering a 95% confidence interval) for 3.a is 3.29s +- 0.147, while scenario 3.d measures 78.48s +- 4.26.

It is worth mentioning that, since we are running emulation as a performance evaluation strategy, the hardware on which the experiments are executed matters (especially if you compare the results from 3.a with section 4 below). We are using a virtual machine with 6 vCPUs (2.60GHz) and 8GB of RAM for the tests here.
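
To illustrate the metric itself, the disconnection time per EVC can be estimated from per-second loss samples of the iperf flow with something like the hypothetical helper below (not the actual measurement scripts used here).

# Hypothetical helper, not the actual measurement scripts: given per-second
# received-packet counts for one EVC's iperf flow (~100 pps), estimate the
# disconnection time as the total time during which no traffic got through.
def disconnection_time(received_per_second, interval=1.0):
    """Return the number of seconds in which zero packets were received."""
    return sum(interval for count in received_per_second if count == 0)

# Example: a 3-second outage within a 10-second window.
samples = [100, 100, 0, 0, 0, 98, 100, 100, 100, 100]
print(disconnection_time(samples))  # 3.0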

  4. Exploratory tests considering fully dynamic EVCs (i.e., eligible for failover_path)

In this test, we focused on EVCs that benefit from the failover approach and considered a higher number of EVCs.

Testing methodology (for each test execution):

  • Create the containers and the MongoDB database, and instantiate the Mininet topology
  • Change the switches' inactivity probe (60s)
  • Enable the Links, Interfaces, and Switches on Kytos
  • Create the 100 EVCs
  • Start iperf servers and clients for each of the EVCs with a duration of 300s (~100pps)
  • Wait 20 seconds
  • Simulate a link down between SoL2 and SCL-SW02
  • Wait another 280 seconds so the experiment can finish
  • Save the experiment results and clean everything up

Each experiment was repeated 10 times. The host used to run the emulation was an Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz (12 cores) with 24GB of RAM.

The following scenarios were evaluated:

-- scenario 4.a (results-kytos2022.1-2022)
min, 50, 90, 95, 99, max, mean, 95%-CI = 3.61 204.98 278.27 280.00 280.13 280.18 183.78 7.44
-- scenario 4.b (results-master-fix2-2022)
min, 50, 90, 95, 99, max, mean, 95%-CI = 1.34 92.02 166.13 175.12 183.54 188.90 92.43 4.69
-- scenario 4.c (results-batch-40-400ms-fix2-2022)
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.21 0.85 1.18 1.35 1.62 1.67 0.78 0.02

  5. Performance evaluation to determine the BATCH_SIZE/BATCH_INTERVAL parameters

Here the idea was to experiment with possible values for BATCH_SIZE / BATCH_INTERVAL in order to define good defaults. Even though these values may need tuning in each production environment, defaults that were minimally tested and evaluated can help network operators define their own.

results-batch-40-400ms-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.21 0.85 1.18 1.35 1.62 1.67 0.78 0.02
--
results-no-batch-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.48 1.13 1.60 1.94 2.19 2.23 1.17 0.03
--
results-batch-50-500ms-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.24 0.63 1.22 1.45 1.52 1.54 0.71 0.03
--
results-batch-40-200ms-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.35 0.71 1.15 1.32 1.39 1.41 0.73 0.02
--
results-batch-20-200ms-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.22 1.05 1.30 1.34 1.36 1.37 0.92 0.03

The results above show that the worst results are obtained when we don't use batch mode. The reason is that, when no batch operations are applied, FlowManager receives a lot of FlowMod requests and starts getting overwhelmed. The high number of requests to FlowManager is not only related to the path switchover but also to removing the old path, reconfiguring a failover path for the affected EVCs, and so on. When we use batch requests with a pause between them, we give FlowManager room to process the link down hot-path requests first; the remaining requests are processed later in the background, as separate events.

The other results have their pros and cons. For instance, using BATCH_SIZE=40 and BATCH_INTERVAL=400ms gives the best minimum disconnection time (best convergence); however, when we consider the average plus the confidence interval, SIZE=50 and INTERVAL=500ms fits better. Overall, the metrics show that SIZE=50 and INTERVAL=500ms delivered the best result.
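
For reference, the batched/paused submission discussed above amounts to something like the sketch below. It is illustrative only: send_flow_mods is a placeholder for the call that hands flows to flow_manager, and the values mirror the SIZE=50 / INTERVAL=500ms combination that performed best here.

# Illustrative sketch of batched FlowMod submission with a pause between
# batches; send_flow_mods is a placeholder for the flow_manager call.
import time

BATCH_SIZE = 50        # flows per request
BATCH_INTERVAL = 0.5   # seconds to wait between batches

def send_in_batches(flow_mods, send_flow_mods):
    """Send FlowMods in batches so flow_manager is not overwhelmed."""
    for start in range(0, len(flow_mods), BATCH_SIZE):
        send_flow_mods(flow_mods[start:start + BATCH_SIZE])
        if start + BATCH_SIZE < len(flow_mods):
            time.sleep(BATCH_INTERVAL)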

Release notes

See Changelog

@italovalcy italovalcy changed the base branch from master to feat/disjoint_path August 8, 2022 18:03
Base automatically changed from feat/disjoint_path to master August 11, 2022 20:01
@italovalcy italovalcy requested a review from a team August 11, 2022 20:10
@viniarck (Member) left a comment:


@italovalcy, fantastic results and contribution; the final convergence time in scenario 4.c, sub-second or close to it (for some statistic values), is a tremendous improvement. Very informative and insightful results; it was really interesting to see that the batched/paused approach worked well together with the other recent improvements that have been shipped. Also, I appreciated your help with exercising this more complete scenario with the recent consistency check fix in flow_manager.


@italovalcy italovalcy merged commit b235eed into master Aug 12, 2022
@italovalcy italovalcy deleted the feature/failover branch August 12, 2022 15:34