
[Feature] Add path failover approach to improve link down convergence #175

Merged (14 commits into master on Aug 12, 2022)

Conversation

@italovalcy commented on Aug 8, 2022

Fix #156

This PR sits on top of PR #171

Description of the change

This pull request implements the concept of a failover path for fully dynamic EVCs (those in which dynamic_backup_path = True and not primary_path and not backup_path), as per blueprint EP029.

The main goal of this PR is to improve MEF E-Line convergence for Link Down events: when a Link Down event affects an EVC, MEF E-Line should be able to set up an alternative path as fast as possible, ensuring the protection of the L2VPN services.

To do so, the following changes were made:

  • We introduced the concept of a “disjoint path from the current path” (PR Adding disjoint path as per EP029 #171) - the failover path will be based on the maximum disjoint path from the current path (which increases the protection level of an EVC)
  • We designed a methodology to set up the failover path and leave it pre-installed during normal operation of an EVC
  • We refactored the Link Down event handler so that EVCs that have a failover path can benefit from faster convergence
  • We refactored the Link Up event handler so that the failover_path can eventually be reconfigured accordingly (a simplified sketch of this handling follows below)
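
To make the idea concrete, below is a minimal, self-contained sketch of the Link Down handling described above. It is illustrative only: the EVC class and the topology helpers are assumptions and do not mirror the actual mef_eline implementation.

# Illustrative sketch of the failover idea; the EVC class and the topology
# helpers are hypothetical and do not mirror the actual mef_eline code.
from dataclasses import dataclass, field

@dataclass
class EVC:
    name: str
    current_path: list = field(default_factory=list)   # list of link ids
    failover_path: list = field(default_factory=list)  # pre-installed, maximally disjoint

def handle_link_down(link, evcs, topology):
    """Converge EVCs affected by a link down event."""
    for evc in evcs:
        if link not in evc.current_path:
            continue
        if evc.failover_path and link not in evc.failover_path:
            # Fast convergence: the failover flows are already installed in the
            # switches, so only the redirection at the ingress/egress needs to
            # be pushed before traffic is restored.
            evc.current_path, evc.failover_path = evc.failover_path, []
        else:
            # Traditional convergence: compute and deploy a new dynamic path.
            evc.current_path = topology.best_path(evc)
        # In the background, pre-install a fresh failover path that is
        # maximally disjoint from the new current path.
        evc.failover_path = topology.max_disjoint_path(evc.current_path)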

To guarantee the correctness of the proposed changes, many tests were executed. Following is a summary of the most important tests:

  1. MEF E-Line Unit and Integration tests

New unit tests were created, and existing tests were refactored to cover the introduced changes. After multiple executions, all tests passed, and no surprises were observed.

Code coverage increased from 88% to 95%. Below is the full coverage report:

Before this PR:

Name                      Stmts   Miss  Cover
---------------------------------------------
__init__.py                   0      0   100%
controllers/__init__.py      31      0   100%
db/__init__.py                0      0   100%
db/models.py                 54      0   100%
exceptions.py                 5      0   100%
main.py                     496     66    87%
models/__init__.py            3      0   100%
models/evc.py               460     75    84%
models/path.py              125      7    94%
scheduler.py                 69      5    93%
settings.py                   8      0   100%
utils.py                     55      0   100%
---------------------------------------------
TOTAL                      1306    153    88%

After this PR:

Name                      Stmts   Miss  Cover
---------------------------------------------
__init__.py                   0      0   100%
controllers/__init__.py      31      0   100%
db/__init__.py                0      0   100%
db/models.py                 55      0   100%
exceptions.py                 5      0   100%
main.py                     552     42    92%
models/__init__.py            3      0   100%
models/evc.py               548     15    97%
models/path.py              125      7    94%
scheduler.py                 69      5    93%
settings.py                  10      0   100%
utils.py                     55      0   100%
---------------------------------------------
TOTAL                      1453     69    95%

The unit tests were executed multiple times (full log available here):

====== 158 passed, 3402 warnings in 84.90s (0:01:24) ======
====== 158 passed, 3402 warnings in 80.67s (0:01:20) ======
====== 158 passed, 3402 warnings in 86.53s (0:01:26) ======
====== 158 passed, 3402 warnings in 80.34s (0:01:20) ======
====== 158 passed, 3402 warnings in 80.00s (0:01:19) ======
====== 158 passed, 3402 warnings in 79.01s (0:01:19) ======
====== 158 passed, 3402 warnings in 78.81s (0:01:18) ======
====== 158 passed, 3402 warnings in 82.85s (0:01:22) ======
====== 158 passed, 3402 warnings in 78.71s (0:01:18) ======
====== 158 passed, 3402 warnings in 82.27s (0:01:22) ======

  2. End-to-End tests

Some adjustments were necessary to the end-to-end tests, especially because the switches along the path now have extra flows for the failover path. The changes to the end-to-end tests are documented in PR kytos-ng/kytos-end-to-end-tests#144.

Below is the result of the end-to-end test execution with the changes proposed by both pull requests (GitLab job #30320):

tests/test_e2e_01_kytos_startup.py ..                                    [  1%]
tests/test_e2e_05_topology.py ..................                         [ 10%]
tests/test_e2e_10_mef_eline.py .........X.......x....x.....              [ 24%]
tests/test_e2e_11_mef_eline.py .....                                     [ 26%]
tests/test_e2e_12_mef_eline.py .....xx.                                  [ 30%]
tests/test_e2e_13_mef_eline.py .....xxXx......xxXx.XXxX.xxxx..x......... [ 51%]
...                                                                      [ 52%]
tests/test_e2e_14_mef_eline.py x                                         [ 53%]
tests/test_e2e_15_maintenance.py ........................                [ 65%]
tests/test_e2e_20_flow_manager.py ................                       [ 73%]
tests/test_e2e_21_flow_manager.py ..                                     [ 74%]
tests/test_e2e_22_flow_manager.py ...............                        [ 81%]
tests/test_e2e_23_flow_manager.py X......................                [ 93%]
tests/test_e2e_30_of_lldp.py ....                                        [ 95%]
tests/test_e2e_31_of_lldp.py ...                                         [ 96%]
tests/test_e2e_32_of_lldp.py ...                                         [ 98%]
tests/test_e2e_40_sdntrace.py X.x                                        [100%]

==== 173 passed, 18 xfailed, 8 xpassed, 732 warnings in 10604.21s (2:56:44) ====

  3. Exploratory tests considering different types of EVCs

Scenario: a Mininet execution environment with the Kytos Docker container (kytos docker-compose services), based on AmlightTopo, simulating multiple Link Down events with the following types of EVCs:

3.a) 20 fully dynamic EVCs, which will all be affected by one Link Down event in the current path (the best path available during the first deployment). All EVCs in this group can be configured with a maximum disjoint path (fully disjoint path)

3.b) 20 fully dynamic EVCs, which will all be affected by two Link Down events. These EVCs have a single point of failure: one of the links that will be affected by the Link Down event. In other words, the failover path won't be fully disjoint and will also be affected by the Link Down (at that moment, the EVC has no possible path and will remain inactive)

3.c) 20 fully dynamic EVCs, which will only have the failover_path affected by the link down (forcing them to recompute the failover path and set up the flows accordingly)

3.d) 20 EVCs with static paths (primary and backup), which will all be affected by the link down event on the primary path (forcing them to converge to the backup path, this time using MEF E-Line's traditional convergence methodology)

Testing methodology (for each test execution):

  • Create the containers and the MongoDB database, and instantiate the Mininet topology
  • Change the switches' inactivity probe (60s)
  • Enable the Links, Interfaces, and Switches on Kytos
  • Create the 80 EVCs (4 groups of 20 EVCs; see the request sketch after this list)
  • Start iperf servers and clients for each of the EVCs with a duration of 180s (~100pps)
  • Wait 20 seconds
  • Simulate a link down between SoL2 and SCL-SW02 and between MIA1 and MIA8
  • Wait another 160 seconds so the experiment can finish
  • Save the experiment results and clean everything up
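
For reference, creating one of the fully dynamic EVCs used in these batches boils down to a request against the mef_eline REST API along the lines of the sketch below. The Kytos address and interface IDs are placeholders, not the ones used in the experiments.

# Sketch of how a fully dynamic EVC from these batches could be created via
# the mef_eline REST API; the Kytos address and interface IDs are placeholders.
import requests

MEF_ELINE_API = "http://127.0.0.1:8181/api/kytos/mef_eline/v2/evc/"

def create_dynamic_evc(name, uni_a, uni_z, vlan):
    payload = {
        "name": name,
        "uni_a": {"interface_id": uni_a, "tag": {"tag_type": 1, "value": vlan}},
        "uni_z": {"interface_id": uni_z, "tag": {"tag_type": 1, "value": vlan}},
        # Fully dynamic: no primary_path/backup_path, so the EVC is eligible
        # for the failover_path introduced in this PR.
        "dynamic_backup_path": True,
        "enabled": True,
    }
    response = requests.post(MEF_ELINE_API, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()

# Example: one EVC from group 3.a (VLAN range 1001-1020).
# create_dynamic_evc("evc-1001", "00:00:00:00:00:00:00:01:1",
#                    "00:00:00:00:00:00:00:05:1", 1001)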

The test itself evaluates the packet loss and disconnection time from the end user's perspective. To do so, we send traffic from one endpoint to the other on each EVC and measure the packet loss throughout the experiment. Below are the results:

--- results-batch-40-400ms-2022 (disconnection time)
VLAN range 1001 1020 (scenario 3.a)
min, 50, 90, 95, 99, max, mean, 95%-CI = 1.13 3.26 4.97 5.53 5.93 5.94 3.30 0.17
VLAN range 1101 1120 (scenario 3.b)
min, 50, 90, 95, 99, max, mean, 95%-CI = 153.24 153.81 154.27 154.40 154.47 154.62 153.83 0.04
VLAN range 1201 1220 (scenario 3.c)
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.00 0.00 0.00 0.00 0.00 0.01 0.00 0.00
VLAN range 1301 1320 (scenario 3.d)
min, 50, 90, 95, 99, max, mean, 95%-CI = 24.63 71.83 121.06 121.75 134.19 137.64 78.48 4.26

A few words about the above results: first, scenario 3.c resulted in no packet loss (as expected, since only the failover path was affected by the link down); second, scenario 3.b is fully affected because those EVCs have a single point of failure (the UNI switch has a single uplink connection). The most expressive results come from comparing scenario 3.a (EVCs with the proposed failover convergence) with scenario 3.d (EVCs with traditional convergence): the average disconnection time (considering a 95% confidence interval) for 3.a is 3.29s +- 0.147, while scenario 3.d measures 78.48s +- 4.26.

It is worth mentioning that, since we are running emulation as a performance evaluation strategy, the hardware on which the experiments are executed matters (especially if you compare the results from 3.a with section 4 below). We are using a virtual machine with 6 vCPUs (2.60GHz) and 8GB of RAM for the tests here.
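
To illustrate the metric itself, the disconnection time per EVC can be estimated from per-second loss samples of the iperf flow with something like the hypothetical helper below (not the actual measurement scripts used here).

# Hypothetical helper, not the actual measurement scripts: given per-second
# received-packet counts for one EVC's iperf flow (~100 pps), estimate the
# disconnection time as the total time during which no traffic got through.
def disconnection_time(received_per_second, interval=1.0):
    """Return the number of seconds in which zero packets were received."""
    return sum(interval for count in received_per_second if count == 0)

# Example: a 3-second outage within a 10-second window.
samples = [100, 100, 0, 0, 0, 98, 100, 100, 100, 100]
print(disconnection_time(samples))  # 3.0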

  4. Exploratory tests considering fully dynamic EVCs (i.e., eligible for failover_path)

In this test, we focused on EVCs that benefit from the failover approach and considered a higher number of EVCs.

Testing methodology (for each test execution):

  • Create the containers and the MongoDB database, and instantiate the Mininet topology
  • Change the switches' inactivity probe (60s)
  • Enable the Links, Interfaces, and Switches on Kytos
  • Create the 100 EVCs
  • Start iperf servers and clients for each of the EVCs with a duration of 300s (~100pps)
  • Wait 20 seconds
  • Simulate a link down between SoL2 and SCL-SW02
  • Wait another 280 seconds so the experiment can finish
  • Save the experiment results and clean everything up

Each experiment was repeated 10 times. The host used to run the emulation was an Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz (12 cores) with 24GB of RAM.

The following scenarios were evaluated:

-- scenario 4.a (results-kytos2022.1-2022)
min, 50, 90, 95, 99, max, mean, 95%-CI = 3.61 204.98 278.27 280.00 280.13 280.18 183.78 7.44
-- scenario 4.b (results-master-fix2-2022)
min, 50, 90, 95, 99, max, mean, 95%-CI = 1.34 92.02 166.13 175.12 183.54 188.90 92.43 4.69
-- scenario 4.c (results-batch-40-400ms-fix2-2022)
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.21 0.85 1.18 1.35 1.62 1.67 0.78 0.02

  5. Performance evaluation to determine the BATCH_SIZE/BATCH_INTERVAL parameters

Here the idea was to experiment with possible values for BATCH_SIZE / BATCH_INTERVAL in order to define good defaults. Even though these values may need tuning in each production environment, defaults that were minimally tested and evaluated can help network operators define their own.

results-batch-40-400ms-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.21 0.85 1.18 1.35 1.62 1.67 0.78 0.02
--
results-no-batch-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.48 1.13 1.60 1.94 2.19 2.23 1.17 0.03
--
results-batch-50-500ms-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.24 0.63 1.22 1.45 1.52 1.54 0.71 0.03
--
results-batch-40-200ms-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.35 0.71 1.15 1.32 1.39 1.41 0.73 0.02
--
results-batch-20-200ms-flow_manager-fix2-2022
min, 50, 90, 95, 99, max, mean, 95%-CI = 0.22 1.05 1.30 1.34 1.36 1.37 0.92 0.03

The results above show that the worst results are obtained when we don't use batch mode. The reason is that, when no batch operations are applied, FlowManager receives a lot of FlowMod requests and starts getting overwhelmed. The high number of requests to FlowManager is not only related to the path switchover but also to removing the old path, reconfiguring a failover path for the affected EVCs, and so on. When we use batch requests with a pause between them, we give FlowManager room to process the link down hot-path requests first; the remaining requests are processed later in the background, as separate events.

The other results have their pros and cons. For instance, using BATCH_SIZE=40 and BATCH_INTERVAL=400ms gives the best minimum disconnection time (best convergence); however, when we consider the average plus the confidence interval, SIZE=50 and INTERVAL=500ms fits better. Overall, the metrics show that SIZE=50 and INTERVAL=500ms delivered the best result.
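
For reference, the batched/paused submission discussed above amounts to something like the sketch below. It is illustrative only: send_flow_mods is a placeholder for the call that hands flows to flow_manager, and the values mirror the SIZE=50 / INTERVAL=500ms combination that performed best here.

# Illustrative sketch of batched FlowMod submission with a pause between
# batches; send_flow_mods is a placeholder for the flow_manager call.
import time

BATCH_SIZE = 50        # flows per request
BATCH_INTERVAL = 0.5   # seconds to wait between batches

def send_in_batches(flow_mods, send_flow_mods):
    """Send FlowMods in batches so flow_manager is not overwhelmed."""
    for start in range(0, len(flow_mods), BATCH_SIZE):
        send_flow_mods(flow_mods[start:start + BATCH_SIZE])
        if start + BATCH_SIZE < len(flow_mods):
            time.sleep(BATCH_INTERVAL)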

Release notes

See Changelog

@italovalcy italovalcy changed the base branch from master to feat/disjoint_path August 8, 2022 18:03
Base automatically changed from feat/disjoint_path to master August 11, 2022 20:01
@italovalcy italovalcy requested a review from a team August 11, 2022 20:10
@viniarck (Member) left a comment:


@italovalcy, fantastic results and contribution; the final convergence time in scenario 4.c, sub-second or close to it (for some statistic values), is a tremendous improvement. Very informative and insightful results; it was really interesting to see that the batched/paused approach worked well together with the other recent improvements that have been shipped. Also, I appreciated your help with exercising this more complete scenario with the recent consistency check fix in flow_manager.


@italovalcy italovalcy merged commit b235eed into master Aug 12, 2022
@italovalcy italovalcy deleted the feature/failover branch August 12, 2022 15:34