[Feature] Add path failover approach to improve link down convergence #175
Conversation
…ormance tests reported in the PR #175
@italovalcy, fantastic results and contribution. The final convergence time in scenario 4c, sub-second or close to it for some of the statistics, is a tremendous improvement. Very informative and insightful results; it was really interesting to see that the batched/waited approach also worked well together with the other recent improvements that have been shipped. Also, I appreciated your help with exercising this more complete scenario with the recent fix on flow manager for the consistency check.
Co-authored-by: Vinicius Arcanjo <[email protected]>
…dictionary changed size during iteration
Co-authored-by: Vinicius Arcanjo <[email protected]>
Fix #156
This PR sits on top of PR #171
Description of the change
This pull request implements the concept of a failover path for fully dynamic EVCs (those in which dynamic_backup_path = True and neither primary_path nor backup_path is set), as per blueprint EP029. The main goal of this PR is to improve MEF E-Line convergence for Link Down events: when a Link Down event affects an EVC, MEF E-Line should be able to set up an alternative path as fast as possible, ensuring the protection of the L2VPN services.
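For reference, the eligibility condition above boils down to a simple predicate over the EVC attributes named in the description; the sketch below assumes an evc object exposing those attributes and is illustrative only, not the napp's actual code:

```python
# Minimal sketch of the "fully dynamic" check: dynamic_backup_path is enabled
# and no static primary_path/backup_path is configured.
def is_fully_dynamic(evc) -> bool:
    """Return True when the EVC is a candidate for a failover path."""
    return bool(
        evc.dynamic_backup_path
        and not evc.primary_path
        and not evc.backup_path
    )
```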
To do so, the following changes were made:
To guarantee the correctness of the proposed changes, many tests were executed. Following is a summary of the most important tests:
New unit tests were created, and existing tests were refactored to cover the introduced changes. After multiple executions, all tests passed, and no surprises were observed.
Code coverage increased from 88% to 95%. Below is the full description of coverage:
Before this PR:
After this PR:
Execution of unit tests multiple times (full log available here):
====== 158 passed, 3402 warnings in 84.90s (0:01:24) ======
====== 158 passed, 3402 warnings in 80.67s (0:01:20) ======
====== 158 passed, 3402 warnings in 86.53s (0:01:26) ======
====== 158 passed, 3402 warnings in 80.34s (0:01:20) ======
====== 158 passed, 3402 warnings in 80.00s (0:01:19) ======
====== 158 passed, 3402 warnings in 79.01s (0:01:19) ======
====== 158 passed, 3402 warnings in 78.81s (0:01:18) ======
====== 158 passed, 3402 warnings in 82.85s (0:01:22) ======
====== 158 passed, 3402 warnings in 78.71s (0:01:18) ======
====== 158 passed, 3402 warnings in 82.27s (0:01:22) ======
Some adjustments were necessary to the end-to-end tests, especially because the switches along the path now have extra flows for the failover path. The changes to the end-to-end tests are documented in PR kytos-ng/kytos-end-to-end-tests#144.
Below is the result of the end-to-end test execution with the changes proposed by both pull requests (GitLab Job #30320):
Scenario: using a Mininet execution environment with the Kytos docker container (kytos docker-compose services), based on AmlightTopo, simulating multiple Link Down events with the following types of EVCs:
3.a) 20 fully dynamic EVCs, all of which will be affected by one Link Down event on the current path (the best path available during the first deployment). All EVCs in this group can be provisioned with a maximally disjoint (fully disjoint) failover path.
3.b) 20 fully dynamic EVCs, all of which will be affected by two Link Down events. These EVCs have a single point of failure: one of the links hit by the Link Down events. In other words, the failover path won't be fully disjoint and will also be affected by a Link Down (at that moment, the EVCs have no possible path and remain inactive).
3.c) 20 fully dynamic EVCs, which will only have the failover_path affected by the Link Down (forcing them to recompute the failover path and set up the flows accordingly).
3.d) 20 EVCs with static paths (primary and backup), all of which will be affected by the Link Down event on the primary path (forcing them to converge to the backup path, but this time using the traditional MEF E-Line convergence methodology).
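For illustration, a fully dynamic EVC such as the ones in scenarios 3.a-3.c could be requested roughly as in the sketch below. Only the attribute names (dynamic_backup_path, primary_path, backup_path) come from this PR; the controller URL, interface IDs and VLAN tag are hypothetical, and a scenario 3.d EVC would additionally carry static primary_path/backup_path entries:

```python
# Sketch only: requesting a fully dynamic EVC through the mef_eline REST API.
# Interface IDs, VLAN tag and controller URL are hypothetical placeholders.
import requests

evc_payload = {
    "name": "epl-dynamic-01",
    "uni_a": {
        "interface_id": "00:00:00:00:00:00:00:01:1",
        "tag": {"tag_type": 1, "value": 101},
    },
    "uni_z": {
        "interface_id": "00:00:00:00:00:00:00:05:1",
        "tag": {"tag_type": 1, "value": 101},
    },
    "dynamic_backup_path": True,
    # A scenario 3.d EVC would instead define "primary_path" and "backup_path".
}

response = requests.post(
    "http://127.0.0.1:8181/api/kytos/mef_eline/v2/evc/",
    json=evc_payload,
    timeout=10,
)
response.raise_for_status()
print(response.json())
```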
Testing methodology (for each test execution):
The test itself evaluates packet loss and disconnection time from the end-user perspective. To do so, we send traffic from one endpoint to another on each EVC and measure the packet loss throughout the experiment. Below are the results:
A few words about the above results: first of all, scenario 3.c resulted in no packet loss (as expected, since only the failover path was affected by the Link Down); second, scenario 3.b is fully affected because those EVCs have a single point of failure (the UNI switch has a single uplink connection). The most significant results come from comparing scenario 3.a (EVCs with the proposed failover convergence) with scenario 3.d (EVCs with traditional convergence): the average disconnection time (with a 95% confidence interval) for 3.a is 3.29 sec +- 0.147, while scenario 3.d measures 78.48 sec +- 4.26.
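For the record, the disconnection-time statistics above can be derived from the per-run loss measurements with something along the lines of the sketch below; the probe interval and the sample values are placeholders, and a normal approximation (z = 1.96) is used for the 95% confidence interval:

```python
# Sketch: estimating disconnection time from lost probe packets and computing
# an approximate 95% confidence interval over the repeated runs. All numbers
# here are placeholders, not the measured results.
import statistics


def disconnection_time(packets_lost: int, probe_interval: float = 0.1) -> float:
    """Estimate disconnection time (seconds) from the number of lost probes."""
    return packets_lost * probe_interval


def mean_ci95(samples):
    """Return (mean, margin) using a normal approximation for 95% confidence."""
    mean = statistics.mean(samples)
    margin = 1.96 * statistics.stdev(samples) / (len(samples) ** 0.5)
    return mean, margin


runs = [disconnection_time(lost) for lost in (30, 34, 31, 36, 33, 29, 35, 32, 30, 34)]
mean, margin = mean_ci95(runs)
print(f"disconnection time: {mean:.2f}s +- {margin:.3f}s")
```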
It is worth mentioning that, since we are using emulation as the performance evaluation strategy, the hardware on which the experiments are executed matters (especially if you compare the results from 3.a with section 4 below). We used a virtual machine with 6 vCPUs (2.60GHz) and 8GB of RAM for the tests here.
In this test, we focused on EVCs that will benefit from the failover approach and considered a higher number of EVCs.
Testing methodology (for each test execution):
Each experiment was repeated 10 times. The host used to run the emulation was an Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz (12 cores) with 24GB of RAM.
The following scenarios were evaluated:
Here the idea was to experiment with possible values for BATCH_SIZE / BATCH_INTERVAL in order to define good defaults. Even though those values may be subject to future tuning in each production environment, having defaults that were minimally tested and evaluated may help network operators define their own.
The results above show that the worst numbers are obtained when we don't use BATCH mode. The reason is that, when no batch operations are applied, FlowManager receives a large number of FlowMod requests and becomes overwhelmed. The high number of requests to FlowManager is related not only to the path switch-over but also to removing the old path, reconfiguring a failover path for the affected EVCs, and so on. When we use batched requests with some pause time, we give FlowManager room to process the link-down hot-path requests first, while the other requests are processed later (in the background, as separate events).
The other results have their pros and cons. For instance, using BATCH_SIZE=40 and BATCH_INTERVAL=400ms gives the best minimum disconnection time (best convergence); however, when we consider the average plus confidence interval, SIZE=50 and INTERVAL=500ms fits better. Overall, the metrics show that SIZE=50 and INTERVAL=500ms delivered the best result.
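As a rough illustration (not the actual mef_eline code) of what the batched approach with the values above means in practice, the sketch below installs the pending FlowMods in chunks of BATCH_SIZE and pauses BATCH_INTERVAL between chunks; send_flow_mods is a hypothetical stand-in for the requests issued to flow_manager:

```python
import time

# Values discussed above; in practice these would be configurable settings.
BATCH_SIZE = 50
BATCH_INTERVAL = 0.5  # seconds


def install_in_batches(flow_mods, send_flow_mods,
                       batch_size=BATCH_SIZE, batch_interval=BATCH_INTERVAL):
    """Send flow mods in chunks, pausing between chunks.

    Pausing gives the flow manager room to process the link-down hot path
    first, instead of being flooded with every request at once.
    """
    for start in range(0, len(flow_mods), batch_size):
        send_flow_mods(flow_mods[start:start + batch_size])
        if start + batch_size < len(flow_mods):
            time.sleep(batch_interval)
```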
Release notes
See Changelog