Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added retry for requests used by EVCs #491

Merged
merged 8 commits into from
Aug 16, 2024
Merged

Added retry for requests used by EVCs #491

merged 8 commits into from
Aug 16, 2024

Conversation

Alopalao
Copy link

@Alopalao Alopalao commented Aug 8, 2024

Closes #483
Closes #184

Summary

Creation/deletion of flows error by Service Unavailable can't be stopped since number of requests can go to infinite.
This PR adds two crucial approaches to try to minimize Service Unavailable error results:

Add retry to requests made by EVCs (flow_manager and pathfinder)

The retry is applied with a random timing of 5 to seconds in three instances. Because Service Unavailable happens when multiple requests are send in a short period of time, the randomness is meant to help with concurrent requests.

Adding error_status to EVCs

Extra measures

  • Added extra field to the response for POST request when creating an EVC, now looks like {"circuit_id": "08352c061d3447", "deployed": false}.
  • Added extra field to the response for PATH request when updating EVC. Now it includes {'redeployed': false}
  • Added extra string to the response for PATCH request when redeploying EVC. If an error when deleting old flows an extra message with the error will be added to the response of 409.
    - Mechanic with evc.old_path has been changed to evc.old_paths which is now a list additionally containing all the old paths that failed to be deleted. This error will not change the state of error_status nor disable nor deactivate the EVC. Another event kytos/mef_eline.cleanup_evcs_old_path will be sent containing the failed EVCs. The handling of this event is now dynamic_single.
  • Introduced two new exceptions EVCPathNotDeleted and EVCPathNotInstalled. These are raised the the methods that catch FlowModException.
  • Removed requests in favor of httpx

Local Tests

Stress tested with 100 to 140 EVC concurrent deletions and creations. These add up to ~2000 EVCs per test. Tried with 150 requests but at that point mef_eline was failing which did not create any issues just lost requests. api_concurrency_limit was not changed.

End-to-End Tests

+ python3 -m pytest tests/ --reruns 2 -r fEr
============================= test session starts ==============================
platform linux -- Python 3.11.2, pytest-8.1.1, pluggy-1.5.0
rootdir: /tests
plugins: rerunfailures-13.0, timeout-2.2.0, anyio-4.3.0
collected 266 items

tests/test_e2e_01_kytos_startup.py ..                                    [  0%]
tests/test_e2e_05_topology.py ..................                         [  7%]
tests/test_e2e_06_topology.py ....                                       [  9%]
tests/test_e2e_10_mef_eline.py ..........ss.....x.....x................  [ 24%]
tests/test_e2e_11_mef_eline.py ......                                    [ 26%]
tests/test_e2e_12_mef_eline.py .....Xx.                                  [ 29%]
tests/test_e2e_13_mef_eline.py ....Xs.s.....Xs.s.XXxX.xxxx..X........... [ 44%]
.                                                                        [ 45%]
tests/test_e2e_14_mef_eline.py x                                         [ 45%]
tests/test_e2e_15_mef_eline.py .....                                     [ 47%]
tests/test_e2e_16_mef_eline.py .                                         [ 47%]
tests/test_e2e_20_flow_manager.py ......................                 [ 56%]
tests/test_e2e_21_flow_manager.py ...                                    [ 57%]
tests/test_e2e_22_flow_manager.py ...............                        [ 62%]
tests/test_e2e_23_flow_manager.py ..............                         [ 68%]
tests/test_e2e_30_of_lldp.py .R...                                       [ 69%]
tests/test_e2e_31_of_lldp.py ...                                         [ 70%]
tests/test_e2e_32_of_lldp.py ...                                         [ 71%]
tests/test_e2e_40_sdntrace.py ................                           [ 77%]
tests/test_e2e_41_kytos_auth.py ........                                 [ 80%]
tests/test_e2e_42_sdntrace.py ..                                         [ 81%]
tests/test_e2e_50_maintenance.py ............................            [ 92%]
tests/test_e2e_60_of_multi_table.py .....                                [ 93%]
tests/test_e2e_70_kytos_stats.py .....R...                               [ 96%]
tests/test_e2e_80_pathfinder.py ss......                                 [100%]

@Alopalao Alopalao requested a review from a team as a code owner August 8, 2024 16:26
Copy link
Member

@viniarck viniarck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the retries + set an error state is pretty great. But, there are other things that need to be adapted or changed, check out my comments to see if they make sense.

Also, changelog hasn't been updated, unit tests failing, happy to provide a partial code review though.

main.py Outdated Show resolved Hide resolved
models/evc.py Outdated Show resolved Hide resolved
db/models.py Outdated Show resolved Hide resolved
models/evc.py Outdated Show resolved Hide resolved
models/evc.py Show resolved Hide resolved
models/path.py Outdated Show resolved Hide resolved
models/evc.py Outdated Show resolved Hide resolved
models/evc.py Outdated Show resolved Hide resolved
models/evc.py Show resolved Hide resolved
main.py Outdated Show resolved Hide resolved
@Alopalao
Copy link
Author

New update.

  • Errors in the EVC does not longer disable it. This allows for kytos to redeploy if a topology event happens and affects said EVC.
  • Removed requests usage and replaced with httpx
  • Removed unused method from Path class get_best_path()
  • Changed old_path to be a single path again. Its flows are sent through an event now.
  • Tests are fixed but some line are still missing to be tested.

Copy link
Member

@viniarck viniarck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionality-wise it's almost there, and nicely done eliminating other tech debt along the way like replacing requests with httpx. There are some minor points to refactor and or fix, check out my new comments.

Other than that, I think it's worth discussing whether status_error should also include the high level action that needs to taken, which would be very valuable for users to fix themselves by calling endpoints and/or more automation with consistency check that we can perform the "recovery" action, think about it, let me know what you think and if you can cover that too, if it's worth it. (feel free not to cover unit test for that part until it's sorted out)

utils.py Outdated Show resolved Hide resolved
utils.py Outdated Show resolved Hide resolved
requirements/run.in Show resolved Hide resolved
models/evc.py Show resolved Hide resolved
models/evc.py Outdated Show resolved Hide resolved
models/evc.py Outdated Show resolved Hide resolved
models/evc.py Outdated Show resolved Hide resolved
models/evc.py Outdated Show resolved Hide resolved
Copy link
Member

@viniarck viniarck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alopalao, I'll recommend that you split the part of setting the error_status in a new PR, that's great as it is, but I'd like that one to evolve and be ready to also integrate well with consistency check eventually, and we're in the last days of the release, I don't we'll be able to sort it as thoroughly as we'd like to. That said, let's only handle retries here, and leave the rest of the behavior as it is, so that's already and improved, and then you can tackle and further specify the the rest of the error handling in next issue in a next version. At the end of the day, with the right concurrency limit this shouldn't be much of a concern so it's OK to delay this part in an exchange to make it gets as polished and as aligned with other changes, otherwise we might end up having to refactor again some values which would result in more rework and db scripts, OK? So, map another issue and we'll continue with that in a next release, proceed with finish this PR and the other 2024.1 issues, thanks.

@Alopalao
Copy link
Author

  • Removed error_status as a result I had to remove raises from methods that remove any flows. They do log.error() now.
  • Change timeout to 30 when sending flows

@viniarck viniarck self-requested a review August 15, 2024 16:35
Copy link
Member

@viniarck viniarck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Alopalao, great improvements, and I'm glad how this is evolving gradually. I'll leave it pre-approved. Before merging:

  • Augment to parametrize force=True and use it during failover clean up

Finally:

  • Map a new issue just so the rest of error handling that you started with error_status can be further specified and make official in a future release

main.py Show resolved Hide resolved
@viniarck
Copy link
Member

Let's ship it @Alopalao, feel free to merge.

@Alopalao Alopalao merged commit 2e91090 into master Aug 16, 2024
2 checks passed
@Alopalao Alopalao deleted the retry/flow_mod branch August 16, 2024 03:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants