Paced Flow Manager #187

Merged
merged 13 commits into master from paced/flow_manager on Jul 31, 2024

Conversation


@Ktmi Ktmi commented Jul 1, 2024

Part of resolving kytos-ng/of_multi_table#27, kytos-ng/mef_eline#372, and kytos-ng/telemetry_int#47.

Summary

This adds a pacer to flow_manager, pacing flow mods sent to switches. This does not fully resolve replacing batching, as it removes the control that NApps have over the process. Possible ways to fully resolve this are to implement pacing on the producer end, or to use the owner tag on stored flows so that a flow_mod is paced with the pacer of the NApp that requested it.

Additionally, this allows setting the pace of flow mods on a per-owner basis. This feature is dependent on this PR for kytos.
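
For illustration, here's a minimal, self-contained sketch of the per-owner, per-dpid fixed-window pacing idea described above. The names (FLOW_MOD_PACES, FixedWindowPacer) and the rates are assumptions for this sketch only, not this PR's actual settings or kytos' pacing API:

```python
import time
from collections import defaultdict

# Hypothetical per-owner rates (flow mods per second); "no_owner" is the fallback.
FLOW_MOD_PACES = {
    "no_owner": 100,
    "mef_eline": 50,
    "telemetry_int": 50,
}

class FixedWindowPacer:
    """Toy fixed-window pacer keyed by (owner, dpid)."""

    def __init__(self, paces):
        self.paces = paces
        # (owner, dpid) -> [window_start, count]
        self.windows = defaultdict(lambda: [0.0, 0])

    def hit(self, owner, dpid):
        """Block until a flow mod for (owner, dpid) fits in the current 1-second window."""
        rate = self.paces.get(owner, self.paces["no_owner"])
        key = (owner, dpid)
        while True:
            now = time.monotonic()
            start, count = self.windows[key]
            if now - start >= 1.0:
                self.windows[key] = [now, 1]
                return
            if count < rate:
                self.windows[key][1] += 1
                return
            time.sleep(start + 1.0 - now)

# pacer = FixedWindowPacer(FLOW_MOD_PACES)
# pacer.hit("mef_eline", "00:00:00:00:00:00:00:01")  # call before sending each flow mod
```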

Local Tests

Tested creating and removing flows. Works as expected. Paces apply appropriately on a per-owner basis.

End-to-End Tests

Will need to rerun E2E tests

@Ktmi Ktmi requested a review from a team as a code owner July 1, 2024 21:04
@Ktmi Ktmi changed the title Paced/flow manager Paced Flow Manager Jul 1, 2024

@viniarck viniarck left a comment

Great. It turned out really well to have all the pacers on flow_manager pacing per owner per dpid, with the default no_owner fallback. Nicely done.

I requested a few changes though. Also, the changelog hasn't been updated; this will land once the PRs replacing the pacers are ready to land too.

For the record, can you post the Wireshark FlowMod pps chart from the local test? Just mef_eline is good enough for me, with the current fixed_window at 100/second and another at 50/second. It'd be great to have that documented here too, so that in the future it can also serve as a reference for whoever is upgrading the library or whatever they might be looking for.

@Ktmi Ktmi commented Jul 9, 2024

@viniarck I think I noticed an odd quirk of the code regarding _install_flows. If _install_flows is used to send flows to multiple switches, it will send duplicate flows. This is due to the flow_mods intended for specific switches being sent to every switch in the _send_flow_mods phase of the process. What would be preferable is if the switch were provided in a tuple or dataclass alongside the flow_mod.

@viniarck viniarck commented Jul 10, 2024

@Ktmi,

If _install_flows is used to send flows to multiple switches, it will send duplicate flows.

Yes, that's the intended behavior. It's for POST v2/flows, which sends to all enabled switches.

What would be preferable is if the switch were provided in a tuple or dataclass alongside the flow_mod.

All things considered, with how the code has been structured historically, I'd say let's only pass an extra list of owners, include it in the tuple, and call it a day. Let's leave a potential refactor to a dataclass for the future; we'd need to linearize the nested for loop as you said, and then there would be lots of unit tests to refactor too. It's probably not worth the effort considering the other major tasks/epics that we've got on our radar.
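
To make the tuple idea above concrete, here's a minimal, self-contained sketch; the helper names and shapes are illustrative assumptions, not flow_manager's actual _install_flows/_send_flow_mods code:

```python
# Sketch: pair each flow mod with the dpid and owner it belongs to, so the
# send phase never fans a flow mod out to every switch.

def build_pending(switch_dpids, flow_dicts, owner="no_owner"):
    """Return (dpid, flow_dict, owner) tuples, one per switch/flow pair."""
    return [
        (dpid, flow, owner)
        for dpid in switch_dpids
        for flow in flow_dicts
    ]

def send_pending(pending, send):
    """Send each flow mod only to its paired switch via the given callable."""
    for dpid, flow, owner in pending:
        send(dpid, flow, owner)

if __name__ == "__main__":
    pending = build_pending(["00:01", "00:02"], [{"priority": 10}], owner="mef_eline")
    send_pending(pending, lambda dpid, flow, owner: print(dpid, owner, flow))
```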

@viniarck viniarck self-requested a review July 16, 2024 15:11

@viniarck viniarck left a comment

@Ktmi, functionality-wise it's all set from what I reviewed. Very neat how this has been integrated per dpid, differentiating per flow owner. What's still remaining before I approve:

  1. Changelog needs to be updated
  2. Can you post Wireshark IO charts of flow mods per dpid on this PR, documenting that "fixed_window" is working as expected? I'd like to see that once, and then it can serve as a historical record; if upstream changes one day, we can look back at the chart and compare. I'd like to see a producer going over the pacer rate, including sending to different dpids, and the expectation is that no connections get dropped due to OF keepalives getting delayed on either side.

@Ktmi Ktmi commented Jul 22, 2024

Here is a demonstration of limiting flow mods for mef_eline. I paced the action to 10 per second while trying to deploy a few hundred EVCs. Be aware, this doesn't exclude flow mods from other NApps.

[chart: FlowMod_pace_test_10perSecond]

@viniarck

Here is a demonstration of limiting flow mods for mef_eline. I paced the action to 10 per second while trying to deploy a few hundred EVCs. Be aware, this doesn't exclude flow mods from other NApps.

[chart: FlowMod_pace_test_10perSecond]

This is good; now go over the default configured rates (100/second), stressing over multiple dpids. The time to stress test this is now. I wanna see if it's stable with that value too. Probably no surprises, but let's confirm in practice. If that's solid too, this is good to get merged.

@viniarck

^ and then sustain it for a few mins; if no connections are dropped, that's great.

@Ktmi Ktmi commented Jul 24, 2024

So, here are four cases of deploying flows in flow_manager with a pace of 100 per second per switch. The blue line represents flow mods sent to a single switch, the green the flow mods sent to all switches. Each test is separated by a 10-second interval.

[chart: flow_deployment3]

The first test is deploying 1000 flows in a single request to a single switch. This results in sending about 100 flow mods per second, as expected.

The second test is deploying 1 flow at a time 1000 times to a single switch. The results of this are fairly similar to the results of the first test, with about 100 flow mods per second.

The third test is deploying 1000 flows at a time to 9 different switches simultaneously. The process of sending the flows took about 20 seconds, which means that we were sending flow mods at about 50 per second in this test. Very strange; not sure why that happens. Additionally, this test took about 20 seconds to get started, even though the interval between tests was supposed to be only 10 seconds. This indicates that a lot of time is being spent preparing the flow mods to be sent.

The fourth test is deploying 1 flow at a time, 1000 times, to 9 different switches. This was done with a few worker threads assigned to each switch to ensure that requests were fairly distributed and that pending requests for any given switch wouldn't block sending requests to others. The flow mod rate just plummets with this setup, hovering around 100 per second across all switches combined. This might just be a limit of how fast messages can be processed by the API.
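
For reference, here's a rough sketch of how test 4 could be reproduced. The endpoint and payload follow flow_manager's v2 REST API, but the host, dpids, VLAN values, and thread counts are illustrative assumptions, not the exact harness used here:

```python
# One flow per request, many concurrent requests across 9 switches.
from concurrent.futures import ThreadPoolExecutor

import requests

BASE = "http://127.0.0.1:8181/api/kytos/flow_manager/v2/flows"
DPIDS = [f"00:00:00:00:00:00:00:{i:02x}" for i in range(1, 10)]  # 9 switches
FLOWS_PER_SWITCH = 1000

def push_one_flow(dpid, vlan):
    """POST a single flow to one switch."""
    payload = {"flows": [{"priority": 10, "match": {"dl_vlan": vlan}}]}
    resp = requests.post(f"{BASE}/{dpid}", json=payload, timeout=30)
    return resp.status_code

def main():
    with ThreadPoolExecutor(max_workers=3 * len(DPIDS)) as pool:
        futures = [
            pool.submit(push_one_flow, dpid, vlan)
            for dpid in DPIDS
            for vlan in range(1, FLOWS_PER_SWITCH + 1)
        ]
        codes = [future.result() for future in futures]
    print({code: codes.count(code) for code in set(codes)})

if __name__ == "__main__":
    main()
```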

@viniarck viniarck commented Jul 25, 2024

@Ktmi, good insights. Let's try to further investigate cases 3 and 4 to better understand.

A few things that come to mind to quickly brainstorm with you, in case it helps:

  • Try using the APM too and keep an eye on the highest-impact handlers that pop up
  • Did you notice if the concurrency_limit ended up being reached in the API, resulting in 503s? Especially in test case 4
  • It'd be cool to also compare the same tests with the line hitting the pacer commented out. Does it achieve the expected rates?
  • You can potentially explore sending flows to flow_manager via events from another NApp, with kytos.flow_manager.flows.install (runs on the app thread pool) or kytos.flow_manager.flows.single.install (dynamic single thread) -> this one would only perform well if many flows are sent in the event. But based on what you described, it looks like only the fourth test had too many requests. (A sketch of the event-based option is included below.)
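
A short sketch of that event-based option, sent from another NApp; KytosEvent and the app buffer are kytos core APIs, while the content keys ("dpid", "flow_dict") are my reading of flow_manager's event interface and should be double-checked against its docs:

```python
from kytos.core import KytosEvent

def request_flows_install(controller, dpid, flows):
    """Ask flow_manager to install many flows in a single event."""
    event = KytosEvent(
        name="kytos.flow_manager.flows.install",
        content={"dpid": dpid, "flow_dict": {"flows": flows}},
    )
    controller.buffers.app.put(event)

# e.g. from inside a NApp method:
# request_flows_install(self.controller, "00:00:00:00:00:00:00:01",
#                       [{"priority": 10, "match": {"dl_vlan": 100}}])
```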

@Ktmi Ktmi commented Jul 26, 2024

So I ran the flow deployment tests again with the pacing code commented out, and here's what I got for results:

[chart: flow_deployment_no_limit]

It's incredibly similar to what we got from doing the tests with pacing.

I also redid the tests with the APM. I instrumented a good portion of the surrounding code.

[APM screenshot]

What I found is that when deploying a single flow, most of the time is spent writing to the DB. This is fine when writing many flows in a single request, but when writing many flows across many requests, the time required for this process builds up. Additionally, due to the GIL, even having 1000 threads in the API threadpool doesn't help improve performance here; they are all competing for the GIL, so they just end up slowing each other down.
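
As a small aside, a self-contained illustration (plain pymongo, not flow_manager's actual persistence code) of why one request carrying many flows amortizes the DB write, while one flow per request pays a full round trip each time:

```python
from pymongo import InsertOne, MongoClient

client = MongoClient("mongodb://localhost:27017")  # illustrative connection
coll = client["demo"]["flows_demo"]
coll.drop()

flows = [{"priority": 10, "match": {"dl_vlan": vlan}} for vlan in range(1, 1001)]

# Many requests, one flow each: 1000 separate round trips to the DB.
for flow in flows:
    coll.insert_one(dict(flow))

# One request with many flows: a single bulk round trip.
coll.bulk_write([InsertOne(dict(flow)) for flow in flows])
```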

@viniarck viniarck self-requested a review July 29, 2024 14:31

@viniarck viniarck left a comment

@Ktmi, good analysis in the last comment, and great to see the rest of the instrumentation too, including for pyof pack. Good to know that pacing behaved as expected. I'll leave the PR approved; before merging, though, please address the points below:

1 - Add a pacer for coloring as an owner too
2 - Can you map and take care of a new task to document the highest number of flow mods per second that a single request (or event) can send for a single dpid without resulting in any issues?
3 - Solve changelog git conflict

The pacer will ensure that it won't get unstable. But we'll need this information here to have an idea of the ceiling value that we can try out, to fine-tune telemetry_int kytos-ng/telemetry_int#40. Sure, there are other variables involved here, but mininet and OvS locally with minimal delay in the control plane is fair game too. Having that information will be a prerequisite for fine-tuning telemetry_int, and then we'll know to never go too far overboard there.

@Ktmi Ktmi merged commit 2876238 into master Jul 31, 2024
1 check passed
@Ktmi Ktmi deleted the paced/flow_manager branch July 31, 2024 19:23