Paced Flow Manager #187
Conversation
Great. It turned out well to have all the pacers on flow_manager pacing per owner per dpid, with the default no_owner fallback. Nicely done.
I requested a few changes though. Also, the changelog hasn't been updated; this will land once the PRs replacing the pacers are ready to land too.
For the record, can you post a Wireshark FlowMod pps chart from your local tests? Just mef_eline is good enough for me, with the current fixed_window at 100/second and another run at 50/second. It'd be great to have that documented here too, so in the future it can serve as a reference for whoever is upgrading the library or whatever they might be looking for.
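For context, here is a minimal sketch of what a fixed_window pace of 100/second keyed per (owner, dpid) can look like using the `limits` package. This is an illustration only; whether the kytos core Pacer wraps `limits` exactly like this, and the bucket names used here, are assumptions.

```python
# Illustrative only: not the kytos core Pacer API.
from limits import parse
from limits.storage import MemoryStorage
from limits.strategies import FixedWindowRateLimiter

storage = MemoryStorage()
limiter = FixedWindowRateLimiter(storage)
rate = parse("100/second")  # same notation as the default pace discussed here


def can_send(owner: str, dpid: str) -> bool:
    """Return True if another FlowMod for this (owner, dpid) fits the window."""
    # Fall back to a shared "no_owner" bucket when the flow has no owner tag.
    owner = owner or "no_owner"
    return limiter.hit(rate, "flow_mod_pace", owner, dpid)
```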
@viniarck I think I noticed an odd quirk of the code, regarding
Yes, that's the intended behavior. It's for
Considering how the code has been structured historically, I'd say let's only pass an extra list of
@Ktmi, functionality-wise it's all set from what I reviewed. Very neat how this has been integrated per dpid, differentiating by flow owner. What's still pending before I approve:
- Changelog needs to be updated
- Can you post Wireshark IO graphs of flow mods per dpid on this PR documenting that `fixed_window` is working as expected? I'd like to see that once so it can serve as a historical record; if upstream ever changes, we can look back at the chart and compare. I'd like to see a producer going over the pacer rate, including sending to different dpids, and it should be expected that no connections are dropped due to the OF keepalive getting delayed on either side.
This is good; now go over the default configured rate (100/second), stressing over multiple dpids. The time to stress test this is now. I wanna see if it's stable with that value too. Probably no surprises, but let's confirm in practice. If that's solid too, this is good to get merged.
^ and then sustain it for a few mins; if no connections are dropped, that's great.
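A rough sketch of a producer for this kind of stress test, hammering several dpids with bulk requests at once. The REST endpoint and flow payload here are assumptions based on flow_manager's v2 API and may differ from the actual code:

```python
# Assumed endpoint/payload; adjust to the real flow_manager v2 REST API.
import requests

DPIDS = [f"00:00:00:00:00:00:00:0{i}" for i in range(1, 10)]
API = "http://127.0.0.1:8181/api/kytos/flow_manager/v2/flows/{dpid}"


def flows_for(n: int) -> list[dict]:
    """Build n distinct flows (varying dl_vlan) for one bulk request."""
    return [
        {
            "priority": 10,
            "match": {"in_port": 1, "dl_vlan": vlan},
            "actions": [{"action_type": "output", "port": 2}],
        }
        for vlan in range(1, n + 1)
    ]


for dpid in DPIDS:
    # One bulk request of 1000 flows per dpid; the pacer should throttle
    # the resulting FlowMods to the configured pace per switch.
    resp = requests.post(API.format(dpid=dpid), json={"flows": flows_for(1000)})
    resp.raise_for_status()
```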
So, here are four cases of deploying flows in flow_manager with a pace of 100 per second per switch. The blue line represents flow mods sent to a single switch, the green the flow mods sent to all switches. Each test is separated by a 10 second interval.

The first test is deploying 1000 flows in a single request to a single switch. This results in sending about 100 flow mods per second, as expected.

The second test is deploying 1 flow at a time, 1000 times, to a single switch. The results are fairly similar to the first test, with about 100 flow mods per second.

The third test is deploying 1000 flows at a time to 9 different switches simultaneously. Sending the flows took about 20 seconds, which means we are sending flow mods at about 50 per second in this test. Very strange, not sure why that happens. Additionally, this test took about 20 seconds to get started, despite the interval between tests supposedly being only 10 seconds. This indicates that a lot of time is being spent preparing the flow mods to be sent.

The fourth test is deploying 1 flow at a time, 1000 times, to 9 different switches. This was done with a few worker threads assigned to each switch to ensure that requests were fairly distributed and that pending requests for any given switch wouldn't block sending requests to others. The flow mod rate just plummets with this setup, with the rate to all switches hovering around 100 per second. This might just be a limit of how fast messages can be processed by the API.
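For reference, a rough sketch of the per-request pattern in cases 2 and 4, approximated here with one shared thread pool rather than a dedicated pool per switch; the endpoint and payload format are the same assumptions as in the bulk sketch above:

```python
# Assumed endpoint/payload; one flow per request, many requests per dpid.
from concurrent.futures import ThreadPoolExecutor

import requests

DPIDS = [f"00:00:00:00:00:00:00:0{i}" for i in range(1, 10)]
API = "http://127.0.0.1:8181/api/kytos/flow_manager/v2/flows/{dpid}"


def deploy_one(dpid: str, vlan: int) -> None:
    """POST a single flow to a single dpid."""
    flow = {
        "priority": 10,
        "match": {"in_port": 1, "dl_vlan": vlan},
        "actions": [{"action_type": "output", "port": 2}],
    }
    requests.post(API.format(dpid=dpid), json={"flows": [flow]}).raise_for_status()


# A few workers per switch so pending requests to one dpid don't starve
# the others (roughly case 4 above).
with ThreadPoolExecutor(max_workers=4 * len(DPIDS)) as pool:
    futures = [
        pool.submit(deploy_one, dpid, vlan)
        for dpid in DPIDS
        for vlan in range(1, 1001)
    ]
    for fut in futures:
        fut.result()
```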
@Ktmi, good insights. Let's further investigate cases 3 and 4 to better understand them. A few things that come to my mind, to quickly brainstorm with you, in case it helps:
So I ran the flow deployment tests again with the pacing code commented out, and here's what I got for results: it's remarkably similar to what we got from the tests with pacing. I also redid the tests with the APM and instrumented a good portion of the surrounding code. What I found is that when deploying a single flow, most of the time is spent writing to the DB. This is fine when writing many flows in a single request, but when writing many flows across many requests, the time spent on this process adds up. Additionally, due to the GIL, even having 1000 threads in the API threadpool doesn't improve performance here; they are all competing for the GIL, so they just end up slowing each other down.
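To illustrate the shape of the problem (not flow_manager's actual storage code), here is a minimal pymongo comparison of the two write patterns; the collection and document fields are made up:

```python
# Illustrative only: many small writes vs one bulk write.
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["demo"]["flows_demo"]
docs = [
    {"dpid": "00:00:00:00:00:00:00:01", "dl_vlan": vlan}
    for vlan in range(1, 1001)
]

# "Many requests" shape: one round trip (plus serialization under the GIL)
# per flow, so the per-write overhead accumulates 1000 times.
for doc in docs:
    coll.insert_one(dict(doc))

# "Single request" shape: the same 1000 flows in one bulk round trip.
coll.insert_many([dict(doc) for doc in docs])
```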
@Ktmi, good analysis in the last comment, and great to see the rest of the instrumentation too, including for pyof pack. Good to know that pacing behaved as expected. I'll leave the PR approved; before merging, though, please address the points below:
1 - Add a pacer for coloring as an owner too
2 - Can you map and take care of a new task to document, in a single request (or event), the highest number of flow mods per sec it can send to a single dpid without resulting in any issues?
3 - Solve changelog git conflict
The pacer will ensure that it won't get unstable. But we'll need this information here to have an idea of the ceiling value we can try when fine tuning telemetry_int
kytos-ng/telemetry_int#40. Sure, there are other variables involved, but measuring with mininet and OvS locally, with minimal delay in the control plane, is fair game too. Having that information will be a prerequisite for fine tuning telemetry_int, and then we'll know never to go too far overboard there.
Part of resolving kytos-ng/of_multi_table#27, kytos-ng/mef_eline#372, and kytos-ng/telemetry_int#47.
Summary
This adds a pacer to flow_manager, pacing flow mods sent to switches.
This does not fully resolve replacing batching, as it removes the control that NApps have over the process. Possible ways to fully resolve this are to implement pacing on the producer end, or to use the owner tag on stored flows to select the pacer of the NApp that requested the flow mod. Additionally, this allows setting the pace of flow mods on a per-owner basis. This feature is dependent on this PR for kytos.
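A minimal sketch, with hypothetical pace names, of the per-owner selection described above: look up a pace keyed by the stored flow's owner, falling back to a default when no dedicated pace is configured.

```python
# Hypothetical pace names for illustration; the real configuration keys differ.
DEFAULT_PACE = "send_flow_mods.no_owner"
OWNER_PACES = {
    "mef_eline": "send_flow_mods.mef_eline",
    "of_lldp": "send_flow_mods.of_lldp",
    "telemetry_int": "send_flow_mods.telemetry_int",
}


def pace_name_for(stored_flow: dict) -> str:
    """Pick which pace to hit based on the stored flow's owner tag."""
    return OWNER_PACES.get(stored_flow.get("owner"), DEFAULT_PACE)
```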
Local Tests
Tested creating and removing flows. Works as expected. Paces apply appropriately on a per-owner basis.
End-to-End Tests
Will need to rerun E2E tests