-
Notifications
You must be signed in to change notification settings - Fork 476
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change stream list to arrays and optimize stream lookup with SSE2 #508
base: main
Are you sure you want to change the base?
Conversation
a4c8d6c
to
f911812
Compare
Hi, improving the performance of this list and associated lookup has been our list of things to do for a while. |
10-15 is a good estimate for our current common case, where we use 1 audio section (1 local and 1 remote SSRC) and up to 2 video sections with 2 or 3 simulcast substreams (1 media and 1 RTX SSRC per substream per side) in a bundle. We are planning to create more media sections, ideally bound by the number of participants in a conference, which is where the number of SSRCs may grow N-fold. We also plan to add support for FlexFEC, which will add +1 SSRC per video substream per side. |
I am doing media server cascading, and I can have easily more than 100 ssrcs in the same bundle. This would be a great boost for my use case. |
f911812
to
c11dcf3
Compare
Rebased on top of the current master. |
Awesome contribution! Definitely super helpful contribution for conference servers. I'm wondering what people think if this change should be a compile time option only. Because I assume that client side folks, e.g. the browsers and others, are not keen on a performance hit, even if it's not a big one. So the idea would be that at compile time you could choose between optimizations for low amount of RTP streams vs high amount (which hopefully should translate to client side vs server side). |
how about splitting the PR in two:
Even if the second one is not merged, server devs could easily provide their own implementation of the lookup functions |
I don't think a compile-time option is worth the complication considering that the lookup among a few SSRCs is super fast however you implement it. Browsers have other, much more fruitful areas of optimizations. Consider also that libraries such as libsrtp are usually built in one variant for system packaging, and that library may be used by any kind of application, server or client. So I would prefer the code base to remain unified and optimal. However, if there are code reorganising or other requests from libsrtp maintainers that will help getting this merged, please let me know. |
Does not look like this question was asked: what is the change (speedup, if any) for non-SSE2 implementation? I am not sure if @murillo128 intent was that but being able to add additional optimized implementations does make sense In any case - looking forward to this change merging! |
@eakraly The original description shows scalar performance on x86-64 platform. I don't have an ARM test system, but there is also a benchmark attached that you can run on your system. As for optimizations for ARM/aarch64, those would be nice. I suppose, it should be doable, but you will probably have a problem with implementing PMOVMSKB in NEON or SVE, as there isn't a direct equivalent instruction. In any case, I don't have enough experience with NEON and can't test it anyway, so someone else will have to implement it in a separate PR. |
that's why I think splitting the PR in two, one for creating an interface for ssrc lookup and another for adding the sse2 logic would be helpful. |
I can't separate the interface change from the implementation, as the interface implies the array-based implementation (i.e. indexes are used everywhere instead of pointers). IOW, I can't wrap the old implementation in the new interface. The SSE2 part of the PR is actually minor, most of it is plain C implementing the new container and interacting with it. And the maintainers haven't expressed opinion if this approach is acceptable at all, SSE2 or not, or whether there are any organizational or technical changes needed. It looks like they just aren't interested. |
@Lastique I wouldn´t say uninterested but maybe unresponsive. We did discuss this and the other PR at some point and had some concerns around having a "splattering" of SSE2 in the code base. I understand the current solution is not optimal for a larger set of SSRC´s so we need a solution here. I will try to follow up this PR in the coming weeks. Would be good though if there is someone else that can help review the SSE2 part of the PR. I do like the idea of abstracting the list in to a new type and as @murillo128 suggested placing it in a separate file so it can easily be tested and support different implementations if needed. Also keeps the srtp.c file free of non SRTP details. |
0cec15e
to
014d371
Compare
I've rebased onto the current master and:
I'm still publishing this as a single PR because there is not much incentive to merge the generic array-based implementation alone. The main performance winnings come from SSE2. Also, SSE2 code depends on the previous commit that implements the generic version. However, separate commits make it easier to review the generic implementation and SSE2 separately. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good initial work, but IMHO the split between the srtp.c and the stream_list is not clear.
We need to add an implementation of the stream list that is non-SSE based (i.e. the current linked list one).
The main missing piece are iterator functions to not have to rely on pos
and size
where iterating through the lists
The old linked list implementation is incompatible with the interface required by the SSE2 implementation and therefore is removed. The only generic (non-SSE) implementation that is proposed in this PR is the array-based one. |
014d371
to
810e795
Compare
Sure. I'm not arguing against the system packagers choose what ever they think is best for their systems. None of the browsers uses system level libraries though. They all compile their own versions into the browsers. That is why I was asking for a compile time option, and I would have no problem if the new default is to compile with the SSE2 on by default, as long as vendors who can't ship with a dependency on SSE2 support have the ability to opt out of it. |
Chrome and Firefox require SSE2 for many years now (since 2014 and 2016, respectively). As does Windows 8 and later. Really, x86 processors without SSE2 belong to a museum now. |
I should also note that SSE2 is not mandatory with the proposed patch. If you disable SSE2 through the compiler switches (e.g. |
As discussed in this thread, we have extracted the current stream list implementation and create an internal api + default implementation in #612 |
1235205
to
b087f46
Compare
Rebased on top of libsrtp 2.5.0. The implementation replaces the default linked list based implementation. As I noted before, the generic interface is not suitable for efficient implementation of this array-based list. In particular, position and list size need to be exposed to the calling code to avoid double stream lookup in This patch does not attempt to fix the generic list interface, as I find this rather pointless. Instead, it optimizes the case when the internal (now - array-based) list implementation is used, which will be the majority of use cases, I think. If the generic list interface is to be dropped (which, IMHO, is the right thing to do), I can send another patch that removes |
BTW, the patch now includes #643 fix. |
b087f46
to
ef519c1
Compare
The stream list in the SRTP context is now implemented with two arrays: an array of SSRCs and an array of pointers to the streams corresponding to the SSRCs. The streams no longer form a linked list. Stream lookup by SSRC is now performed over the array of SSRCs, which is considerably faster because it is more cache-friendly. Additionally, the lookup is optimized for SSE2, which provides an additional massive speedup with many streams in the list. Although the lookup still has linear complexity, its absolute times are reduced and with tens to hundreds elements are lower or comparable with a typical rb-tree equivalent. Expected speedup of SSE2 version over the previous implementation: SSRCs speedup (scalar) speedup (SSE2) 1 0.39x 0.22x 3 0.57x 0.23x 5 0.69x 0.62x 10 0.77x 1.43x 20 0.86x 2.38x 30 0.87x 3.44x 50 1.13x 6.21x 100 1.25x 8.51x 200 1.30x 9.83x These numbers were obtained on a Core i7 2600K. At small numbers of SSRCs the new algorithm is somewhat slower, but given that the absolute and relative times of the lookup are very small, that slowdown is not very significant.
ef519c1
to
b6186f4
Compare
|
I wonder how much faster your code can be using intrinsics functions? |
We're currently using (patched) libsrtp 2.5.0, and I haven't updated the PR to main yet. Not sure when I'll have time to update the PR, most likely when a new libsrtp with the new array-based stream list is released. I'll see if I can do it earlier. From the brief look at #674, it looks quite similar to the scalar version of the stream list proposed here. The main difference is that this PR stores SSRCs in a separate array, which allows for better cache locality during lookup and, of course, is necessary for vectorized lookup. Another, admittedly minor difference is that this PR uses nomenclature that is standard in C++, i.e. I'll update the performance test when I update the PR. |
That makes total sense. I've made the corresponding PR |
The stream list in the SRTP context is now implemented with two arrays: an array of SSRCs and an array of pointers to the streams corresponding to the SSRCs. The streams no longer form a single linked list.
Stream lookup by SSRC is now performed over the array of SSRCs, which is considerably faster because it is more cache-friendly. Additionally, the lookup is optimized for SSE2, which provides an additional massive speedup with many streams in the list. Although the lookup still has linear complexity, its absolute times are reduced and with tens to hundreds elements are lower or comparable with a typical rb-tree equivalent.
Expected speedup of SSE2 version over the previous implementation:
At small numbers of SSRCs the new algorithm is somewhat slower, but given that the absolute and relative times of the lookup are very small, that slowdown is not very significant.
The performance benchmark is written in C++ and is attached below, along with test results on a Core i7 2600K (running Ubuntu 20.04, gcc 9.3). The test emulates the current implementation by using
std::list
, and estimates a potential RB-tree implementation performance by usingstd::set
.This PR is somewhat related to #507. It relies on the change to
config_in_cmake.h
made in that PR so that the SSE2 code is enabled for MSVC. Without it MSVC will use scalar code. In all other respects this PR is self-contained (i.e. non-MSVC compilers will use SSE2 intrinsics code).This PR is related to #417. Although it doesn't change the formal O(N) complexity, it does reduce lookup times significantly and shows comparable or better performance than
std::set
(see the attached test results).ssrc_map_test.tar.gz