Proper handling of outgoing data channel messages (buffering, large messages) #133
I tried to measure the amount of time it takes to process each conference message in the main loop to see which of them takes the most time. In a conference with 4 participants, the majority of the messages take less than 1ms to handle. However, the same message may, under certain circumstances, take more time to get processed. I logged all occurrences where a message took more than 1ms to handle and sorted them in descending order:
It's interesting that some occurrences of the same event take significantly different amounts of time to be processed, which is a bit odd. Take for instance:
The discrepancy is rather large. The same could be said for the metadata update, which consists of generating metadata for each participant and then sending it to them:
Mostly in microseconds, but it takes milliseconds at times. It's also interesting to see a large discrepancy in handling a new published track:
Most likely caused by the fact that … Interestingly enough, sending the data channel message alone did not have that big of a footprint (which makes #136 of questionable utility). One obvious improvement would be to eliminate the need to send the whole state of the stream/track metadata to the client and only share the changes. Also, we currently construct a map of the metadata each time we want to send it. It would be more efficient to maintain the map for each participant separately and update it on change, instead of allocating and releasing the memory for the map each time we want to send it.
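A rough sketch of what that could look like (the type and field names here are hypothetical, not the actual SFU code): each participant keeps its own metadata map that is mutated in place when a track changes, and only the changed entries are marshalled and sent.

```go
package conference

import "encoding/json"

// TrackMetadata is a hypothetical stand-in for the per-track metadata
// the SFU currently re-generates on every update.
type TrackMetadata struct {
	StreamID string `json:"stream_id"`
	Kind     string `json:"kind"`
	Muted    bool   `json:"muted"`
}

// ParticipantView holds the metadata map that is maintained per participant
// and updated in place, instead of being rebuilt for every send.
type ParticipantView struct {
	metadata map[string]TrackMetadata // full state, keyed by track ID
	pending  map[string]TrackMetadata // entries changed since the last send
}

func NewParticipantView() *ParticipantView {
	return &ParticipantView{
		metadata: make(map[string]TrackMetadata),
		pending:  make(map[string]TrackMetadata),
	}
}

// Update records a change and remembers it as part of the next diff.
func (v *ParticipantView) Update(trackID string, md TrackMetadata) {
	v.metadata[trackID] = md
	v.pending[trackID] = md
}

// Diff marshals only the entries that changed since the last call,
// rather than the whole metadata state.
func (v *ParticipantView) Diff() ([]byte, error) {
	if len(v.pending) == 0 {
		return nil, nil
	}
	payload, err := json.Marshal(v.pending)
	if err != nil {
		return nil, err
	}
	v.pending = make(map[string]TrackMetadata)
	return payload, nil
}
```

How the client applies the diff (as a patch vs. merged into a full-state message) is a separate protocol question; the point is only to avoid rebuilding and re-marshalling the whole map for every participant on every change.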
This is really interesting, thanks for doing this. This could be another argument for putting media info in state events, so that only the short-term metadata needs to go via the SFU. It would be slightly ironic if an SFU designed to forward large numbers of high-res video streams got bottlenecked on JSON marshalling, so moving all of the signaling code into goroutines separate from the media path definitely sounds like a plan (FWIW, in a past media stack we had a model with two threads: one dedicated to moving media around and another for everything else).
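A rough illustration of that split, assuming a channel-based hand-off (none of these names come from the actual codebase): all JSON marshalling and data channel I/O lives on a signaling goroutine, and the media loop only enqueues work without waiting for it.

```go
package conference

import "log"

// signalJob is a hypothetical unit of signaling work, e.g. a metadata
// update that still needs to be marshalled and written to a data channel.
type signalJob struct {
	participantID string
	payload       []byte
}

// runSignaling runs on its own goroutine: blocking on marshalling or on
// data channel writes here never stalls the media-forwarding goroutine.
func runSignaling(jobs <-chan signalJob, send func(participantID string, payload []byte) error) {
	for job := range jobs {
		if err := send(job.participantID, job.payload); err != nil {
			log.Printf("signaling send to %s failed: %v", job.participantID, err)
		}
	}
}

// The media loop would then hand work off without ever waiting for it:
//
//	select {
//	case signalJobs <- job: // picked up by runSignaling
//	default:                // signaling is behind; count/drop rather than stall media
//	}
```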
Yeah, for sure 🙂 Though it seems like it does not really bottleneck there, i.e. I could imagine that even if it takes 100ms to marshal large requests, it should not behave as oddly as we observed last time, so I'm still trying to understand what happens when we go past 20 participants. Also, we could drastically reduce the time to generate and send metadata if we sent only the diffs/changes to the other side and maintained the map of the metadata for each participant separately, to avoid generating it every time we need to send an update. Storing it in the room as state events is another possibility, that's right. In that case, the SFU would only share the media-to-mid relation and wouldn't deal with metadata propagation at all.
Context

I used https://github.com/vector-im/ec-cli-rust to load the SFU and, after several tests, was able to reproduce the issue where, after reaching the limit of 20 participants, things start to go awry (i.e. new participants can't see the majority of the other participants and the whole conference gets extremely unstable). The majority of the requests and metadata updates are handled very promptly. Even when they're big, they may take a dozen milliseconds, but they are still not noticeable enough to cause the session to freeze (they would likely become a problem with 100+ participants though), so it's not the blocking operations in the main loop of the conference that hindered us.

Checking and adding logs

I've observed the following odd things in the logs though:
I noticed that we lacked some logging in the data channels (specifically, errors that were returned but never logged), so I added it and repeated the tests. That helped me pinpoint the issue.

Final results

Notably, there are 2 things that regularly go wrong once the conference gets larger, and that go very wrong when we reach 20 participants or so:
Alternatively, we could contribute to Pion and implement the SCTP extension that supports an explicit EOR (End-Of-Record) flag, which would lift the 64 KiB limitation. FWIW: while researching (2), I checked the relevant sections of the RFC, and it recommends not going above 16 KiB per message in order to stay compatible with all browsers (historically, browsers implemented data channels differently and handled larger messages in different ways, so the behavior may be a bit inconsistent). I wonder if this might be related to #106, as handling data channel messages is one of the things where Firefox and Chrome behaved a bit differently in the past.
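Until either of those lands, one workaround on our side would be to fragment large messages ourselves before handing them to the data channel. A rough sketch of application-level chunking, using a hypothetical framing of our own (this is not an existing Pion or SFU API):

```go
package conference

// maxChunkSize follows the interop recommendation of staying at or
// below 16 KiB per data channel message.
const maxChunkSize = 16 * 1024

// chunk is a hypothetical framing for one fragment of a larger message;
// the receiver reassembles fragments with the same ID in Index order.
type chunk struct {
	ID    string `json:"id"`    // identifies the original message
	Index int    `json:"index"` // position of this fragment
	Total int    `json:"total"` // total number of fragments
	Data  []byte `json:"data"`
}

// splitMessage cuts an arbitrarily large payload into <=16 KiB fragments.
func splitMessage(id string, payload []byte) []chunk {
	total := (len(payload) + maxChunkSize - 1) / maxChunkSize
	chunks := make([]chunk, 0, total)
	for i := 0; i < total; i++ {
		start := i * maxChunkSize
		end := start + maxChunkSize
		if end > len(payload) {
			end = len(payload)
		}
		chunks = append(chunks, chunk{ID: id, Index: i, Total: total, Data: payload[start:end]})
	}
	return chunks
}
```

In practice the 16 KiB budget would need some headroom for the framing itself (and for base64 expansion if the fragments are JSON-encoded), and the receiving side needs a matching reassembly step.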
It looks like sending a message on a data channel is a blocking operation (due to the I/O involved in it); in my non-scientific tests, sending a message on a data channel sometimes took 30ms. That's not a problem in the majority of cases, but when the conference gets large and participants exchange messages over the data channel, it can delay the processing of incoming To-Device messages significantly.
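A minimal sketch of how that blocking send could be taken off the main loop, assuming Pion's webrtc.DataChannel (the queue size and names are made up for illustration): each participant gets a bounded outgoing queue drained by its own goroutine, so a slow Send only delays that participant's sender, not the conference loop.

```go
package conference

import (
	"log"

	"github.com/pion/webrtc/v3"
)

// sender owns the outgoing data channel traffic for one participant so
// that a slow, blocking Send never stalls the main conference loop.
type sender struct {
	dc    *webrtc.DataChannel
	queue chan []byte
}

func newSender(dc *webrtc.DataChannel) *sender {
	s := &sender{
		dc:    dc,
		queue: make(chan []byte, 64), // bounded; size is arbitrary for the sketch
	}
	go s.run()
	return s
}

// Enqueue is called from the main loop and never blocks. If the queue is
// full the message is dropped (it could be counted or coalesced instead).
func (s *sender) Enqueue(msg []byte) bool {
	select {
	case s.queue <- msg:
		return true
	default:
		return false
	}
}

// run drains the queue on its own goroutine; blocking inside Send is fine here.
func (s *sender) run() {
	for msg := range s.queue {
		if err := s.dc.Send(msg); err != nil {
			log.Printf("data channel send failed: %v", err)
		}
	}
}
```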
UPD: