
feat: don't split header and body across TCP packets #168

Conversation

@stormshield-pj50 commented Jul 19, 2023

Currently, header and body of a frame are written to the underlying IO source in two steps. Depending on how eager the underlying stream is about flushing data, this can lead to two TCP packets being sent instead of header and body ending up in one.

By refactoring the internal implementation of Frame, we can reference it as a single byte slice and as a result, only use a single call to poll_write without additional allocations per frame.
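In rough terms, the change can be sketched like this (a simplified illustration, not the exact code in this PR; names are approximate):

```rust
const HEADER_SIZE: usize = 12; // length of an encoded Yamux header

/// Simplified sketch: a data frame owns one contiguous buffer in which the
/// first HEADER_SIZE bytes are the encoded header and the rest is the body.
struct Frame {
    buffer: Vec<u8>,
}

impl Frame {
    /// The whole frame as a single slice, so header and body can be handed
    /// to one `poll_write` call instead of two.
    fn as_bytes(&self) -> &[u8] {
        &self.buffer
    }

    /// The body is just a view into the same buffer; no extra allocation.
    fn body(&self) -> &[u8] {
        &self.buffer[HEADER_SIZE..]
    }
}
```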

@thomaseizinger (Contributor) commented Jul 19, 2023

Thank you for digging into this! This looks like a great improvement. Is there a reason why this needs to be configurable? We will always need to write a header so can't we just always reserve 12 bytes for it?

@thomaseizinger (Contributor)

Could we also achieve something similar by just changing Io to not write the header and body in two distinct steps? Like, can't we concatenate header and body and try to write them as one byte blob?

@thomaseizinger (Contributor)

> concatenate header and body and try to write them as one byte blob?

This would mean an additional allocation per frame which is not great. But, we could introduce a new constructor for Frame<Data> that gets initialized from a slice. Currently, AsyncWrite::poll_write for Stream allocates anyway for each Frame. If we move this allocation into the Frame type, we can directly allocate an additional 12 bytes and store the header alongside it.

For this to work, we'd also have to change how Io accesses the bytes to be written. All other frame types don't have a payload, so we need something like an "encode frame into byte vec" operation to use within Io::start_send.

Thoughts?
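For illustration, a minimal sketch of the single-allocation constructor described above (the function name and shape are assumptions, not the PR's API):

```rust
const HEADER_SIZE: usize = 12;

/// One allocation holds a zeroed placeholder for the header followed by a
/// copy of the body. The header is encoded into the first 12 bytes later,
/// and `Io::start_send` can then treat the result as one encoded frame.
fn encode_data_frame(body: &[u8]) -> Vec<u8> {
    let mut buffer = Vec::with_capacity(HEADER_SIZE + body.len());
    buffer.extend_from_slice(&[0u8; HEADER_SIZE]); // header placeholder
    buffer.extend_from_slice(body);
    buffer
}
```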

@stormshield-pj50 (Author)

> Thank you for digging into this! This looks like a great improvement. Is there a reason why this needs to be configurable? We will always need to write a header so can't we just always reserve 12 bytes for it?

You're right, we can assume that if we have enough space it must be 12 bytes, but in that case we have to expose header::HEADER_SIZE to the outside.

@thomaseizinger (Contributor)

> Thank you for digging into this! This looks like a great improvement. Is there a reason why this needs to be configurable? We will always need to write a header so can't we just always reserve 12 bytes for it?
>
> You're right, we can assume that if we have enough space it must be 12 bytes, but in that case we have to expose header::HEADER_SIZE to the outside.

I am not sure I fully understand what you mean?

Regardless, I think that #168 (comment) is an avenue worth exploring because it should avoid the need for any of these APIs by merging header + body into one chunk for data frames which should always result in one TCP packet, assuming it fits.

@stormshield-pj50 (Author)

> concatenate header and body and try to write them as one byte blob?
>
> This would mean an additional allocation per frame which is not great. But, we could introduce a new constructor for Frame<Data> that gets initialized from a slice. Currently, AsyncWrite::poll_write for Stream allocates anyway for each Frame. If we move this allocation into the Frame type, we can directly allocate an additional 12 bytes and store the header alongside it.
>
> For this to work, we'd also have to change how Io accesses the bytes to be written. All other frame types don't have a payload, so we need something like an "encode frame into byte vec" operation to use within Io::start_send.
>
> Thoughts?

So you propose to allocate directly in a new Frame constructor: allocate 12 extra bytes for the header, encode the header into the buffer, then copy the slice into the buffer at offset 12. Ultimately that means the Stream API set_reserved_header_size() becomes useless, doesn't it?

@thomaseizinger (Contributor)

> concatenate header and body and try to write them as one byte blob?
>
> This would mean an additional allocation per frame which is not great. But, we could introduce a new constructor for Frame<Data> that gets initialized from a slice. Currently, AsyncWrite::poll_write for Stream allocates anyway for each Frame. If we move this allocation into the Frame type, we can directly allocate an additional 12 bytes and store the header alongside it.
> For this to work, we'd also have to change how Io accesses the bytes to be written. All other frame types don't have a payload, so we need something like an "encode frame into byte vec" operation to use within Io::start_send.
> Thoughts?
>
> So you propose to allocate directly in a new Frame constructor: allocate 12 extra bytes for the header, encode the header into the buffer, then copy the slice into the buffer at offset 12. Ultimately that means the Stream API set_reserved_header_size() becomes useless, doesn't it?

Correct! Frame<Data> is the only variant that actually has a body so I think it makes more sense to embed this feature there. We probably need to slightly change the internals of Frame and the interaction within Io where we extract the data to be written to the socket.

I'd have to look into the details but off the top of my head I am thinking:

  • Expose a &[u8] from Frame
  • For Frame<Data>, we just expose the internal buffer (where header and body are concatenated)
  • For all other frames, we just expose the slice of the buffer
  • Io needs to keep track of how many bytes of the current frame it already wrote. We do something like this already I believe.

The last point makes sure we don't allocate again in Io but instead request the slice and index into it based on how many bytes we already wrote to the socket.
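A rough sketch of the bookkeeping described in the last point (type and field names are illustrative, not the actual Io internals):

```rust
use std::io;
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::AsyncWrite;

/// Keep the encoded frame plus an offset of how many bytes the socket has
/// already accepted, and index into the same buffer on every wake-up
/// instead of allocating again.
struct WriteState {
    buffer: Vec<u8>, // encoded frame (header followed by body)
    written: usize,  // bytes of `buffer` already written to the socket
}

impl WriteState {
    fn poll_write_remaining<T: AsyncWrite + Unpin>(
        &mut self,
        io: &mut T,
        cx: &mut Context<'_>,
    ) -> Poll<io::Result<()>> {
        while self.written < self.buffer.len() {
            let n = futures::ready!(
                Pin::new(&mut *io).poll_write(cx, &self.buffer[self.written..])
            )?;
            if n == 0 {
                return Poll::Ready(Err(io::ErrorKind::WriteZero.into()));
            }
            self.written += n;
        }
        Poll::Ready(Ok(()))
    }
}
```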

@stormshield-pj50 (Author)

> concatenate header and body and try to write them as one byte blob?
>
> This would mean an additional allocation per frame which is not great. But, we could introduce a new constructor for Frame<Data> that gets initialized from a slice. Currently, AsyncWrite::poll_write for Stream allocates anyway for each Frame. If we move this allocation into the Frame type, we can directly allocate an additional 12 bytes and store the header alongside it.
> For this to work, we'd also have to change how Io accesses the bytes to be written. All other frame types don't have a payload, so we need something like an "encode frame into byte vec" operation to use within Io::start_send.
> Thoughts?
>
> So you propose to allocate directly in a new Frame constructor: allocate 12 extra bytes for the header, encode the header into the buffer, then copy the slice into the buffer at offset 12. Ultimately that means the Stream API set_reserved_header_size() becomes useless, doesn't it?
>
> Correct! Frame<Data> is the only variant that actually has a body so I think it makes more sense to embed this feature there. We probably need to slightly change the internals of Frame and the interaction within Io where we extract the data to be written to the socket.
>
> I'd have to look into the details but off the top of my head I am thinking:
>
> * Expose a `&[u8]` from `Frame`
> * For `Frame<Data>`, we just expose the internal buffer (where header and body are concatenated)
> * For all other frames, we just expose the slice of the buffer
> * `Io` needs to keep track of how many bytes of the current frame it already wrote. We do something like this already I believe.
>
> The last point makes sure we don't allocate again in Io but instead request the slice and index into it based on how many bytes we already wrote to the socket.

OK, that makes sense. I will try to work on it. At the same time, I was wondering if we could avoid the copy of the buffer in poll_write() (or in Frame, as you suggest) by implementing a second API on Stream similar to the Sink trait: for instance, a start_send() method that takes ownership of a BytesMut and indicates whether there is some room at the start of the buffer for a Yamux header. With this we could avoid an allocation and a copy. What do you think?
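For illustration only, a hypothetical sketch of the kind of API being asked about here (none of these names exist in yamux; the caller builds the buffer with head room already reserved):

```rust
use bytes::{BufMut, BytesMut};

const HEADER_SIZE: usize = 12; // assumed Yamux header length

/// Caller side: reserve room for the header at the front of the buffer, so
/// the stream never has to allocate or copy the body.
fn make_send_buffer(payload: &[u8]) -> BytesMut {
    let mut buf = BytesMut::with_capacity(HEADER_SIZE + payload.len());
    buf.put_bytes(0, HEADER_SIZE); // placeholder for the Yamux header
    buf.put_slice(payload);
    buf
}

/// Hypothetical stream side: take ownership of the buffer, encode the header
/// into the reserved prefix, then write the whole thing in one go.
fn fill_header(mut buf: BytesMut, header: [u8; HEADER_SIZE]) -> BytesMut {
    buf[..HEADER_SIZE].copy_from_slice(&header);
    buf
}
```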

@mxinden (Member) commented Aug 2, 2023

Thank you @stormshield-pj50 for posting your findings here!

> Our software sees throughput gains of about 50% with this patch, as we are mostly sending small buffers with Yamux.

Can you add more details on the setup where you are seeing a 50% throughput increase? What is the latency and bandwidth between the two endpoints? Is it a LAN, WAN or routed over the Internet? How many TCP connections (or other flows) run concurrently? What is the lifetime of a stream on a single connection?

@stormshield-pj50 (Author)

> Thank you @stormshield-pj50 for posting your findings here!
>
> Our software sees throughput gains of about 50% with this patch, as we are mostly sending small buffers with Yamux.
>
> Can you add more details on the setup where you are seeing a 50% throughput increase? What is the latency and bandwidth between the two endpoints? Is it a LAN, WAN or routed over the Internet? How many TCP connections (or other flows) run concurrently? What is the lifetime of a stream on a single connection?

Hi Max,
Sorry for the late reply, I was on vacation!
Using libp2p + Yamux + rustls, we tunnel a single iperf3 TCP connection between two local docker containers. As we prefix each tunneled packet with its size in the Yamux stream, in the end we write 4 packets for one tunneled packet. The Yamux stream lives as long as the connection.

@stormshield-pj50 stormshield-pj50 force-pushed the yamux_reserved_header_size branch from 70f6777 to 1c06435 Compare September 6, 2023 14:54
@thomaseizinger (Contributor) left a comment

Thank you!

Couple of questions regarding ergonomics of these new APIs.

@mxinden (Member) left a comment

I am not yet convinced that this is a feature worth supporting. Keep in mind, I might be missing something.

> Using libp2p + Yamux + rustls, we tunnel a single iperf3 TCP connection between two local docker containers.

Do I understand correctly, that this is localhost networking only? Is this your primary use-case? If not, can you please provide numbers on a more realistic network environment (i.e. an environment with significant latency and bandwidth limitations)?

In my eyes this pull request introduces significant complexity. At this point I don't think this complexity is worth the improvement of a minor use-case (localhost networking) of this library.

@thomaseizinger (Contributor)

> In my eyes this pull request introduces significant complexity. At this point I don't think this complexity is worth the improvement of a minor use-case (localhost networking) of this library.

Is it really "significant"? I think we should be able to implement this such that it is almost all contained within Frame.

I think it makes perfect sense and is quite intuitive that a frame can be encoded into a single byte buffer. Not doing that is the weird thing, no?

I agree that we may need to polish the API a bit more but as long as it is contained, I am happy to accept this.

@stormshield-pj50 (Author)

> I am not yet convinced that this is a feature worth supporting. Keep in mind, I might be missing something.
>
> Using libp2p + Yamux + rustls, we tunnel a single iperf3 TCP connection between two local docker containers.
>
> Do I understand correctly, that this is localhost networking only? Is this your primary use-case? If not, can you please provide numbers on a more realistic network environment (i.e. an environment with significant latency and bandwidth limitations)?

The first aim of this change is to prevent sending two TCP packets on the wire each time some data is written to the Yamux stream (one packet for the Yamux header and one for the data). Thus we can save one useless standalone header packet, plus at least 14 (ETH) + 20 (IP) + 20 (TCP) = 54 bytes, for each chunk of data written to the stream. Take a look at the start of the PR for a packet capture that shows this. It also means fewer network syscalls and more scalable code.

Our local test is just for benchmarking purposes; however, I think that, with or without latency, the important thing is to limit the number of packets sent on the wire.

> In my eyes this pull request introduces significant complexity. At this point I don't think this complexity is worth the improvement of a minor use-case (localhost networking) of this library.

@thomaseizinger (Contributor)

@stormshield-pj50 I'll have a play with this API next week. Ideally, I'd like to encapsulate more of this in the Frame type. Currently the zero-copy parsing stuff leaks into many other components, which isn't very nice and makes things harder to understand.

Do you think we can refactor this such that it is completely an implementation detail of Frame?

@thomaseizinger (Contributor)

> @stormshield-pj50 I'll have a play with this API next week. Ideally, I'd like to encapsulate more of this in the Frame type. Currently the zero-copy parsing stuff leaks into many other components, which isn't very nice and makes things harder to understand.
>
> Do you think we can refactor this such that it is completely an implementation detail of Frame?

I think, if we can contain this optimisation mostly in Frame, then we have a lot less complexity to deal with and can maybe convince @mxinden to land this :)

@thomaseizinger (Contributor)

@stormshield-pj50 I pushed a commit that outlines a different approach for using zerocopy. It needs more work but I hope that it demonstrates how I'd like to contain the functionality within the Frame module. As you can see from the diff, connection.rs and other modules are pretty much unaffected.

Unfortunately, zerocopy doesn't handle enums very well as you've probably encountered yourself in the previous commits. They seem to be working on it but I am not sure when they'll land that work. For now, I think the workaround isn't too bad.

Let me know if you are willing to continue down this path :)
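To illustrate the kind of workaround meant here (a sketch based on the Yamux wire format and zerocopy 0.7 with the derive feature, not the commit itself): keep the header fields as raw integers so zerocopy's derives apply, and validate the type byte separately instead of modelling it as an enum.

```rust
use zerocopy::byteorder::{BigEndian, U16, U32};
use zerocopy::{AsBytes, FromBytes, FromZeroes};

/// Zerocopy-friendly header: only raw integer fields, because zerocopy's
/// derives do not support Rust enums. The `tag` byte is checked by a
/// separate validation step rather than being a `Type` enum.
#[derive(FromBytes, AsBytes, FromZeroes)]
#[repr(packed)]
struct RawHeader {
    version: u8,
    tag: u8,                   // 0 = Data, 1 = WindowUpdate, 2 = Ping, 3 = GoAway
    flags: U16<BigEndian>,     // fields are big-endian on the wire
    stream_id: U32<BigEndian>,
    length: U32<BigEndian>,
}
```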

@thomaseizinger (Contributor) left a comment

Left some more notes on what still needs work!

@stormshield-pj50 (Author)

> @stormshield-pj50 I pushed a commit that outlines a different approach for using zerocopy. It needs more work but I hope that it demonstrates how I'd like to contain the functionality within the Frame module. As you can see from the diff, connection.rs and other modules are pretty much unaffected.
>
> Unfortunately, zerocopy doesn't handle enums very well as you've probably encountered yourself in the previous commits. They seem to be working on it but I am not sure when they'll land that work. For now, I think the workaround isn't too bad.

Yes, zerocopy's poor enum handling is a pity. But their work-in-progress TryFromBytes could do the job, in theory.

> Let me know if you are willing to continue down this path :)

Containing zerocopy in the Frame module is a good idea! However, I won't be able to work on it this week, maybe next week. I'll finish the work and push a new commit. Thank you for your help!

@thomaseizinger (Contributor)

> Let me know if you are willing to continue down this path :)
>
> Containing zerocopy in the Frame module is a good idea! However, I won't be able to work on it this week, maybe next week. I'll finish the work and push a new commit. Thank you for your help!

No rush from my end! Let me know if you have any questions on the code!

@thomaseizinger thomaseizinger changed the title Stream can be configured with a reserved buffer size for data frame Don't split header and body across TCP packets Sep 19, 2023
@thomaseizinger (Contributor)

Note: Having this would be helpful in debugging the TCP RTT issue in libp2p/test-plans#304. At the moment, header and body are always split into two TCP packets which makes this rather cumbersome to follow along!

@thomaseizinger thomaseizinger requested a review from mxinden October 6, 2023 01:24
@thomaseizinger (Contributor)

I spent some more cycles on this and I think this is now a quite clean implementation. Would appreciate another review @mxinden! The implementation is now entirely contained within Frame so there isn't much complexity and it is all guarded by the type-system, in some places even more than before!

@thomaseizinger (Contributor)

@stormshield-pj50 Can you do me two final favours?

  1. Test this in your system to make sure it has the desired effect.
  2. Create a pull-request against rust-libp2p so we run our interoperability tests and check that we didn't break anything.

```rust
}

/// Decode a [`Header`] value.
pub fn decode(buf: &[u8; HEADER_SIZE]) -> Result<Header<()>, HeaderDecodeError> {
```
@stormshield-pj50 (Author)

I'm not convinced that the decoding should be done on the raw buffer: now that we have introduced zerocopy, we can work on the Header struct instead.

@thomaseizinger (Contributor)

It isn't really a good use of the type system to allow construction of types that are in an invalid state. Thus, until zerocopy can give us these guarantees, I'd rather have a dedicated decode function. This somewhat ensures that we are not working with a Header that has invalid values set.
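As an illustration of the kind of check such a decode function can do that a plain byte-cast cannot (the error variants and valid ranges below follow the Yamux spec, not this PR's code):

```rust
const HEADER_SIZE: usize = 12;

#[derive(Debug)]
enum HeaderDecodeError {
    Version(u8),
    Type(u8),
}

/// Reject headers with an unknown version or frame type before handing out
/// a typed `Header`, which a bare cast would happily accept.
fn validate(buf: &[u8; HEADER_SIZE]) -> Result<(), HeaderDecodeError> {
    if buf[0] != 0 {
        return Err(HeaderDecodeError::Version(buf[0]));
    }
    match buf[1] {
        0..=3 => Ok(()), // Data, WindowUpdate, Ping, GoAway
        t => Err(HeaderDecodeError::Type(t)),
    }
}
```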

@stormshield-pj50 (Author)

> @stormshield-pj50 Can you do me two final favours?
>
> 1. Test this in your system to make sure it has the desired effect.
> 2. Create a pull-request against `rust-libp2p` so we run our interoperability tests and check that we didn't break anything.

Sure! I will do it today.

@stormshield-pj50 (Author)

> @stormshield-pj50 Can you do me two final favours?
>
> 1. Test this in your system to make sure it has the desired effect.
> 2. Create a pull-request against `rust-libp2p` so we run our interoperability tests and check that we didn't break anything.
>
> Sure! I will do it today.

Integration tests PR in rust-libp2p: libp2p/rust-libp2p#4598

@thomaseizinger thomaseizinger changed the title Don't split header and body across TCP packets feat: don't split header and body across TCP packets Oct 9, 2023
@stormshield-pj50 (Author)

> @stormshield-pj50 Can you do me two final favours?
>
> 1. Test this in your system to make sure it has the desired effect.
> 2. Create a pull-request against `rust-libp2p` so we run our interoperability tests and check that we didn't break anything.
>
> Sure! I will do it today.

On our side, our CI tests are all OK.

@thomaseizinger (Contributor) left a comment

Good from my end but also needs @mxinden's approval.

@mxinden (Member) left a comment

Overall I am fine moving forward here. The rationale behind the change makes sense. Thank you for the continuous work.

While intuitively I expect this to have no performance drawback, I would still like to see some benchmarks. Would you mind:

```diff
-fn poll_new_outbound(&mut self, cx: &mut Context<'_>) -> Poll<Result<Stream>> {
+fn poll_new_outbound(&mut self, cx: &Context<'_>) -> Poll<Result<Stream>> {
```
@mxinden (Member)

Why diverge from the Future signature here? In other words, why remove the mut?

@thomaseizinger (Contributor)

This was triggered by a clippy lint. We can put it back, but given that it is an internal API, I wouldn't bother.

yamux/Cargo.toml

```toml
[dev-dependencies]
quickcheck = "1.0"
futures = "0.3.4"
```
@mxinden (Member)

Why is this dev-dependency needed? futures is already imported above using 0.3.12.

```rust
use self::header::HEADER_SIZE;

/// A Yamux message frame consisting of a single buffer with header followed by body.
/// The header can be zerocopy parsed into a Header struct by calling header()/header_mut().
```
@mxinden (Member)

Suggested change:

```diff
-/// The header can be zerocopy parsed into a Header struct by calling header()/header_mut().
+/// The header can be zerocopy parsed into a Header struct by calling `Header::header` and `Header::header_mut`.
```


```diff
 /// The message frame header.
-#[derive(Clone, Debug, PartialEq, Eq)]
+#[derive(Clone, PartialEq, Eq, FromBytes, AsBytes, FromZeroes)]
+#[repr(packed)]
```
@mxinden (Member)

With this attribute the struct is no longer aligned to the target machine architecture. Do I understand correctly that we don't expect many accesses to Header fields and thus for this not to have a performance impact?

@thomaseizinger (Contributor)

This is necessary for zerocopy to do its job so I don't think we have a choice unless we want to trade zerocopy for more allocations.
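For context, a small sketch of what the representation buys (field layout per the Yamux spec; the assertions are illustrative, not from the PR):

```rust
/// With `packed` the struct has alignment 1 and no padding, so it occupies
/// exactly the 12 wire bytes and can be overlaid at any offset of a receive
/// buffer without alignment errors. Field reads may be unaligned, which is
/// the (minor) cost being discussed here.
#[repr(packed)]
struct Header {
    version: u8,
    tag: u8,
    flags: u16,
    stream_id: u32,
    length: u32,
}

const _: () = assert!(std::mem::size_of::<Header>() == 12);
const _: () = assert!(std::mem::align_of::<Header>() == 1);
```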

@thomaseizinger (Contributor)

> • Running the concurrent.rs benchmarks, especially the unconstrained one, comparing this patch with latest master, e.g. using critcmp?

I'd take these with a grain of salt. When I worked on the refactoring to poll APIs, I couldn't get really consistent numbers out of repeated benchmark runs without code changes, i.e. it was hard to establish a baseline. Plugging it into your perf setup might make more sense.

@stormshield-pj50 (Author)

Here are the unconstrained benchmarks compared between master and this PR:

[critcmp output]

@thomaseizinger (Contributor)

Great results! Looks like we actually measurably improved performance :)

@stormshield-pj50 (Author)

Results from our iperf3 CI test:

```
With Yamux master:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   324 MBytes   272 Mbits/sec   26             sender
[  5]   0.00-10.05  sec   322 MBytes   269 Mbits/sec                  receiver

With Yamux PR:
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec   831 MBytes   697 Mbits/sec   32             sender
[  5]   0.00-10.04  sec   829 MBytes   693 Mbits/sec                  receiver
```

@thomaseizinger (Contributor)

We just discussed this in the open maintainers call and @umgefahren made a really good suggestion: Instead of doing all this work with zerocopy, we could just use AsyncWrite::poll_write_vectored. That would allow us to pass two buffers down to the underlying implementation.

It is unfortunate that this idea only comes up now but I think it is well worth exploring as it would avoid the complexity of zerocopy.
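For reference, a rough sketch of what the vectored-write alternative could look like (simplified; partial writes would still need to be handled by the caller):

```rust
use std::io::{self, IoSlice};
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::AsyncWrite;

/// Hand header and body to the underlying stream as two `IoSlice`s in a
/// single call; a stream with a real `poll_write_vectored` implementation
/// can then coalesce them into one TCP segment without any copying.
fn poll_send_frame<T: AsyncWrite + Unpin>(
    io: &mut T,
    cx: &mut Context<'_>,
    header: &[u8; 12],
    body: &[u8],
) -> Poll<io::Result<usize>> {
    let bufs = [IoSlice::new(header), IoSlice::new(body)];
    Pin::new(io).poll_write_vectored(cx, &bufs)
}
```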

@stormshield-pj50 (Author)

> We just discussed this in the open maintainers call and @umgefahren made a really good suggestion: Instead of doing all this work with zerocopy, we could just use AsyncWrite::poll_write_vectored. That would allow us to pass two buffers down to the underlying implementation.
>
> It is unfortunate that this idea only comes up now but I think it is well worth exploring as it would avoid the complexity of zerocopy.

You can't be sure that poll_write_vectored() is well implemented on the underlying stream, as the default implementation just takes the first non-empty buffer and calls poll_write() on it. I tested it a couple of weeks ago before submitting this PR: it was working with a TcpStream but not with a TlsStream on top of it.
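For context, the default implementation being referred to behaves roughly like this (a paraphrase of the futures AsyncWrite default, written as a free function):

```rust
use std::io::{self, IoSlice};
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::AsyncWrite;

/// Roughly what the default `poll_write_vectored` does: forward the first
/// non-empty buffer to `poll_write` and ignore the rest, which is why header
/// and body still go out as two separate writes unless the stream overrides it.
fn default_write_vectored<T: AsyncWrite + Unpin>(
    io: &mut T,
    cx: &mut Context<'_>,
    bufs: &[IoSlice<'_>],
) -> Poll<io::Result<usize>> {
    for buf in bufs {
        if !buf.is_empty() {
            // Only the first non-empty buffer is written; the others are ignored.
            return Pin::new(&mut *io).poll_write(cx, buf);
        }
    }
    Pin::new(io).poll_write(cx, &[])
}
```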

@thomaseizinger (Contributor)

> We just discussed this in the open maintainers call and @umgefahren made a really good suggestion: Instead of doing all this work with zerocopy, we could just use AsyncWrite::poll_write_vectored. That would allow us to pass two buffers down to the underlying implementation.
> It is unfortunate that this idea only comes up now but I think it is well worth exploring as it would avoid the complexity of zerocopy.
>
> You can't be sure that poll_write_vectored() is well implemented on the underlying stream, as the default implementation just takes the first non-empty buffer and calls poll_write() on it. I tested it a couple of weeks ago before submitting this PR: it was working with a TcpStream but not with a TlsStream on top of it.

But isn't that just a shortcoming of the TlsStream then? Shouldn't we patch those impls instead of making this more complex?

@stormshield-pj50 (Author)

> We just discussed this in the open maintainers call and @umgefahren made a really good suggestion: Instead of doing all this work with zerocopy, we could just use AsyncWrite::poll_write_vectored. That would allow us to pass two buffers down to the underlying implementation.
> It is unfortunate that this idea only comes up now but I think it is well worth exploring as it would avoid the complexity of zerocopy.
>
> You can't be sure that poll_write_vectored() is well implemented on the underlying stream, as the default implementation just takes the first non-empty buffer and calls poll_write() on it. I tested it a couple of weeks ago before submitting this PR: it was working with a TcpStream but not with a TlsStream on top of it.
>
> But isn't that just a shortcoming of the TlsStream then? Shouldn't we patch those impls instead of making this more complex?

On our side we won't spend more time on this PR. We think the implementation is good enough:

  • zerocopy is a mainstream crate designed by Google and used by many Fuchsia OS devices.
  • Internally, zerocopy is just a safe cast with an endianness byte swap when needed.
  • The PR's solution is self-contained, and we don't think it's a good idea to rely on an underlying stream's implementation of poll_write_vectored(), which, by the way, is optional in the specification.
  • We have been using this code for months in our code base and it's working fine, with improved performance as shown earlier in this discussion.

@thomaseizinger (Contributor)

Which TLS library are you using?

@stormshield-pj50 (Author)

> Which TLS library are you using?

tokio-rustls

@thomaseizinger (Contributor)

> We just discussed this in the open maintainers call and @umgefahren made a really good suggestion: Instead of doing all this work with zerocopy, we could just use AsyncWrite::poll_write_vectored. That would allow us to pass two buffers down to the underlying implementation.
> It is unfortunate that this idea only comes up now but I think it is well worth exploring as it would avoid the complexity of zerocopy.
>
> You can't be sure that poll_write_vectored() is well implemented on the underlying stream, as the default implementation just takes the first non-empty buffer and calls poll_write() on it. I tested it a couple of weeks ago before submitting this PR: it was working with a TcpStream but not with a TlsStream on top of it.
>
> But isn't that just a shortcoming of the TlsStream then? Shouldn't we patch those impls instead of making this more complex?
>
> On our side we won't spend more time on this PR. We think the implementation is good enough:
>
> • zerocopy is a mainstream crate designed by Google and used by many Fuchsia OS devices.
> • Internally, zerocopy is just a safe cast with an endianness byte swap when needed.
> • The PR's solution is self-contained, and we don't think it's a good idea to rely on an underlying stream's implementation of poll_write_vectored(), which, by the way, is optional in the specification.
> • We have been using this code for months in our code base and it's working fine, with improved performance as shown earlier in this discussion.

Thank you for writing down your viewpoints. zerocopy is definitely a well-designed library and I am not disputing its production-readiness. I do want to stress, though, that it isn't intrinsically necessary for this feature.

If we didn't care about reduced allocations, we could easily keep the Frame type as is and simply create a new buffer and write that in a single pass to the underlying stream. The only thing that zerocopy gives us is reducing this allocation. At the same time, we currently have to implement hacks because zerocopy doesn't allow for any validations during parsing.

Unfortunately, despite being familiar with AsyncWrite, it completely slipped my mind that we could use poll_write_vectored to achieve the exact same thing without taking on another dependency and deviating from what would be the "natural" design of Frame when ignoring performance optimisations.

I cannot follow your reasoning why we shouldn't rely on AsyncWrite::poll_write_vectored. In fact, I find using that is a very clean solution. The abstraction is specifically there to optimise performance when it is possible to write multiple buffers at the same time. tokio-rustls doesn't override that in its AsyncWrite implementation but I would be very surprised if they wouldn't accept a patch to do that.

Overall, I am hesitant to merge this as is knowing that there is an easier and more maintainable solution. Our time is limited and we need to carefully consider, what we take on as maintenance responsibility.

Thank you for iterating on this solution with me. I wish we would have talked about AsyncWrite::poll_write_vectored earlier before going down this pathway. I accept that you don't want to put any more time into it. I hope you understand that we don't want to take on more maintenance burden than necessary.

If I find some free time next week, I might check how hard it would be to change to AsyncWrite::poll_write_vectored. I have a feeling it would be quite easy!

@thomaseizinger (Contributor)

> tokio-rustls doesn't override that in its AsyncWrite implementation but I would be very surprised if they wouldn't accept a patch to do that.

I've opened an issue with them to discuss the implementation: rustls/tokio-rustls#26. I think that properly implementing this could be a nice optimization that the entire async rustls ecosystem can benefit from, which is another reason why we shouldn't build a specialised implementation for this problem here in rust-yamux.

@stormshield-pj50 (Author)

> tokio-rustls doesn't override that in its AsyncWrite implementation but I would be very surprised if they wouldn't accept a patch to do that.
>
> I've opened an issue with them to discuss the implementation: rustls/tokio-rustls#26. I think that properly implementing this could be a nice optimization that the entire async rustls ecosystem can benefit from, which is another reason why we shouldn't build a specialised implementation for this problem here in rust-yamux.

OK, that's fine for us. However, if you want to cover the whole libp2p-related Tokio ecosystem, you should also take a look at tokio-tungstenite WebSockets and at Yamux over Yamux as used by libp2p relay_v2.

@thomaseizinger (Contributor)

I am closing this as stale.

For reference, I started working on correctly forwarding vectored writes but ran out of time. Here are the drafts / issues:
