Avoid some "Vec"s in request sending #846
Conversation
The code for the serialising function is now first generated and saved into a temporary string. This string is then added to the output. No change in behaviour occurs. Signed-off-by: Uli Schlachter <[email protected]>
Before this commit, serialising a request returns a BufWithFds<PiecewiseBuf<'some_lifetime>>, which expands to (Vec<Cow<'some_lifetime, [u8]>>, Vec<RawFdContainer>). That's a lot of Vecs, and most of them are not necessary since we statically know how many "pieces" a request needs.
Side note: Why not just use a single Vec<u8>? There are some situations where we can avoid copying some data. Think e.g. about PutImage.
This commit removes PiecewiseBuf (which is a Vec) and instead makes the code generator use an array. This is possible since we statically know the number of pieces for request sending. The benefit is that this avoids a temporary allocation for the Vec.
Signed-off-by: Uli Schlachter <[email protected]>
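To make the shape of this change concrete, here is a minimal sketch; the types below are simplified stand-ins for x11rb's real definitions, and the single-piece request is just an example:

use std::borrow::Cow;

// Stand-in for x11rb's RawFdContainer (file descriptors sent with a request).
struct RawFdContainer;

// Before this commit: a Vec of pieces, allocated even though the piece count
// is statically known for each request.
type PiecewiseBuf<'a> = Vec<Cow<'a, [u8]>>;
type OldSerialized<'a> = (PiecewiseBuf<'a>, Vec<RawFdContainer>);

// After this commit: a fixed-size array of pieces, so no Vec allocation is
// needed for the byte pieces themselves.
type NewSerialized<'a> = ([Cow<'a, [u8]>; 1], Vec<RawFdContainer>);

// What a generated serialize() for a single-piece request might return.
fn serialize_one_piece<'a>(bytes: Vec<u8>) -> NewSerialized<'a> {
    ([Cow::Owned(bytes)], Vec::new())
}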
Before this commit, the request functions in x11rb and x11rb-async contained code like this:
let (bytes, fds) = request0.serialize(major_opcode(conn).await?);
let slices = bytes.iter().map(|b| IoSlice::new(b)).collect::<Vec<_>>();
conn.send_request_without_reply(&slices, fds).await
Here, "bytes" is an array (thanks to the previous commit) and each of its entries is wrapped in an IoSlice. The result is saved in a Vec.
This commit replaces the above with the following code:
let (bytes, fds) = request0.serialize(major_opcode(conn).await?);
let slices = [IoSlice::new(&bytes[0])];
assert_eq!(slices.len(), bytes.len());
conn.send_request_without_reply(&slices, fds).await
Instead of a temporary allocation with a Vec, another array is created. I found no generic way to create this array, so instead the code generator "knows" what the length of the array must be and repeats each case as necessary. Just to double-check that we got this right, an assertion is added to check that "bytes" and "slices" have the same length. Since both are arrays, their length is a constant and this assertion can hopefully be optimised away by the compiler.
Signed-off-by: Uli Schlachter <[email protected]>
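Since the generator repeats the IoSlice construction once per piece, a request with two pieces presumably ends up with code along these lines (a sketch, with a placeholder function and Vec<u8> pieces standing in for the generated types):

use std::io::IoSlice;

// Hypothetical two-piece request body; in the real generated code, `bytes`
// comes from request0.serialize(...) and its length is known to the generator.
fn slices_for_two_pieces(bytes: &[Vec<u8>; 2]) -> [IoSlice<'_>; 2] {
    // One IoSlice::new per piece instead of collecting into a Vec.
    let slices = [IoSlice::new(&bytes[0]), IoSlice::new(&bytes[1])];
    // Both lengths are compile-time constants, so this check should be
    // optimised away.
    assert_eq!(slices.len(), bytes.len());
    slices
}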
Codecov Report
Patch coverage:
Additional details and impacted files
@@ Coverage Diff @@
## master #846 +/- ##
==========================================
- Coverage 12.74% 12.63% -0.11%
==========================================
Files 189 189
Lines 136259 137602 +1343
==========================================
+ Hits 17367 17392 +25
- Misses 118892 120210 +1318
The only solution I can think of right now is passing a scratch buffer to the serialize function:
pub struct NoOperationRequest;
impl NoOperationRequest {
pub const SCRATCH_SIZE: usize = 4;
/// Serialize this request into bytes for the provided connection
pub fn serialize<'input>(self, scratch_buf: &'input mut [u8; Self::SCRATCH_SIZE]) -> BufWithFds<[Cow<'input, [u8]>; 1]> {
let length_so_far = 0;
let request0 = scratch_buf;
*request0 = [
NO_OPERATION_REQUEST,
0,
0,
0,
];
let length_so_far = length_so_far + request0.len();
assert_eq!(length_so_far % 4, 0);
let length = u16::try_from(length_so_far / 4).unwrap_or(0);
request0[2..4].copy_from_slice(&length.to_ne_bytes());
// Borrow the scratch buffer instead of allocating.
([Cow::Borrowed(&request0[..])], vec![])
}
}
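A caller of this hypothetical API might then look roughly like the following sketch (connection handling omitted; only the allocation-free construction of the slices is shown):

// Hypothetical caller of the scratch-buffer serialize() sketched above.
fn build_noop_slices() {
    let mut scratch = [0u8; NoOperationRequest::SCRATCH_SIZE];
    let (bufs, _fds) = NoOperationRequest.serialize(&mut scratch);
    // `bufs` borrows from `scratch`, so no heap allocation is needed for the
    // four bytes of request data.
    let slices = [std::io::IoSlice::new(&bufs[0])];
    assert_eq!(slices.len(), bufs.len());
    // ...pass &slices and the fds on to the connection's send path.
}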
Linking earlier discussion from #687. I might take a crack at implementing the serialization traits I mentioned there this weekend.
Well, that remaining allocation overhead is 12% of the overall runtime, which is not that much (but also still a lot). But hey, this is measuring how many NoOp requests we can send. For anything "serious", I would expect that this does not matter. This PR is just trying to do the easy part. It doesn't hurt much and it still helps a bit. Anything that is more complicated perhaps needs a benchmark doing something actually useful and not this stupid thing. For example, I was thinking about adding a special case to
Yeah, there is already an improvement here that we can land without waiting to decide about further optimizations.
Thanks :-)
I was writing a stupid benchmark: "How quickly can I send NoOperation requests?"
The results still surprised me. "Raw xcb" is almost two orders of magnitude faster. Sure, this is not actually an apples-to-oranges comparison (the x11rb-using code arranges for errors to be sent as events only when the cookie is dropped, while the raw-xcb code does so immediately). Anyway, I investigated a bit (running
target/release/deps/request_sending* --bench --profile-time 10 'noop/RustCon'
under callgrind) and the results indicated that about a quarter of the time is spent in the memory allocator (malloc and free).
Thus, this PR removes some uses of Vec in the request-sending code and replaces them with arrays. Sadly, array::map does not help here, so some tricks with the code generator had to be pulled.
The result according to criterion (and no, I don't want this benchmark in git, since it is actually quite stupid and requires a running X11 server): the change for raw ffi is just noise, but it indicates to me that this is a bad benchmark (did I already say that?). The almost 20% improvement for XcbConnection just shows how expensive these Vecs actually are.
Where is the remaining time spent? Well, here is a screenshot from kcachegrind for the RustConnection no-op example:
The remaining memory allocation / Vec in x11rb::protocol::xproto::NoOperationRequest::serialize is the one that contains the actual four bytes of request data. This is not easily removed, and no, I don't want to use smallvec for this.
I'm not quite sure what is going on in RustConnection::write_all_vectored. Apparently the compiler has a hard time optimising this code. Same for WriteBuffer::write_vectored. Each of these functions uses 8% of the runtime.
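The benchmark itself is intentionally not in the repository. For anyone who wants to reproduce something similar, here is a rough sketch of what such a criterion benchmark could look like; this is a guess rather than the actual code, it needs a running X11 server, and the benchmark id only loosely matches the profile invocation quoted above:

// benches/request_sending.rs -- hypothetical reconstruction, not the PR's benchmark.
use criterion::{criterion_group, criterion_main, Criterion};
use x11rb::connection::Connection;
use x11rb::protocol::xproto::ConnectionExt;

fn noop(c: &mut Criterion) {
    // Needs a running X11 server; connect once, outside the measured loop.
    let (conn, _screen) = x11rb::connect(None).expect("X11 server required");
    c.bench_function("noop/RustConnection", |b| {
        b.iter(|| {
            // Send a NoOperation request and drop the cookie immediately.
            conn.no_operation().unwrap();
        });
        // Make sure the queued requests actually reach the server.
        conn.flush().unwrap();
    });
}

criterion_group!(benches, noop);
criterion_main!(benches);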