Avoid some "Vec"s in request sending #846

Merged
merged 3 commits into from Jun 4, 2023
Conversation

psychon (Owner) commented Jun 3, 2023

I was writing a stupid benchmark. "How quickly can I send NoOperation requests?"
use criterion::{criterion_group, criterion_main, Criterion};
use x11rb::protocol::xproto::ConnectionExt;

use libc::c_void;
extern "C" {
    fn xcb_no_operation(c: *mut c_void);
}

fn send_noop(crit: &mut Criterion) {
    let rust_connection = x11rb::rust_connection::RustConnection::connect(None)
        .unwrap()
        .0;
    let xcb_connection = x11rb::xcb_ffi::XCBConnection::connect(None).unwrap().0;

    let mut noop = crit.benchmark_group("send noop");
    noop.bench_function("RustConn", |b| b.iter(|| rust_connection.no_operation()));
    noop.bench_function("XcbConn", |b| b.iter(|| xcb_connection.no_operation()));
    noop.bench_function("raw ffi", |b| {
        b.iter(|| unsafe {
            xcb_no_operation(xcb_connection.get_raw_xcb_connection());
        })
    });
    noop.finish();

    let mut gif = crit.benchmark_group("get input focus");
    gif.bench_function("RustConn", |b| {
        b.iter(|| rust_connection.get_input_focus().unwrap().reply().unwrap())
    });
    gif.bench_function("XcbConn", |b| {
        b.iter(|| xcb_connection.get_input_focus().unwrap().reply().unwrap())
    });
    gif.finish();
}

criterion_group!(benches, send_noop,);
criterion_main!(benches);

The results still surprised me: "raw xcb" is almost two orders of magnitude faster. Sure, this is not actually an apples-to-apples comparison (the x11rb-using code arranges for errors to be sent as events only when the cookie is dropped, while the raw xcb call does so immediately). Anyway, I investigated a bit (running target/release/deps/request_sending* --bench --profile-time 10 'noop/RustCon' under callgrind) and the results indicated that about a quarter of the time is spent in the memory allocator (malloc and free).

Thus, this PR removes some uses of Vec in the request sending code and replaces that with arrays. Sadly, array::map does not help here, so some tricks with the code generator had to be pulled.

The result according to criterion (and no, I don't want this benchmark in git since it is actually quite stupid and requires a running X11 server):

send noop/RustConn      time:   [1.0765 µs 1.0774 µs 1.0785 µs]
                        change: [-2.5423% -2.2960% -2.0644%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
send noop/XcbConn       time:   [116.43 ns 116.67 ns 116.96 ns]
                        change: [-23.732% -18.819% -13.703%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe
send noop/raw ffi       time:   [37.023 ns 37.049 ns 37.090 ns]
                        change: [+12.422% +13.812% +14.879%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) low severe
  10 (10.00%) low mild
  4 (4.00%) high mild
  8 (8.00%) high severe

get input focus/RustConn
                        time:   [18.344 µs 18.374 µs 18.409 µs]
                        change: [-0.5995% -0.3961% -0.1726%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
get input focus/XcbConn time:   [15.032 µs 15.430 µs 15.917 µs]
                        change: [+0.0639% +2.0381% +3.9186%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

The change for raw ffi is just noise, but indicates to me that this is a bad benchmark (did I already say that?). The almost 20% improvement for XcbConnection just shows how expensive these Vecs actually are.

Where is the remaining time spent? Well, here is a screenshot from kcachegrind for the RustConnection no-op example:

[kcachegrind screenshot]

The remaining memory allocation / Vec in x11rb::protocol::xproto::NoOperationRequest::serialize is the one that contains the actual four bytes of request data. This is not easily removed. And no, I don't want to use smallvec for this.

I'm not quite sure what is going on in RustConnection::write_all_vectored. Apparently the compiler has a hard time optimising this code. The same goes for WriteBuffer::write_vectored. Each of these functions uses 8% of the run time.

psychon added 3 commits June 3, 2023 14:42
The code for the serialising function is now first generated and saved
into a temporary string. This string is then added to the output.

No change in behaviour occurs.

Signed-off-by: Uli Schlachter <[email protected]>
Before this commit, serialising a request returns a
BufWithFds<PiecewiseBuf<'some_lifetime>>, which expands to
(Vec<Cow<'some_lifetime, [u8]>>, Vec<RawFdContainer>). That's a lot of
Vecs and most of them are not necessary since we statically know how
many "pieces" a request needs.

Side note: Why not just use a single Vec<u8>? There are some situations
where we can avoid copying some data. Think e.g. about PutImage.

This commit removes PiecewiseBuf (which is a Vec) and instead makes the
code generator use an array. This is possible since we statically know
the number of pieces for request sending.

The benefit is that this avoids a temporary allocation for the Vec.

Signed-off-by: Uli Schlachter <[email protected]>
Before this commit, the request functions in x11rb and x11rb-async
contained code like this:

    let (bytes, fds) = request0.serialize(major_opcode(conn).await?);
    let slices = bytes.iter().map(|b| IoSlice::new(b)).collect::<Vec<_>>();
    conn.send_request_without_reply(&slices, fds).await

Here, "bytes" is an array (thanks to the previous commit) and each of its
entries is wrapped in an IoSlice. The result is saved in a Vec.

This commit replaces the above with the following code:

    let (bytes, fds) = request0.serialize(major_opcode(conn).await?);
    let slices = [IoSlice::new(&bytes[0])];
    assert_eq!(slices.len(), bytes.len());
    conn.send_request_without_reply(&slices, fds).await

Instead of a temporary allocation with a Vec, another array is created.
I found no generic way to create this array, so instead the code
generator "knows" what the length of the array must be and repeats each
case as necessary.

Just to double-check that we got this right, an assertion is added to
check that "bytes" and "slices" have the same length. Since both are
arrays, their lengths are constants and this assertion can hopefully be
optimised away by the compiler.

Signed-off-by: Uli Schlachter <[email protected]>
codecov bot commented Jun 3, 2023

Codecov Report

Patch coverage: 1.11% and project coverage change: -0.11 ⚠️

Comparison is base (82cbec4) 12.74% compared to head (b2fcdd2) 12.63%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
- Coverage   12.74%   12.63%   -0.11%     
==========================================
  Files         189      189              
  Lines      136259   137602    +1343     
==========================================
+ Hits        17367    17392      +25     
- Misses     118892   120210    +1318     
Flag Coverage Δ
tests 12.63% <1.11%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
x11rb-async/src/protocol/bigreq.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/composite.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/damage.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/dbe.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/dpms.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/dri2.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/dri3.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/ge.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/glx.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/present.rs 0.00% <0.00%> (ø)
... and 90 more

... and 1 file with indirect coverage changes


eduardosm (Collaborator) commented Jun 3, 2023

The remaining memory allocation / Vec in x11rb::protocol::xproto::NoOperationRequest::serialize is the one that contains the actual four bytes of request data. This is not easily removed. And no, I don't want to use smallvec for this.

The only solution I can think of right now is passing a scratch buffer to the serialize function, something like:

pub struct NoOperationRequest;
impl NoOperationRequest {
    pub const SCRATCH_SIZE: usize = 4;
    /// Serialize this request into bytes for the provided connection
    pub fn serialize<'input>(self, scratch_buf: &'input mut [u8; Self::SCRATCH_SIZE]) -> BufWithFds<[Cow<'input, [u8]>; 1]> {
        let length_so_far = 0;
        let request0 = scratch_buf;
        *request0 = [
            NO_OPERATION_REQUEST,
            0,
            0,
            0,
        ];
        let length_so_far = length_so_far + request0.len();
        assert_eq!(length_so_far % 4, 0);
        let length = u16::try_from(length_so_far / 4).unwrap_or(0);
        request0[2..4].copy_from_slice(&length.to_ne_bytes());
        ([request0.into()], vec![])
    }
}

notgull (Collaborator) commented Jun 3, 2023

Linking earlier discussion from #687. I might take a crack at implementing the serialization traits I mentioned there this weekend.

psychon (Owner, Author) commented Jun 4, 2023

Well, that remaining allocation overhead is 12% of the overall runtime, which is not that much (but also still a lot). But hey, this is measuring how many NoOp requests we can send. For anything "serious", I would expect that this does not matter.

This PR is just trying to do the easy part. It doesn't hurt much and it still helps a bit. Anything that is more complicated perhaps needs a benchmark doing something actually useful and not this stupid thing.

For example, I was thinking about adding a special case to discard_reply that checks whether the call is about the most recently sent request and in that case skips the binary_search (which is taking 19% of run time here - more than the memory allocation + freeing). But that already seemed too complicated for too little gain to me.

eduardosm (Collaborator) left a comment

Yeah, there is already an improvement here that we can land without waiting to decide about further optimizations.

@mergify mergify bot merged commit 655772c into master Jun 4, 2023
@mergify mergify bot deleted the avoid-vec branch June 4, 2023 09:46
psychon (Owner, Author) commented Jun 4, 2023

Thanks :-)
