Avoid some "Vec"s in request sending #846

Merged
merged 3 commits into from Jun 4, 2023
Conversation

psychon (Owner) commented Jun 3, 2023

I was writing a stupid benchmark. "How quickly can I send NoOperation requests?"
use criterion::{criterion_group, criterion_main, Criterion};
use x11rb::protocol::xproto::ConnectionExt;

use libc::c_void;
extern "C" {
    fn xcb_no_operation(c: *mut c_void);
}

fn send_noop(crit: &mut Criterion) {
    let rust_connection = x11rb::rust_connection::RustConnection::connect(None)
        .unwrap()
        .0;
    let xcb_connection = x11rb::xcb_ffi::XCBConnection::connect(None).unwrap().0;

    let mut noop = crit.benchmark_group("send noop");
    noop.bench_function("RustConn", |b| b.iter(|| rust_connection.no_operation()));
    noop.bench_function("XcbConn", |b| b.iter(|| xcb_connection.no_operation()));
    noop.bench_function("raw ffi", |b| {
        b.iter(|| unsafe {
            xcb_no_operation(xcb_connection.get_raw_xcb_connection());
        })
    });
    noop.finish();

    let mut gif = crit.benchmark_group("get input focus");
    gif.bench_function("RustConn", |b| {
        b.iter(|| rust_connection.get_input_focus().unwrap().reply().unwrap())
    });
    gif.bench_function("XcbConn", |b| {
        b.iter(|| xcb_connection.get_input_focus().unwrap().reply().unwrap())
    });
    gif.finish();
}

criterion_group!(benches, send_noop,);
criterion_main!(benches);

The results still surprised me: "raw xcb" is almost two orders of magnitude faster. Sure, this is not actually an apples-to-apples comparison (the x11rb-using code arranges for errors to be sent as events only when the cookie is dropped, while the raw xcb call does so immediately). Anyway, I investigated a bit (running target/release/deps/request_sending* --bench --profile-time 10 'noop/RustCon' under callgrind) and the results indicated that about a quarter of the time is spent in the memory allocator (malloc and free).

Thus, this PR removes some uses of Vec in the request sending code and replaces that with arrays. Sadly, array::map does not help here, so some tricks with the code generator had to be pulled.

The result according to criterion (and no, I don't want this benchmark in git since it is actually quite stupid and requires a running X11 server):

send noop/RustConn      time:   [1.0765 µs 1.0774 µs 1.0785 µs]
                        change: [-2.5423% -2.2960% -2.0644%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
send noop/XcbConn       time:   [116.43 ns 116.67 ns 116.96 ns]
                        change: [-23.732% -18.819% -13.703%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) low severe
  2 (2.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe
send noop/raw ffi       time:   [37.023 ns 37.049 ns 37.090 ns]
                        change: [+12.422% +13.812% +14.879%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 24 outliers among 100 measurements (24.00%)
  2 (2.00%) low severe
  10 (10.00%) low mild
  4 (4.00%) high mild
  8 (8.00%) high severe

get input focus/RustConn
                        time:   [18.344 µs 18.374 µs 18.409 µs]
                        change: [-0.5995% -0.3961% -0.1726%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
get input focus/XcbConn time:   [15.032 µs 15.430 µs 15.917 µs]
                        change: [+0.0639% +2.0381% +3.9186%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe

The change for raw ffi is just noise, but indicates to me that this is a bad benchmark (did I already say that?). The almost 20% improvement for XcbConnection just shows how expensive these Vecs actually are.

Where is the remaining time spent? Well, here is a screenshot from kcachegrind for the RustConnection no-op example:

[kcachegrind screenshot]

The remaining memory allocation / Vec in x11rb::protocol::xproto::NoOperationRequest::serialize is the one that contains the actual four bytes of request data. This is not easily removed. And no, I don't want to use smallvec for this.

I'm not quite sure what is going on in RustConnection::write_all_vectored. Apparently the compiler has a hard time optimising this code. The same goes for WriteBuffer::write_vectored. Each of these functions uses 8% of the run time.

psychon added 3 commits June 3, 2023 14:42
The code for the serialising function is now first generated and saved
into a temporary string. This string is then added to the output.

No change in behaviour occurs.

Signed-off-by: Uli Schlachter <[email protected]>
Before this commit, serialising a request returns a
BufWithFds<PiecewiseBuf<'some_lifetime>>, which expands to
(Vec<Cow<'some_lifetime, [u8]>>, Vec<RawFdContainer>). That's a lot of
Vecs and most of them are not necessary since we statically know how
many "pieces" a request needs.

Side note: Why not just use a single Vec<u8>? There are some situations
where we can avoid copying some data. Think e.g. about PutImage.

This commit removes PiecewiseBuf (which is a Vec) and instead makes the
code generator use an array. This is possible since we statically know
the number of pieces for request sending.

The benefit is that this avoids a temporary allocation for the Vec.

Signed-off-by: Uli Schlachter <[email protected]>
Before this commit, the request functions in x11rb and x11rb-async
contained code like this:

    let (bytes, fds) = request0.serialize(major_opcode(conn).await?);
    let slices = bytes.iter().map(|b| IoSlice::new(b)).collect::<Vec<_>>();
    conn.send_request_without_reply(&slices, fds).await

Here, "bytes" is an array (thanks to the previous commit) and each of its
entries is wrapped in an IoSlice. The result is saved in a Vec.

This commit replaces the above with the following code:

    let (bytes, fds) = request0.serialize(major_opcode(conn).await?);
    let slices = [IoSlice::new(&bytes[0])];
    assert_eq!(slices.len(), bytes.len());
    conn.send_request_without_reply(&slices, fds).await

Instead of a temporary allocation with a Vec, another array is created.
I found no generic way to create this array, so instead the code
generator "knows" what the length of the array must be and repeats each
case as necessary.

Just to double-check that we got this right, an assertion is added to
check that "bytes" and "slices" have the same length. Since both are
arrays, their lengths are constants and this assertion can hopefully be
optimised away by the compiler.

Signed-off-by: Uli Schlachter <[email protected]>
codecov bot commented Jun 3, 2023

Codecov Report

Patch coverage: 1.11% and project coverage change: -0.11 ⚠️

Comparison is base (82cbec4) 12.74% compared to head (b2fcdd2) 12.63%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #846      +/-   ##
==========================================
- Coverage   12.74%   12.63%   -0.11%     
==========================================
  Files         189      189              
  Lines      136259   137602    +1343     
==========================================
+ Hits        17367    17392      +25     
- Misses     118892   120210    +1318     
Flag Coverage Δ
tests 12.63% <1.11%> (-0.11%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
x11rb-async/src/protocol/bigreq.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/composite.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/damage.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/dbe.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/dpms.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/dri2.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/dri3.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/ge.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/glx.rs 0.00% <0.00%> (ø)
x11rb-async/src/protocol/present.rs 0.00% <0.00%> (ø)
... and 90 more

... and 1 file with indirect coverage changes


eduardosm (Collaborator) commented Jun 3, 2023

The remaining memory allocation / Vec in x11rb::protocol::xproto::NoOperationRequest::serialize is the one that contains the actual four bytes of request data. This is not easily removed. And no, I don't want to use smallvec for this.

The only solution I can think of right now is passing a scratch buffer to the serialize function, something like:

pub struct NoOperationRequest;
impl NoOperationRequest {
    pub const SCRATCH_SIZE: usize = 4;
    /// Serialize this request into bytes for the provided connection
    pub fn serialize<'input>(self, scratch_buf: &'input mut [u8; Self::SCRATCH_SIZE]) -> BufWithFds<[Cow<'input, [u8]>; 1]> {
        let length_so_far = 0;
        let request0 = scratch_buf;
        *request0 = [
            NO_OPERATION_REQUEST,
            0,
            0,
            0,
        ];
        let length_so_far = length_so_far + request0.len();
        assert_eq!(length_so_far % 4, 0);
        let length = u16::try_from(length_so_far / 4).unwrap_or(0);
        request0[2..4].copy_from_slice(&length.to_ne_bytes());
        ([request0.into()], vec![])
    }
}

notgull (Collaborator) commented Jun 3, 2023

Linking earlier discussion from #687. I might take a crack at implementing the serialization traits I mentioned there this weekend.

psychon (Owner, Author) commented Jun 4, 2023

Well, that remaining allocation overhead is 12% of the overall runtime, which is not that much (but also still a lot). But hey, this is measuring how many NoOp requests we can send. For anything "serious", I would expect that this does not matter.

This PR is just trying to do the easy part. It doesn't hurt much and it still helps a bit. Anything that is more complicated perhaps needs a benchmark doing something actually useful and not this stupid thing.

For example, I was thinking about adding a special case to discard_reply that checks whether the call is about the most recently sent request and in that case skips the binary_search (which is taking 19% of run time here - more than the memory allocation + freeing). But that already seemed too complicated for too little gain to me.

eduardosm (Collaborator) left a comment

Yeah, there is already an improvement here that we can land without waiting to decide about further optimizations.

@mergify mergify bot merged commit 655772c into master Jun 4, 2023
@mergify mergify bot deleted the avoid-vec branch June 4, 2023 09:46
psychon (Owner, Author) commented Jun 4, 2023

Thanks :-)
