Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use MSG_ZEROCOPY for plaintext replication traffic #1543

Open
wants to merge 24 commits into
base: unstable
Choose a base branch
from

Conversation

murphyjacob4
Copy link
Contributor

@murphyjacob4 murphyjacob4 commented Jan 11, 2025

Summary

This PR integrates with MSG_ZEROCOPY for plaintext replication traffic. MSG_ZEROCOPY is a Linux kernel functionality that allows writes to avoid copying data from user space to kernel space. Instead, MSG_ZEROCOPY allows the kernel to pin the user space data into kernel space for asynchronous writing without any copying. In turn, users have to track the ongoing writes and ensure the data is kept stable, listening on the socket's message queue for a completion notification.

Design

We track the ongoing writes through a dynamic circular buffer, zeroCopyTracker. Each ongoing write is indexed by its sequence number, matching the one assigned by the kernel, with each entry holding a refcount into the replication backlog. We also add a new event type to the event loop APIs (AE_ERROR_QUEUE) to register for SO_EE_ORIGIN_ZEROCOPY notifications from the kernel. When we choose to use zero-copy for writes, we append a tracking entry to the circular buffer and register for the AE_ERROR_QUEUE event, which later fires and notifies us of a batch of completions, allowing us to trim the tracker which in turn decrements the refcount and trims the replication backlog.

To maintain a net improvement on performance,MSG_ZEROCOPY is only used in situations where it should perform better than our typical write syscalls:

  1. MSG_ZEROCOPY is only used for writes that are over 10 KiB (by default, user configurable)
  2. MSG_ZEROCOPY is only used for remote connections
  3. MSG_ZEROCOPY is only used for plaintext connections
    1. Adding support for TLS should be feasible, but leaving this as an iterative improvement

This PR enables MSG_ZEROCOPY as the new default behavior for writes from the replication backlog that meet the aforementioned criteria on builds that support it. It also introduces a config to disable it altogether via --tcp-tx-zerocopy no.

Although this PR only implements it for replication, the core concept should be reusable at various other places where we do writes, so long as reference counting can be used to defer cleanup until a later epoll cycle is notified of the write completion. The prospects of combining this with something like IO uring via IORING_OP_SENDZC are especially exciting. We should aim to continue to evolve the core concept of the zeroCopyTracker to meet these needs.

Some other notable call outs in the implementation:

  1. The connection shutdown flow needs to be modified for connections that have ongoing zero copy writes. We are not allowed to touch this memory until the kernel has explicitly told us to do so. Instead of the normal close operation, we instead go into a draining phase where we first call shutdown, then await the zero copy tracker to reach zero length before later calling close and freeing the client in entirety.
  2. Using zero-copy should be a net improvement for memory, however it does change the semantics of how memory is accounted. Previously, TCP memory was not tracked at all by the Valkey process, but now we will be internalizing this TCP memory into the buffers that we hold for the kernel. Overall, this should reduce memory usage on the machine since each ongoing write for the same memory will not require a copy, but it may lead to situations where replication backlog and Valkey process memory looks larger in comparison to previous versions.
  3. For testing purposes, we add two DEBUG commands, one to allow us to simulate a slow replica via pausing completion notification events, and another to allow us to use zero-copy over loopback.
  4. Introduced some new stats to help track how many zero-copy writes are happening, how many are currently in flight, and how many connections are in the "draining" state. Additionally, added memory tracking info for the zeroCopyTracker.
  5. Since the kernel only uses uint32_t sequence numbers, it is important that we handle sequence number wrap around gracefully in zeroCopyTracker. A single replication link with 4 billion writes is not unexpected on a long-running instance.

Performance

Copying and pasting some initial performance comparisons from #1335:

Key size (B) Primary Throughput Delta Replica Throughput Delta
1024 +2.31% +2.31%
4096 +4.30% +4.30%
10240 +7.67% +7.67%
40960 +17.67% +17.67%
102400 +16.02% +16.02%
409600 +16.84% +1.57%

closes #1335

Signed-off-by: Jacob Murphy <[email protected]>
…regardless of tcp memory buffers

Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Signed-off-by: Jacob Murphy <[email protected]>
Copy link

codecov bot commented Jan 11, 2025

Codecov Report

Attention: Patch coverage is 91.10169% with 21 lines in your changes missing coverage. Please review.

Project coverage is 70.88%. Comparing base (e60990e) to head (89e614e).
Report is 7 commits behind head on unstable.

Files with missing lines Patch % Lines
src/zerocopy.c 90.83% 11 Missing ⚠️
src/ae.c 91.42% 3 Missing ⚠️
src/config.c 0.00% 3 Missing ⚠️
src/anet.c 50.00% 2 Missing ⚠️
src/networking.c 97.05% 1 Missing ⚠️
src/socket.c 94.44% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1543      +/-   ##
============================================
+ Coverage     70.77%   70.88%   +0.11%     
============================================
  Files           120      121       +1     
  Lines         65005    65274     +269     
============================================
+ Hits          46005    46268     +263     
- Misses        19000    19006       +6     
Files with missing lines Coverage Δ
src/ae_epoll.c 85.71% <100.00%> (+0.42%) ⬆️
src/connection.h 88.29% <100.00%> (+0.79%) ⬆️
src/debug.c 52.38% <100.00%> (+0.25%) ⬆️
src/object.c 82.18% <100.00%> (+0.03%) ⬆️
src/server.c 87.60% <100.00%> (+0.01%) ⬆️
src/server.h 100.00% <ø> (ø)
src/networking.c 88.57% <97.05%> (+0.09%) ⬆️
src/socket.c 91.83% <94.44%> (+0.21%) ⬆️
src/anet.c 72.96% <50.00%> (-0.28%) ⬇️
src/ae.c 79.50% <91.42%> (+1.85%) ⬆️
... and 2 more

... and 12 files with indirect coverage changes

Copy link
Member

@ranshid ranshid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some highlevel comments.
I still did not dive deep into the review of this, but it does sound promising.
I also wonder how this will look like when we do integrate io-uring.
there would be several options to go:

  1. no io-uring AND no ZERO-COPY support
  2. io-uring support but NO ZERO-COPY support
  3. both io-uring AND ZERO-COPY support

I think we would probably want to support all 3 modes, but we could also just invest in io-uring+ZERO copy support which could simplify the tracker code (am I correct in this observation?)

Comment on lines +480 to 488
if (invert) {
event_order[0] = AE_ERROR_QUEUE;
event_order[1] = AE_WRITABLE;
event_order[2] = AE_READABLE;
} else {
event_order[0] = AE_ERROR_QUEUE;
event_order[1] = AE_READABLE;
event_order[2] = AE_WRITABLE;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (invert) {
event_order[0] = AE_ERROR_QUEUE;
event_order[1] = AE_WRITABLE;
event_order[2] = AE_READABLE;
} else {
event_order[0] = AE_ERROR_QUEUE;
event_order[1] = AE_READABLE;
event_order[2] = AE_WRITABLE;
}
event_order[0] = AE_ERROR_QUEUE;
event_order[1] = invert ? AE_WRITABLE : AE_READABLE;
event_order[0] = invert ? AE_READABLE : AE_WRITABLE;

Comment on lines +43 to +50
#define AE_READABLE 1 << 0 /* Fire when descriptor is readable. */
#define AE_WRITABLE 1 << 1 /* Fire when descriptor is writable. */
#define AE_BARRIER 1 << 2 /* With WRITABLE, never fire the event if the \
READABLE event already fired in the same event \
loop iteration. Useful when you want to persist \
things to disk before sending replies, and want \
to do that in a group fashion. */
#define AE_ERROR_QUEUE 1 << 3 /* Fire when descriptor has a message on the \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#define AE_READABLE 1 << 0 /* Fire when descriptor is readable. */
#define AE_WRITABLE 1 << 1 /* Fire when descriptor is writable. */
#define AE_BARRIER 1 << 2 /* With WRITABLE, never fire the event if the \
READABLE event already fired in the same event \
loop iteration. Useful when you want to persist \
things to disk before sending replies, and want \
to do that in a group fashion. */
#define AE_ERROR_QUEUE 1 << 3 /* Fire when descriptor has a message on the \
#define AE_READABLE (1 << 0) /* Fire when descriptor is readable. */
#define AE_WRITABLE (1 << 1) /* Fire when descriptor is writable. */
#define AE_BARRIER (1 << 2) /* With WRITABLE, never fire the event if the \
READABLE event already fired in the same event \
loop iteration. Useful when you want to persist \
things to disk before sending replies, and want \
to do that in a group fashion. */
#define AE_ERROR_QUEUE (1 << 3) /* Fire when descriptor has a message on the \

@@ -3289,6 +3303,7 @@ standardConfig static_configs[] = {
createIntConfig("rdma-port", NULL, MODIFIABLE_CONFIG, 0, 65535, server.rdma_ctx_config.port, 0, INTEGER_CONFIG, NULL, updateRdmaPort),
createIntConfig("rdma-rx-size", NULL, IMMUTABLE_CONFIG, 64 * 1024, 16 * 1024 * 1024, server.rdma_ctx_config.rx_size, 1024 * 1024, INTEGER_CONFIG, NULL, NULL),
createIntConfig("rdma-completion-vector", NULL, IMMUTABLE_CONFIG, -1, 1024, server.rdma_ctx_config.completion_vector, -1, INTEGER_CONFIG, NULL, NULL),
createIntConfig("tcp-zerocopy-min-write-size", NULL, MODIFIABLE_CONFIG, 0, INT_MAX, server.tcp_zerocopy_min_write_size, CONFIG_DEFAULT_ZERO_COPY_MIN_WRITE_SIZE, INTEGER_CONFIG, NULL, NULL),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this falls under the "configurations we do not want users to mess with". agree 10K is the recommended sweet spot, but not sure if this needs any kind of tunability option ATM.
Consider dropping this config for now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also didn't have it at first - but in tests it was much easier to deterministically trigger zero copy writes if I can play with the minimum write size.

configurations we do not want users to mess with

I can move it to a DEBUG command if we would prefer, if we have concerns about maintaining new configs?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why it makes the test much easier? I thiknk DEBUG command is fine and also a HIDDEN config which is named prefixed/suffixed with 'debug' is fine IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why it makes the test much easier?

Ultimately we have no control over if a call to send will return <= the number of bytes we want it to send. So this leads to nondeterministic behavior - we can call SET with a key of length 10 KiB, and sometimes see that this doesn't trigger a zero-copy write since it is pieced into two send syscalls.

I think this falls under the "configurations we do not want users to mess with".

At a higher level - can I ask why it would be bad to expose such a config to the user? I can imagine that a combination of network hardware and kernel version may impact what setting makes sense here. I don't expect many folks to play with it - but it isn't a confusing setting and it isn't like maintenance cost will be high.

Comment on lines +1948 to +1959
size_t data_len = o->used - c->repl_data->ref_block_pos;
int use_zerocopy = shouldUseZeroCopy(c->conn, data_len);
if (use_zerocopy) {
/* Lazily enable zero copy at the socket level only on first use */
if (!c->zero_copy_tracker) {
connSetZeroCopy(c->conn, 1);
c->zero_copy_tracker = createZeroCopyTracker();
}
nwritten = zeroCopyWriteToConn(c->conn, o->buf + c->repl_data->ref_block_pos, data_len);
} else {
nwritten = connWrite(c->conn, o->buf + c->repl_data->ref_block_pos, data_len);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it could be encapsulated in the connection abstraction maybe? I wonder if the tracker should be a client or a connection owned? I think it makes more sense that it will be part of the connection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question - on my initial pass I was thinking that the tracker fit better as a client-level abstraction simply because the logic is heavily tied to the replication backlog, and having the connection deal with replication backlog mechanics felt like a bad smell. But perhaps it would be better to start with a tracker that supports a more generic callback-based approach:

typedef void (*ZeroCopyWriteCompleteCallback)(void *);
/* Tracking struct used to trigger cleanup once a zero-copy write is finished */
typedef struct zeroCopyRecord {
    int active;
    ZeroCopyWriteCompleteCallback callback;
    void *privdata;
} zeroCopyRecord;

Then maybe connection exposes an API like:

int connWriteZeroCopy(connections *conn, char *bug, size_t len, ZeroCopyWriteCompleteCallback callback, void *privdata) {
    /* Do write and enqueue zeroCopyRecord */
    /* Later, when zero copy is done, connection will trigger the ZeroCopyWriteCompleteCallback with privdata */
}

Is this what you had in mind?

@murphyjacob4
Copy link
Contributor Author

I think we would probably want to support all 3 modes, but we could also just invest in io-uring+ZERO copy support which could simplify the tracker code (am I correct in this observation?)

Yeah I believe the polling mechanism will be different:

  • MSG_ZEROCOPY - poll the socket for error queue messages with EPOLLERR
  • io-uring+zero copy - polling a completion queue for completion queue entries with EPOLLIN

So the difference would be in having one vs. two handlers, but ideally once the completion notification is read we can merge into a common format and process it similarly for both.

There are limitations regarding io-uring adoption (https://www.phoronix.com/news/Google-Restricting-IO_uring) so having both modes would be ideal for those who are unable/unwilling to adopt IO uring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Psync Zero-Copy Transmission
2 participants