
[NEW] Reply Offload #1353

alexander-shabanov opened this issue Nov 25, 2024 · 6 comments

@alexander-shabanov

Reply Offload

The idea for this proposal was brought up by @touitou-dan and @uriyage.

Problem

In Valkey, when the main thread builds a reply to a command, it copies data from a robj into the client's reply buffer (i.e. the client output buffer, or cob). Later, when the reply buffers are written to the client's connection, this data is copied again from the reply buffer by write/writev. So, robj data is copied twice.

Proposition

We suggest optimizing reply handling to eliminate the data copy done on the main thread, as follows. If IO threads are active, this also completely eliminates the expensive memory access to the robj value (robj->ptr) on the main thread.

The main thread will write a pointer to the robj into the reply buffer instead of the robj data. The thread writing to the client's connection, either an IO thread or the main thread if IO threads are inactive, will write the corresponding part of the reply to the connection directly from the robj. Since regular data and pointers will be mixed within the reply buffers, a serialization scheme will be needed to organize the data in the reply buffers.

The writing thread will need to build offloaded replies from the robj pointers on the fly and use writev to write to the client's connection, because the reply data will be scattered: part in reply buffers (regular, non-offloaded replies) and part in robj objects (offloaded replies). For example, if the command “GET greeting” is issued and the “greeting” key holds the value “hello”, Valkey is expected to reply $5\r\nhello\r\n. So simplified code in the writing thread will look like this:

/* Read the offloaded robj pointer stored in the reply buffer. */
robj *value_obj;
memcpy(&value_obj, c->buf + some_pos, sizeof(value_obj));
char *str = value_obj->ptr;
size_t str_len = stringObjectLen(value_obj);

/* Build the RESP bulk string "$5\r\nhello\r\n" around the robj data;
 * the value bytes are referenced in place, never copied by the server. */
struct iovec iov[3];
char *prefix = "$5\r\n"; /* hardcoded for the "hello" example */
char *suffix = "\r\n";
iov[0].iov_base = prefix;
iov[0].iov_len = strlen(prefix);
iov[1].iov_base = str;
iov[1].iov_len = str_len;
iov[2].iov_base = suffix;
iov[2].iov_len = strlen(suffix);

connWritev(c->conn, iov, 3);

A proper generalized implementation will write the content of all replies, regular and offloaded, to the client's connection using a single writev call.

The performance improvement has been measured using a proof-of-concept implementation and the setup described in this article. The TPS for GET commands with a 512-byte data size increased from 1.07 million to 1.3 million requests per second, and with a 4096-byte data size from 750,000 to 900,000. For 512-byte GETs with io-threads disabled there was no noticeable change: around 190,000 with and without reply offload.

The Reply Offload technique is based on ideas outlined in Reaching 1 million requests per second on a single Valkey instance and provides an additional improvement on top of the major ones implemented in #758, #763, #861.

Scope

This document proposes to apply Reply Offload to string objects; specifically, to commands that use addReplyBulk to build a reply with robj objects of type OBJ_STRING and encoding OBJ_ENCODING_RAW. Reply Offload is straightforward for this case and will benefit frequently used commands like GET and MGET. In the future, Reply Offload will be extended to more complex object types.
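For illustration, a minimal sketch of the eligibility check this scope implies (the helper name is hypothetical, not part of the proposal):

/* Hypothetical helper: only raw-encoded string objects qualify for offload. */
static inline int replyOffloadable(const robj *obj) {
    return obj->type == OBJ_STRING && obj->encoding == OBJ_ENCODING_RAW;
}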

Implementation

The existing _addReplyToBuffer and _addReplyProtoToList functions will be extended to prepend raw data written into reply buffers with a payload header carrying the CLIENT_REPLY_PAYLOAD_DATA type and the corresponding size.

Additionally, new _addReplyOffloadToBuffer and _addReplyOffloadToList functions will be introduced to pack robj pointers into reply buffers using a payload header with the CLIENT_REPLY_PAYLOAD_ROBJ_PTR type.

The main thread will detect replies eligible for offloading (i.e. a robj with OBJ_ENCODING_RAW encoding), increment the robj reference counter, and offload them using _addReplyOffloadToBuffer / _addReplyOffloadToList. The robj reference counter will be decremented back on the main thread when the write is completed, in the postWriteToClient callback.

A new header will be inserted only if the _addReply functions need to write a payload type different from the last one; otherwise, the last header will be updated and the raw data or pointer will be appended.

In the diagram below, the reply buffer [16k] is c->buf in the code and the reply list is c->reply.
[Diagram: ReplyOffloadSerialization]

typedef enum {
    CLIENT_REPLY_PAYLOAD_DATA = 1,
    CLIENT_REPLY_PAYLOAD_ROBJ_PTR = 2,
} clientReplyPayloadType;

/* Reply payload header */
typedef struct payloadHeader {
    uint8_t type;
    uint32_t size;
} payloadHeader;
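As an illustration of the serialization above, a rough sketch of the buffer-side offload helper, assuming the payloadHeader type above; the client fields used for tracking (bufpos, buf_usable_size, last_header) are illustrative names, not necessarily the actual Valkey fields:

/* Hypothetical sketch: append an offloaded robj pointer to the client's
 * reply buffer, reusing the last header if it already has the same type. */
static int _addReplyOffloadToBuffer(client *c, robj *obj) {
    size_t needed = sizeof(payloadHeader) + sizeof(obj);
    if (c->bufpos + needed > c->buf_usable_size) return C_ERR; /* fall back to reply list */

    payloadHeader *last = c->last_header;
    if (last == NULL || last->type != CLIENT_REPLY_PAYLOAD_ROBJ_PTR) {
        /* Payload type changed: start a new header. */
        last = (payloadHeader *)(c->buf + c->bufpos);
        last->type = CLIENT_REPLY_PAYLOAD_ROBJ_PTR;
        last->size = 0;
        c->bufpos += sizeof(payloadHeader);
        c->last_header = last;
    }

    incrRefCount(obj); /* released in postWriteToClient after the write completes */
    memcpy(c->buf + c->bufpos, &obj, sizeof(obj));
    c->bufpos += sizeof(obj);
    last->size += sizeof(obj);
    return C_OK;
}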

In the writing thread, either an IO thread or the main thread if IO threads are inactive, if a client is in reply offload mode then the _writeToClient function will always choose the writevToClient flow. writevToClient will process the data in the reply buffers according to their headers. Specifically, it will pack offloaded reply data (robj->ptr) directly into iov (an array of iovec), as explained in the Proposition section.
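A rough sketch of that processing, assuming the serialized buffer layout above; the helper name is hypothetical, and the "$<len>\r\n" prefix, trailing "\r\n", and partial-write resumption are omitted for brevity:

/* Hypothetical sketch: walk the serialized reply buffer header by header and
 * fill the iovec array; raw data is referenced in place, offloaded robj
 * pointers are expanded into their value bytes. */
static int buildReplyIov(char *buf, size_t used, struct iovec *iov, int iovmax) {
    int iovcnt = 0;
    size_t pos = 0;
    while (pos < used && iovcnt < iovmax) {
        payloadHeader hdr;
        memcpy(&hdr, buf + pos, sizeof(hdr));
        pos += sizeof(hdr);

        if (hdr.type == CLIENT_REPLY_PAYLOAD_DATA) {
            /* Regular reply bytes: point at them directly, no copy. */
            iov[iovcnt].iov_base = buf + pos;
            iov[iovcnt].iov_len = hdr.size;
            iovcnt++;
        } else { /* CLIENT_REPLY_PAYLOAD_ROBJ_PTR */
            for (size_t off = 0; off + sizeof(robj *) <= hdr.size && iovcnt < iovmax;
                 off += sizeof(robj *)) {
                robj *obj;
                memcpy(&obj, buf + pos + off, sizeof(obj));
                iov[iovcnt].iov_base = obj->ptr;
                iov[iovcnt].iov_len = stringObjectLen(obj);
                iovcnt++;
            }
        }
        pos += hdr.size;
    }
    return iovcnt;
}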

Configuration

The “io-threads-reply-offload” config setting will be introduced to enable or disable the reply offload optimization. It should be applied gracefully (i.e. switched on/off for a specific client only when it has no in-flight replies).
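For example, assuming the setting keeps the proposed name and is runtime-mutable like other io-threads settings (both assumptions, since this is still a proposal), usage would look like:

# valkey.conf
io-threads-reply-offload yes

# or at runtime
CONFIG SET io-threads-reply-offload no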

Implementation Challenges

The challenges for possible Reply Offload implementations are:

  • mix raw data and pointers inside reply buffers
  • maintain strict order of replies
  • minimize memory consumption increase by client output buffers
  • eliminate or minimize any performance decrease for use cases (commands) not suitable for reply offload
  • minimize complexity of code changes

Alternative Implementation

Above we suggested an implementation that strives to address all of these challenges optimally. Below is a short description of a less optimal alternative.

A simpler alternative implementation is to introduce a flag field on the clientReplyBlock struct, with possible values CLIENT_REPLY_PAYLOAD_RAW_DATA and CLIENT_REPLY_PAYLOAD_RAW_OBJ_PTR, and to put either raw data or robj pointer(s) into the buf of a clientReplyBlock, with no mixing of data and pointers in the same buf. So, each time a payload different from the last one has to be added to the reply buffers, a new clientReplyBlock must be allocated and added to the reply list. The default buf on the client struct can be used the same way, holding either raw data or robj pointer(s).
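A sketch of the struct change behind this alternative (the new field name is illustrative), relative to the existing clientReplyBlock layout:

/* Hypothetical sketch of the alternative: one payload type per block. */
typedef struct clientReplyBlock {
    size_t size, used;
    uint8_t payload_type; /* CLIENT_REPLY_PAYLOAD_RAW_DATA or
                           * CLIENT_REPLY_PAYLOAD_RAW_OBJ_PTR */
    char buf[];           /* holds either raw reply bytes or packed robj
                           * pointers, never both */
} clientReplyBlock;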

This alternative implementation has a more profound negative impact on memory consumption by client output buffers and on performance in mixed workloads (e.g. cmd1, cmd2, cmd3, cmd4, where cmd1 and cmd3 are suitable for offload and cmd2 and cmd4 are not, would require creating at least 3 clientReplyBlock objects).

@madolson
Member

Is there a benefit to enabling this with io threads disabled as well?

@alexander-shabanov
Author

In theory there should be a benefit even without io threads. However, tests with the proof of concept show neither improvement nor degradation, which is surprising. Planning to dive deep on it when we have a more mature implementation.

@zuiderkwast
Contributor

This looks like a good idea.

If there is no degradation in a single-threaded setup, then it can at least save memory used by clients. This is another benefit.

Copying is cheap for small objects to buffers that are already in the CPU cache, but for huge objects (megabytes), I suppose we should see some improvement even for single-threaded.

@alexander-shabanov
Author

I am finishing the full implementation in a few days. Going to test the impact on performance in both single-threaded mode and with IO threads, with 512-byte and several other (bigger) object sizes.

@murphyjacob4
Contributor

Another thing that would be interesting is using https://docs.kernel.org/networking/msg_zerocopy.html in conjunction. I have been playing with MSG_ZEROCOPY in the context of PSYNC and have seen some positive results so far: #1335. Do note that it is only useful when the write is over a certain size.

Something like:

#define PREFIX "$5\r\n"
#define SUFFIX "\r\n"

/* Assumes SO_ZEROCOPY has been enabled on the socket via setsockopt(). */
void writeObjectToSocketNoCopy(int fd, robj *my_robj) {
   struct iovec iov[3];

   iov[0].iov_base = (void *)PREFIX;
   iov[0].iov_len = strlen(PREFIX);

   /* Keep the object alive until the kernel reports zero-copy completion. */
   my_robj->refcount++;
   iov[1].iov_base = my_robj->ptr;
   iov[1].iov_len = stringObjectLen(my_robj);

   iov[2].iov_base = (void *)SUFFIX;
   iov[2].iov_len = strlen(SUFFIX);

   struct msghdr msg = {0};
   msg.msg_iov = iov;
   msg.msg_iovlen = 3;

   ssize_t bytes_sent = sendmsg(fd, &msg, MSG_ZEROCOPY);
}

/* Triggered when zero copy is done, by epoll on main thread */
void onZeroCopyMessage(robj *my_robj) {
   decrRefCount(my_robj);
}

@alexander-shabanov
Author

alexander-shabanov commented Dec 8, 2024

@murphyjacob4 MSG_ZEROCOPY is an interesting capability that should be evaluated. According to https://docs.kernel.org/networking/msg_zerocopy.html it provides a benefit starting from around 10K data size. It may allow us to get the same performance with a lower number of I/O threads (i.e. CPU cores).
