
[NEW] Reply Offload #1353

alexander-shabanov opened this issue Nov 25, 2024 · 6 comments

@alexander-shabanov

Reply Offload

The idea for this proposal was brought up by @touitou-dan and @uriyage.

Problem

In Valkey, when the main thread builds a reply to a command, it copies data from a robj into the client's reply buffer (i.e. the client output buffer, or cob). Later, when the reply buffers are written to the client's connection, this data is copied again from the reply buffer by write/writev. So, robj data is copied twice.

Proposition

We suggest optimizing reply handling to eliminate the data copy done on the main thread, as follows. If IO threads are active, this also completely eliminates the expensive memory access to the robj value (robj->ptr) on the main thread.

The main thread will write a pointer to the robj into the reply buffer instead of the robj data. The thread writing to the client's connection, either an IO thread or the main thread if IO threads are inactive, will write the corresponding part of the reply to the connection directly from the robj. Since regular data and pointers will be mixed within the reply buffers, a serialization scheme will be needed to organize the data in the reply buffers.

The writing thread will need to build offloaded replies from the robj pointers on the fly and use writev to write to the client's connection, because the reply data will be scattered: part in reply buffers (regular, non-offloaded replies) and part in robj objects (offloaded replies). For example, if the command “GET greeting” is issued and the “greeting” key holds the value “hello”, Valkey is expected to reply $5\r\nhello\r\n. So simplified code in the writing thread will look like this:

/* Read the offloaded robj pointer stored in the reply buffer. */
robj *value_obj;
memcpy(&value_obj, c->buf + some_pos, sizeof(value_obj));
char *str = value_obj->ptr;
size_t str_len = stringObjectLen(value_obj);

/* Build the RESP bulk string "$5\r\nhello\r\n" around the robj data;
 * the value bytes are referenced in place, never copied by the server. */
struct iovec iov[3];
char *prefix = "$5\r\n"; /* hardcoded for the "hello" example */
char *suffix = "\r\n";
iov[0].iov_base = prefix;
iov[0].iov_len = strlen(prefix);
iov[1].iov_base = str;
iov[1].iov_len = str_len;
iov[2].iov_base = suffix;
iov[2].iov_len = strlen(suffix);

connWritev(c->conn, iov, 3);

A proper generalized implementation will write the content of all replies, regular and offloaded, to the client's connection using a single writev call.

The performance improvement has been measured using a proof-of-concept implementation and the setup described in this article. The TPS for GET commands with a 512-byte data size increased from 1.07 million to 1.3 million requests per second, and with a 4096-byte data size from 750,000 to 900,000. For 512-byte GETs with io-threads disabled there was no noticeable change: around 190,000 with and without reply offload.

The Reply Offload technique is based on ideas outlined in Reaching 1 million requests per second on a single Valkey instance and provides an additional improvement on top of the major ones implemented in #758, #763, #861.

Scope

This document proposes to apply Reply Offload to string objects; specifically, to commands that use addReplyBulk to build a reply with robj objects of type OBJ_STRING and encoding OBJ_ENCODING_RAW. Reply Offload is straightforward for this case and will benefit frequently used commands like GET and MGET. In the future, Reply Offload will be extended to more complex object types.
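For illustration, a minimal sketch of the eligibility check this scope implies (the helper name is hypothetical, not part of the proposal):

/* Hypothetical helper: only raw-encoded string objects qualify for offload. */
static inline int replyOffloadable(const robj *obj) {
    return obj->type == OBJ_STRING && obj->encoding == OBJ_ENCODING_RAW;
}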

Implementation

The existing _addReplyToBuffer and _addReplyProtoToList functions will be extended to prepend raw data written into reply buffers with a payload header carrying the CLIENT_REPLY_PAYLOAD_DATA type and the corresponding size.

Additionally, new _addReplyOffloadToBuffer and _addReplyOffloadToList functions will be introduced to pack robj pointers into reply buffers using a payload header with the CLIENT_REPLY_PAYLOAD_ROBJ_PTR type.

The main thread will detect replies eligible for offloading (i.e. a robj with OBJ_ENCODING_RAW encoding), increment the robj reference counter, and offload them using _addReplyOffloadToBuffer / _addReplyOffloadToList. The robj reference counter will be decremented back on the main thread when the write is completed, in the postWriteToClient callback.

A new header will be inserted only if the _addReply functions need to write a payload type different from the last one; otherwise, the last header will be updated and the raw data or pointer will be appended.

In the diagram below, the reply buffer [16k] is c->buf in the code and the reply list is c->reply.
[Diagram: ReplyOffloadSerialization]

typedef enum {
    CLIENT_REPLY_PAYLOAD_DATA = 1,
    CLIENT_REPLY_PAYLOAD_ROBJ_PTR = 2,
} clientReplyPayloadType;

/* Reply payload header */
typedef struct payloadHeader {
    uint8_t type;
    uint32_t size;
} payloadHeader;
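As an illustration of the serialization above, a rough sketch of the buffer-side offload helper, assuming the payloadHeader type above; the client fields used for tracking (bufpos, buf_usable_size, last_header) are illustrative names, not necessarily the actual Valkey fields:

/* Hypothetical sketch: append an offloaded robj pointer to the client's
 * reply buffer, reusing the last header if it already has the same type. */
static int _addReplyOffloadToBuffer(client *c, robj *obj) {
    size_t needed = sizeof(payloadHeader) + sizeof(obj);
    if (c->bufpos + needed > c->buf_usable_size) return C_ERR; /* fall back to reply list */

    payloadHeader *last = c->last_header;
    if (last == NULL || last->type != CLIENT_REPLY_PAYLOAD_ROBJ_PTR) {
        /* Payload type changed: start a new header. */
        last = (payloadHeader *)(c->buf + c->bufpos);
        last->type = CLIENT_REPLY_PAYLOAD_ROBJ_PTR;
        last->size = 0;
        c->bufpos += sizeof(payloadHeader);
        c->last_header = last;
    }

    incrRefCount(obj); /* released in postWriteToClient after the write completes */
    memcpy(c->buf + c->bufpos, &obj, sizeof(obj));
    c->bufpos += sizeof(obj);
    last->size += sizeof(obj);
    return C_OK;
}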

In the writing thread, either an IO thread or the main thread if IO threads are inactive, if a client is in reply offload mode then the _writeToClient function will always choose the writevToClient flow. writevToClient will process the data in the reply buffers according to their headers. Specifically, it will pack offloaded reply data (robj->ptr) directly into iov (an array of iovec), as explained in the Proposition section.
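A rough sketch of that processing, assuming the serialized buffer layout above; the helper name is hypothetical, and the "$<len>\r\n" prefix, trailing "\r\n", and partial-write resumption are omitted for brevity:

/* Hypothetical sketch: walk the serialized reply buffer header by header and
 * fill the iovec array; raw data is referenced in place, offloaded robj
 * pointers are expanded into their value bytes. */
static int buildReplyIov(char *buf, size_t used, struct iovec *iov, int iovmax) {
    int iovcnt = 0;
    size_t pos = 0;
    while (pos < used && iovcnt < iovmax) {
        payloadHeader hdr;
        memcpy(&hdr, buf + pos, sizeof(hdr));
        pos += sizeof(hdr);

        if (hdr.type == CLIENT_REPLY_PAYLOAD_DATA) {
            /* Regular reply bytes: point at them directly, no copy. */
            iov[iovcnt].iov_base = buf + pos;
            iov[iovcnt].iov_len = hdr.size;
            iovcnt++;
        } else { /* CLIENT_REPLY_PAYLOAD_ROBJ_PTR */
            for (size_t off = 0; off + sizeof(robj *) <= hdr.size && iovcnt < iovmax;
                 off += sizeof(robj *)) {
                robj *obj;
                memcpy(&obj, buf + pos + off, sizeof(obj));
                iov[iovcnt].iov_base = obj->ptr;
                iov[iovcnt].iov_len = stringObjectLen(obj);
                iovcnt++;
            }
        }
        pos += hdr.size;
    }
    return iovcnt;
}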

Configuration

The “io-threads-reply-offload” config setting will be introduced to enable or disable the reply offload optimization. It should be applied gracefully (i.e. switched on/off for a specific client only when it has no in-flight replies).
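For example, assuming the setting keeps the proposed name and is runtime-mutable like other io-threads settings (both assumptions, since this is still a proposal), usage would look like:

# valkey.conf
io-threads-reply-offload yes

# or at runtime
CONFIG SET io-threads-reply-offload no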

Implementation Challenges

The challenges for possible Reply Offload implementations are:

  • mix raw data and pointers inside reply buffers
  • maintain strict order of replies
  • minimize memory consumption increase by client output buffers
  • eliminate or minimize any performance decrease for use cases (commands) not suitable for reply offload
  • minimize complexity of code changes

Alternative Implementation

Above we suggested an implementation that strives to address all of these challenges optimally. Below is a short description of a less optimal alternative.

A simpler alternative implementation is to introduce a flag field on the clientReplyBlock struct, with possible values CLIENT_REPLY_PAYLOAD_RAW_DATA and CLIENT_REPLY_PAYLOAD_RAW_OBJ_PTR, and to put either raw data or robj pointer(s) into the buf of a clientReplyBlock, with no mixing of data and pointers in the same buf. So, each time a payload different from the last one has to be added to the reply buffers, a new clientReplyBlock must be allocated and added to the reply list. The default buf on the client struct can be used the same way, holding either raw data or robj pointer(s).
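A sketch of the struct change behind this alternative (the new field name is illustrative), relative to the existing clientReplyBlock layout:

/* Hypothetical sketch of the alternative: one payload type per block. */
typedef struct clientReplyBlock {
    size_t size, used;
    uint8_t payload_type; /* CLIENT_REPLY_PAYLOAD_RAW_DATA or
                           * CLIENT_REPLY_PAYLOAD_RAW_OBJ_PTR */
    char buf[];           /* holds either raw reply bytes or packed robj
                           * pointers, never both */
} clientReplyBlock;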

This alternative implementation has a more profound negative impact on memory consumption by client output buffers and on performance in mixed workloads (e.g. cmd1, cmd2, cmd3, cmd4, where cmd1 and cmd3 are suitable for offload and cmd2 and cmd4 are not, would require creating at least 3 clientReplyBlock objects).

@madolson
Member

Is there a benefit to enabling this with io threads disabled as well?

@alexander-shabanov
Author

In theory there should be a benefit even without io threads. However, tests with the proof of concept show neither improvement nor degradation, which is surprising. Planning to dive deep on it when we have a more mature implementation.

@zuiderkwast
Contributor

This looks like a good idea.

If there is no degradation in a single-threaded setup, then it can at least save memory used by clients. This is another benefit.

Copying is cheap for small objects to buffers that are already in the CPU cache, but for huge objects (megabytes), I suppose we should see some improvement even for single-threaded.

@alexander-shabanov
Author

I am finishing the full implementation in a few days. Going to test the impact on performance in both single-threaded mode and with IO threads, with 512-byte and several other (bigger) object sizes.

@murphyjacob4
Contributor

Another thing that would be interesting is using https://docs.kernel.org/networking/msg_zerocopy.html in conjunction. I have been playing with MSG_ZEROCOPY in the context of PSYNC and have seen some positive results so far: #1335. Do note that it is only useful when the write is over a certain size.

Something like:

#define PREFIX "$5\r\n"
#define SUFFIX "\r\n"

/* Assumes SO_ZEROCOPY has been enabled on the socket via setsockopt(). */
void writeObjectToSocketNoCopy(int fd, robj *my_robj) {
   struct iovec iov[3];

   iov[0].iov_base = (void *)PREFIX;
   iov[0].iov_len = strlen(PREFIX);

   /* Keep the object alive until the kernel reports zero-copy completion. */
   my_robj->refcount++;
   iov[1].iov_base = my_robj->ptr;
   iov[1].iov_len = stringObjectLen(my_robj);

   iov[2].iov_base = (void *)SUFFIX;
   iov[2].iov_len = strlen(SUFFIX);

   struct msghdr msg = {0};
   msg.msg_iov = iov;
   msg.msg_iovlen = 3;

   ssize_t bytes_sent = sendmsg(fd, &msg, MSG_ZEROCOPY);
}

/* Triggered when zero copy is done, by epoll on main thread */
void onZeroCopyMessage(robj *my_robj) {
   decrRefCount(my_robj);
}

@alexander-shabanov
Author

alexander-shabanov commented Dec 8, 2024

@murphyjacob4 MSG_ZEROCOPY is an interesting capability that should be evaluated. According to https://docs.kernel.org/networking/msg_zerocopy.html it provides a benefit starting from around 10K data size. It may allow us to get the same performance with a lower number of I/O threads (i.e. CPU cores).
