-
Notifications
You must be signed in to change notification settings - Fork 702
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix bugs related to synchronous TLS connections blocking #837
Fix bugs related to synchronous TLS connections blocking #837
Conversation
Signed-off-by: naglera <[email protected]>
wait_for_log_messages $idx {"*Process is about to stop.*"} 0 2000 1 | ||
wait_for_condition 50 1000 { | ||
[exec ps -o state= -p $pid] eq "T" || | ||
[exec ps -o state= -p $pid] eq "Z" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you plan to catch that? I mean this is basically when we are waiting to collect the zombie, this is not the expected flow we are trying to catch...
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #837 +/- ##
============================================
- Coverage 70.38% 70.23% -0.16%
============================================
Files 112 112
Lines 61462 61470 +8
============================================
- Hits 43261 43173 -88
- Misses 18201 18297 +96
|
Co-authored-by: ranshid <[email protected]> Signed-off-by: naglera <[email protected]>
Signed-off-by: naglera <[email protected]>
@naglera please fix the issue headline as it now include 2 fixes. please also add explanation for the fixes in the top comment |
Signed-off-by: naglera <[email protected]>
Signed-off-by: naglera <[email protected]>
There are some new errors I haven't seen before in the extended tests: https://github.com/valkey-io/valkey/actions/runs/10146979329/job/28056324528?pr=837 |
Signed-off-by: naglera <[email protected]>
On busy hosts, rdb-key-save-delay may pause the process for more then expected due to long recurreing context switches Signed-off-by: naglera <[email protected]>
…lly succeed" This reverts commit 10c73e5. Signed-off-by: naglera <[email protected]>
This reverts commit 5f1cd17. Signed-off-by: naglera <[email protected]>
1.Replica recover rdb-connection killed 2.Replica recover main-connection killed We ca only catch bgsave while in progress or expect replica to compleat the sync but not both. Using rdb-key-save-delay is not predictable enough when machine has high load. Signed-off-by: naglera <[email protected]>
Signed-off-by: naglera <[email protected]>
Signed-off-by: naglera <[email protected]>
1. Test replica's buffer limit reached 2. dual-channel-replication fails when primary diskless disabled In both cases we should not count on rdb-key-save-delay to be precise on machines with high load. Signed-off-by: naglera <[email protected]>
Test replica unable to join dual channel replication sync after started we should not count on rdb-key-save-delay to be precise on machines with high load. Signed-off-by: naglera <[email protected]>
…rite. Dual-channel file descriptors are set to blocking mode in the main thread, followed by a blocking write. This sets the file descriptors to non-blocking if TLS is used (see `connTLSSyncWrite()`). The child process runs in blocking mode. If a write operation fails to write the entire buffer, SSL returns `SSL_ERROR_WANT_WRITE`. This error is ignored, causing the replica to fail in loading the RDB. This change sets file descriptors after the blocking write. Signed-off-by: xbasel <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Signed-off-by: naglera <[email protected]>
Fix "Test dual-channel-replication primary reject set-rdb-client after client killed". When replica is paused rdb child process can't recognize connection closed. Need to resume the replica in order to fail the sync. Signed-off-by: naglera <[email protected]>
When using blocking connection we can't normally close the connection at the main process context, since it may block the main proc if the replica is not responding. We are also unable to skip this since otherwise child process will continue the save. Signed-off-by: naglera <[email protected]>
- dual-channel-replication fails when primary diskless disabled - Replica recover rdb-connection killed In both tests we should make child proces sleep for shorter intervals so the save will be terminated on time Signed-off-by: naglera <[email protected]>
This reverts commit 75ff15d. Signed-off-by: naglera <[email protected]>
…nsfer error Currently lastbgsave_status is used in bgsave or disk-replication, and the target is the disk. In valkey-io#60, we update it when transfer error, i think it is mainly used in tests, so we can use log to replace it. It changes lastbgsave_status to err in this case, but it is strange that it does not set ok or err in the above if and the following else. Also noted this will affect stop-writes-on-bgsave-error. Signed-off-by: Binbin <[email protected]>
dual-channel-replication fails when primary diskless disabled - we should wait for bgproc to exit Signed-off-by: naglera <[email protected]>
- Fix TLS bug where connection were shutdown by primary's main process while the child process was still writing- causing main process to be blocked. - TLS connection fix -file descriptors are set to blocking mode in the main thread, followed by a blocking write. This sets the file descriptors to non-blocking if TLS is used (see `connTLSSyncWrite()`) (@xbasel). - Improve the reliability of dual-channel tests. Modify the pause mechanism to verify process status directly, rather than relying on log. - Ensure that `server.repl_offset` and `server.replid` are updated correctly when dual channel synchronization completes successfully. Thist led to failures in replication tests that validate replication IDs or compare replication offsets. --------- Signed-off-by: naglera <[email protected]> Signed-off-by: naglera <[email protected]> Signed-off-by: xbasel <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Signed-off-by: Binbin <[email protected]> Co-authored-by: ranshid <[email protected]> Co-authored-by: xbasel <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Co-authored-by: Binbin <[email protected]> Signed-off-by: mwish <[email protected]>
- Fix TLS bug where connection were shutdown by primary's main process while the child process was still writing- causing main process to be blocked. - TLS connection fix -file descriptors are set to blocking mode in the main thread, followed by a blocking write. This sets the file descriptors to non-blocking if TLS is used (see `connTLSSyncWrite()`) (@xbasel). - Improve the reliability of dual-channel tests. Modify the pause mechanism to verify process status directly, rather than relying on log. - Ensure that `server.repl_offset` and `server.replid` are updated correctly when dual channel synchronization completes successfully. Thist led to failures in replication tests that validate replication IDs or compare replication offsets. --------- Signed-off-by: naglera <[email protected]> Signed-off-by: naglera <[email protected]> Signed-off-by: xbasel <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Signed-off-by: Binbin <[email protected]> Co-authored-by: ranshid <[email protected]> Co-authored-by: xbasel <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Co-authored-by: Binbin <[email protected]> Signed-off-by: mwish <[email protected]>
- Fix TLS bug where connection were shutdown by primary's main process while the child process was still writing- causing main process to be blocked. - TLS connection fix -file descriptors are set to blocking mode in the main thread, followed by a blocking write. This sets the file descriptors to non-blocking if TLS is used (see `connTLSSyncWrite()`) (@xbasel). - Improve the reliability of dual-channel tests. Modify the pause mechanism to verify process status directly, rather than relying on log. - Ensure that `server.repl_offset` and `server.replid` are updated correctly when dual channel synchronization completes successfully. Thist led to failures in replication tests that validate replication IDs or compare replication offsets. --------- Signed-off-by: naglera <[email protected]> Signed-off-by: naglera <[email protected]> Signed-off-by: xbasel <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Signed-off-by: Binbin <[email protected]> Co-authored-by: ranshid <[email protected]> Co-authored-by: xbasel <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Co-authored-by: Binbin <[email protected]>
- Fix TLS bug where connection were shutdown by primary's main process while the child process was still writing- causing main process to be blocked. - TLS connection fix -file descriptors are set to blocking mode in the main thread, followed by a blocking write. This sets the file descriptors to non-blocking if TLS is used (see `connTLSSyncWrite()`) (@xbasel). - Improve the reliability of dual-channel tests. Modify the pause mechanism to verify process status directly, rather than relying on log. - Ensure that `server.repl_offset` and `server.replid` are updated correctly when dual channel synchronization completes successfully. Thist led to failures in replication tests that validate replication IDs or compare replication offsets. --------- Signed-off-by: naglera <[email protected]> Signed-off-by: naglera <[email protected]> Signed-off-by: xbasel <[email protected]> Signed-off-by: Madelyn Olson <[email protected]> Signed-off-by: Binbin <[email protected]> Co-authored-by: ranshid <[email protected]> Co-authored-by: xbasel <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Co-authored-by: Binbin <[email protected]>
This change prevents unintended side effects on connection state and improves consistency with non-TLS sync operations. For example, when invoking `connTLSSyncRead` with a blocking file descriptor, the mode is switched to non-blocking upon `connTLSSyncRead` exit. If the code assumes the file descriptor remains blocking and calls the normal `read` expecting it to block, it may result in a short read. This caused a crash in dual-channel, which was fixed in this PR by relocating `connBlock()`: #837 Signed-off-by: xbasel <[email protected]>
This PR is based on: #12109 valkey-io/valkey#60 Closes: #11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: valkey-io/valkey#837 valkey-io/valkey#1173 valkey-io/valkey#804 valkey-io/valkey#945 valkey-io/valkey#989 --------- Co-authored-by: Yuan Wang <[email protected]> Co-authored-by: debing.sun <[email protected]> Co-authored-by: Moti Cohen <[email protected]> Co-authored-by: naglera <[email protected]> Co-authored-by: Amit Nagler <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Co-authored-by: Binbin <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Ran Shidlansik <[email protected]> Co-authored-by: ranshid <[email protected]> Co-authored-by: xbasel <[email protected]>
This PR is based on: #12109 valkey-io/valkey#60 Closes: #11678 **Motivation** During a full sync, when master is delivering RDB to the replica, incoming write commands are kept in a replication buffer in order to be sent to the replica once RDB delivery is completed. If RDB delivery takes a long time, it might create memory pressure on master. Also, once a replica connection accumulates replication data which is larger than output buffer limits, master will kill replica connection. This may cause a replication failure. The main benefit of the rdb channel replication is streaming incoming commands in parallel to the RDB delivery. This approach shifts replication stream buffering to the replica and reduces load on master. We do this by opening another connection for RDB delivery. The main channel on replica will be receiving replication stream while rdb channel is receiving the RDB. This feature also helps to reduce master's main process CPU load. By opening a dedicated connection for the RDB transfer, the bgsave process has access to the new connection and it will stream RDB directly to the replicas. Before this change, due to TLS connection restriction, the bgsave process was writing RDB bytes to a pipe and the main process was forwarding it to the replica. This is no longer necessary, the main process can avoid these expensive socket read/write syscalls. It also means RDB delivery to replica will be faster as it avoids this step. In summary, replication will be faster and master's performance during full syncs will improve. **Implementation steps** 1. When replica connects to the master, it sends 'rdb-channel-repl' as part of capability exchange to let master to know replica supports rdb channel. 2. When replica lacks sufficient data for PSYNC, master sends +RDBCHANNELSYNC reply with replica's client id. As the next step, the replica opens a new connection (rdb-channel) and configures it against the master with the appropriate capabilities and requirements. It also sends given client id back to master over rdbchannel, so that master can associate these channels. (initial replica connection will be referred as main-channel) Then, replica requests fullsync using the RDB channel. 3. Prior to forking, master attaches the replica's main channel to the replication backlog to deliver replication stream starting at the snapshot end offset. 4. The master main process sends replication stream via the main channel, while the bgsave process sends the RDB directly to the replica via the rdb-channel. Replica accumulates replication stream in a local buffer, while the RDB is being loaded into the memory. 5. Once the replica completes loading the rdb, it drops the rdb channel and streams the accumulated replication stream into the db. Sync is completed. **Some details** - Currently, rdbchannel replication is supported only if `repl-diskless-sync` is enabled on master. Otherwise, replication will happen over a single connection as in before. - On replica, there is a limit to replication stream buffering. Replica uses a new config `replica-full-sync-buffer-limit` to limit number of bytes to accumulate. If it is not set, replica inherits `client-output-buffer-limit <replica>` hard limit config. If we reach this limit, replica stops accumulating. This is not a failure scenario though. Further accumulation will happen on master side. Depending on the configured limits on master, master may kill the replica connection. **API changes in INFO output:** 1. New replica state: `send_bulk_and_stream`. Indicates full sync is still in progress for this replica. It is receiving replication stream and rdb in parallel. ``` slave0:ip=127.0.0.1,port=5002,state=send_bulk_and_stream,offset=0,lag=0 ``` Replica state changes in steps: - First, replica sends psync and receives +RDBCHANNELSYNC :`state=wait_bgsave` - After replica connects with rdbchannel and delivery starts: `state=send_bulk_and_stream` - After full sync: `state=online` 2. On replica side, replication stream buffering metrics: - replica_full_sync_buffer_size: Currently accumulated replication stream data in bytes. - replica_full_sync_buffer_peak: Peak number of bytes that this instance accumulated in the lifetime of the process. ``` replica_full_sync_buffer_size:20485 replica_full_sync_buffer_peak:1048560 ``` **API changes in CLIENT LIST** In `client list` output, rdbchannel clients will have 'C' flag in addition to 'S' replica flag: ``` id=11 addr=127.0.0.1:39108 laddr=127.0.0.1:5001 fd=14 name= age=5 idle=5 flags=SC db=0 sub=0 psub=0 ssub=0 multi=-1 watch=0 qbuf=0 qbuf-free=0 argv-mem=0 multi-mem=0 rbs=1024 rbp=0 obl=0 oll=0 omem=0 tot-mem=1920 events=r cmd=psync user=default redir=-1 resp=2 lib-name= lib-ver= io-thread=0 ``` **Config changes:** - `replica-full-sync-buffer-limit`: Controls how much replication data replica can accumulate during rdbchannel replication. If it is not set, a value of 0 means replica will inherit `client-output-buffer-limit <replica>` hard limit config to limit accumulated data. - `repl-rdb-channel` config is added as a hidden config. This is mostly for testing as we need to support both rdbchannel replication and the older single connection replication (to keep compatibility with older versions and rdbchannel replication will not be enabled if repl-diskless-sync is not enabled). it affects both the master (not to respond to rdb channel requests), and the replica (not to declare capability) **Internal API changes:** Changes that were introduced to Redis replication: - New replication capability is added to replconf command: `capa rdb-channel-repl`. Indicates replica is capable of rdb channel replication. Replica sends it when it connects to master along with other capabilities. - If replica needs fullsync, master replies `+RDBCHANNELSYNC <client-id>` to the replica's PSYNC request. - When replica opens rdbchannel connection, as part of replconf command, it sends `rdb-channel 1` to let master know this is rdb channel. Also, it sends `main-ch-client-id <client-id>` as part of replconf command so master can associate channels. **Testing:** As rdbchannel replication is enabled by default, we run whole test suite with it. Though, as we need to support both rdbchannel and single connection replication, we'll be running some tests twice with `repl-rdb-channel yes/no` config. **Replica state diagram** ``` * * Replica state machine * * * Main channel state * ┌───────────────────┐ * │RECEIVE_PING_REPLY │ * └────────┬──────────┘ * │ +PONG * ┌────────▼──────────┐ * │SEND_HANDSHAKE │ RDB channel state * └────────┬──────────┘ ┌───────────────────────────────┐ * │+OK ┌───► RDB_CH_SEND_HANDSHAKE │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_AUTH_REPLY │ │ REPLCONF main-ch-client-id <clientid> * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_AUTH_REPLY │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_PORT_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_REPLCONF_REPLY│ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_IP_REPLY │ │ │ +OK * └────────┬──────────┘ │ ┌──────────────▼────────────────┐ * │+OK │ │ RDB_CH_RECEIVE_FULLRESYNC │ * ┌────────▼──────────┐ │ └──────────────┬────────────────┘ * │RECEIVE_CAPA_REPLY │ │ │+FULLRESYNC * └────────┬──────────┘ │ │Rdb delivery * │ │ ┌──────────────▼────────────────┐ * ┌────────▼──────────┐ │ │ RDB_CH_RDB_LOADING │ * │SEND_PSYNC │ │ └──────────────┬────────────────┘ * └─┬─────────────────┘ │ │ Done loading * │PSYNC (use cached-master) │ │ * ┌─▼─────────────────┐ │ │ * │RECEIVE_PSYNC_REPLY│ │ ┌────────────►│ Replica streams replication * └─┬─────────────────┘ │ │ │ buffer into memory * │ │ │ │ * │+RDBCHANNELSYNC client-id │ │ │ * ├──────┬───────────────────┘ │ │ * │ │ Main channel │ │ * │ │ accumulates repl data │ │ * │ ┌──▼────────────────┐ │ ┌───────▼───────────┐ * │ │ REPL_TRANSFER ├───────┘ │ CONNECTED │ * │ └───────────────────┘ └────▲───▲──────────┘ * │ │ │ * │ │ │ * │ +FULLRESYNC ┌───────────────────┐ │ │ * ├────────────────► REPL_TRANSFER ├────┘ │ * │ └───────────────────┘ │ * │ +CONTINUE │ * └──────────────────────────────────────────────┘ */ ``` ----- This PR also contains changes and ideas from: valkey-io/valkey#837 valkey-io/valkey#1173 valkey-io/valkey#804 valkey-io/valkey#945 valkey-io/valkey#989 --------- Co-authored-by: Yuan Wang <[email protected]> Co-authored-by: debing.sun <[email protected]> Co-authored-by: Moti Cohen <[email protected]> Co-authored-by: naglera <[email protected]> Co-authored-by: Amit Nagler <[email protected]> Co-authored-by: Madelyn Olson <[email protected]> Co-authored-by: Binbin <[email protected]> Co-authored-by: Viktor Söderqvist <[email protected]> Co-authored-by: Ping Xie <[email protected]> Co-authored-by: Ran Shidlansik <[email protected]> Co-authored-by: ranshid <[email protected]> Co-authored-by: xbasel <[email protected]>
connTLSSyncWrite()
) (@xbasel).server.repl_offset
andserver.replid
are updated correctly when dual channel synchronization completes successfully. Thist led to failures in replication tests that validate replication IDs or compare replication offsets.