
Add scoped RDB loading context and immediate abort flag #1173

Open · naglera wants to merge 18 commits into unstable from load-callback-crash

Conversation

@naglera (Contributor) commented Oct 15, 2024

This PR introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
RDB_SCOPED_LOADING_RIO macro allows for a scoped change of the
server.loading_rio value, ensuring that it's automatically restored
to its original value when the scope ends.

Introduces a dedicated flag in rio to signal immediate abort, preventing
potential use-after-free scenarios during replication disconnection in
dual-channel load. This ensures proper termination of rdbLoadRioWithLoadingCtx
when replication is cancelled due to connection loss on the main connection.

Fixes #1152
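[Editor's note: for context, a minimal, self-contained sketch of the scoped-restore mechanism the macro relies on, the GCC/Clang cleanup attribute. The names below are illustrative stand-ins for the PR's RDB_SCOPED_LOADING_RIO and server.loading_rio, and the macro body is an assumption based on the snippet quoted later in this thread.]

#include <stdio.h>

typedef struct rio { int dummy; } rio; /* trivial stand-in for the real rio */
static rio *loading_rio = NULL;        /* stand-in for server.loading_rio */

/* Cleanup handler: receives a pointer to the guarded variable. */
static void restore_loading_rio(rio **old) { loading_rio = *old; }

/* Saves the old value, installs the new one, and restores the old value
 * automatically on every scope exit (GCC/Clang cleanup attribute). */
#define SCOPED_LOADING_RIO(new_rio)                                       \
    __attribute__((cleanup(restore_loading_rio), unused)) rio *_old_rio = \
        loading_rio;                                                      \
    loading_rio = (new_rio)

static int load(rio *rdb, int fail_early) {
    SCOPED_LOADING_RIO(rdb);
    if (fail_early) return -1; /* loading_rio is restored here... */
    return 0;                  /* ...and here, with no explicit cleanup code */
}

int main(void) {
    rio r = {0};
    load(&r, 1);
    printf("restored after early return: %s\n", loading_rio == NULL ? "yes" : "no");
    return 0;
}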

@naglera force-pushed the load-callback-crash branch from acedb47 to 4aee158 on October 15, 2024 12:06

codecov bot commented Oct 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.84%. Comparing base (4f61034) to head (55e1eec).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1173      +/-   ##
============================================
- Coverage     70.85%   70.84%   -0.01%     
============================================
  Files           118      118              
  Lines         63591    63598       +7     
============================================
- Hits          45058    45057       -1     
- Misses        18533    18541       +8     
Files with missing lines    Coverage            Δ
src/rdb.c                   76.64% <100.00%>    -0.17% ⬇️
src/replication.c           87.41% <100.00%>    +0.11% ⬆️
src/rio.h                   100.00% <100.00%>   ø
src/server.c                87.39% <100.00%>    +<0.01% ⬆️
src/server.h                100.00% <ø>         ø

... and 10 files with indirect coverage changes

@ranshid (Member) commented Oct 15, 2024

General comment:
Although I agree this fix will work, and at first glance I see no issue with it, I would like to suggest tackling the problem from a more holistic point of view:

Basically, we would like a way to tell the current load process to stop ASAP. This can also be achieved by adding an RIO flag (e.g. #define RIO_FLAG_STOP_ASAP (1 << 2)) and having rio check this flag when it performs its various I/O operations. The only issue is that the rdb RIO is local to rdbLoadRioWithCtx. We can, however, keep a pointer in the server to the currently active loading rio, so that at any point during the load we can set RIO_FLAG_STOP_ASAP on it. IMO this would be cleaner.
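[Editor's note: a minimal sketch of this idea, assuming a flags field on the rio struct and a server-held pointer to the active loading rio; apart from RIO_FLAG_STOP_ASAP, quoted from the comment above, all names and bit positions are illustrative.]

#include <stddef.h>

#define RIO_FLAG_READ_ERROR  (1 << 0) /* assumed existing flags */
#define RIO_FLAG_WRITE_ERROR (1 << 1)
#define RIO_FLAG_STOP_ASAP   (1 << 2) /* the proposed abort bit */

typedef struct rio {
    int flags;
    /* ... read/write function pointers, checksum state, etc. ... */
} rio;

/* Hypothetical server-held pointer to the currently active loading rio. */
static rio *active_loading_rio = NULL;

/* Each rio I/O operation would check the flag before doing any work. */
static size_t rioRead(rio *r, void *buf, size_t len) {
    (void)buf;
    if (r->flags & RIO_FLAG_STOP_ASAP) return 0; /* caller treats 0 as failure */
    /* ... perform the actual read ... */
    return len;
}

/* On main-connection loss, any code path can request an immediate stop. */
static void abortOngoingLoad(void) {
    if (active_loading_rio) active_loading_rio->flags |= RIO_FLAG_STOP_ASAP;
}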

src/server.h Outdated
@@ -2038,6 +2038,7 @@ struct valkeyServer {
long long reploff;
long long read_reploff;
int dbid;
uint64_t close_asap : 1;
Member:

Can we piggyback on the existing state variables to detect when the sync has been aborted / primary connection dropped? Since cancelReplicationHandshake is called when the connection is dropped and updates:

server.repl_rdb_channel_state = REPL_DUAL_CHANNEL_STATE_NONE;

Can't we simply use server.repl_rdb_channel_state in rdbLoadProgressCallback?

Member:

This probably won't work and will create an issue when RDB dual channel isn't used.

Contributor Author:

server.repl_rdb_channel_state will be equal to REPL_DUAL_CHANNEL_STATE_NONE when dual channel is disabled as well, so it cannot distinguish the two cases.

src/rdb.c Outdated
Comment on lines 2936 to 2940

if (server.repl_provisional_primary.close_asap == 1) {
    serverLog(LL_WARNING, "Primary main connection dropped during RDB load callback");
    return -1;
}
return 0;
Member:

I guess I'm not following why we can't null out the connections here and use that instead of a new close_asap flag.

Member:

It would still require adding logic to rioConnRead/Write, right? We could flag the rdb with RIO_FLAG_READ_ERROR.

@naglera (Contributor Author) commented Oct 16, 2024:

Nulling out the fields would look the same as running a sync with dual channel disabled.

> It would still require adding logic to rioConnRead/Write, right?

Right.

Contributor Author:

> We could flag the rdb with RIO_FLAG_READ_ERROR

While using RIO flags could offer a cleaner solution, it presents its own challenges. As you mentioned, we would need to keep a pointer in the server to the currently active loading RIO, which means adding and maintaining a currently_loading_rdb field in the server struct.
That approach, while potentially more flexible for future use cases, introduces additional complexity and state management: multiple parts of the RDB load process would have to properly set, use, and clear the pointer.
The proposed solution, while more specific to this use case, has the advantage of being more localized and requires no global state management.

Member:

I suppose a third option is to remove the event handler for the replication connection before calling process events and then reinstalling it after the fact?

Member:

@madolson I am not sure I understand this proposal. AFAIK we need to process events while we are loading in order to keep feeding the local replication buffer. We could (as a third option) do nothing when we identify that the replication link is broken and complete the load (or let it disconnect as well); however, I do feel that the ability to bail out of a load is something we might find handy in the future.

Member:

This idea made sense to me when I posted it, but reading it back it doesn't; I might have just been missing something. More generally, I want to move away from doing recursive calls for processing events, and in that world we could just skip it, but that is likely a much larger change than what we want to do here.

@naglera force-pushed the load-callback-crash branch 2 times, most recently from a7aac51 to 6f9d737 on October 21, 2024 16:59
naglera and others added 5 commits October 29, 2024 11:48
…onnection handling

Introduces a dedicated flag in provisional primary struct to signal immediate
abort, preventing potential use-after-free scenarios during replication
disconnection in dual-channel load. This ensures proper termination of
rdbLoadRioWithLoadingCtx when replication is cancelled due to connection loss
on main connection.

Fixes valkey-io#1152

Signed-off-by: naglera <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
- Add test to consistently reproduce rdb load callback crash
- Avoid checking close_asap when no data was processed

Signed-off-by: naglera <[email protected]>
…ion disconnection handling"

This reverts commit b873d41.

Signed-off-by: naglera <[email protected]>
This commit introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
RDB_SCOPED_LOADING_RIO macro allows for a scoped change of the
server.loading_rio value, ensuring that it's automatically restored
to its original value when the scope ends.

Signed-off-by: naglera <[email protected]>
@naglera force-pushed the load-callback-crash branch from d5e83f6 to 9849350 on October 29, 2024 11:50
@naglera changed the title from "Add ASAP abort flag to provisional primary for safer replication disconnection handling" to "Add scoped RDB loading context and immediate abort flag" on Oct 29, 2024
@naglera force-pushed the load-callback-crash branch from 6cc4f5e to eba00eb on October 29, 2024 12:10
@ranshid (Member) left a comment:

Thank you @naglera, this looks promising! I like scoped actions; I only want to make sure compiler support is not compromised.
BTW, if it is not, we can consider having a generic ScopeGuard macro in Valkey.

src/rdb.h Outdated

/* Macro to temporarily set server.loading_rio within a scope. */
#define RDB_SCOPED_LOADING_RIO(new_rio) \
__attribute__((cleanup(_restore_loading_rio))) rio *_old_rio __attribute__((unused)) = server.loading_rio; \
@ranshid (Member) commented Oct 29, 2024:

Very nice IMO. The only thing is that I do not recall our compiler support for scoped cleanup in Valkey (@zuiderkwast DYK?)... for example, is MSVC supported? If it is a problem, I guess we can just place the cleanup logic at the only 2 places where we return from the function?

Contributor Author:

You raise a valid point about compiler support. Regarding the return points: rdbLoadRioWithLoadingCtx currently has about 5 return points, and that number may grow as the function evolves. If we neither enforce a single return path nor use a scope-based variable, we risk introducing bugs in the future where the cleanup logic is missed on new return paths.

Member:

The other way to handle this is just to split the function, so we move the rest of the logic into another function that we call and can make sure it's correctly restored on return.

Member:

My ask would be that we split the discussion: let's do it manually for now and then open a separate issue about supporting this. I also think we should eventually have a general macro for this, not just one specific to server.loading_rio.
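[Editor's note: for illustration, a generic guard along these lines might look like the following sketch. It relies on the same GCC/Clang cleanup attribute, so it inherits the portability concern raised above, and every name here is hypothetical.]

#include <stdio.h>

/* Generic scope guard: runs `fn` when the guard variable leaves scope. */
typedef void (*guard_fn)(void);
static void run_guard(guard_fn *fn) {
    if (*fn) (*fn)();
}
#define GUARD_CONCAT_(a, b) a##b
#define GUARD_CONCAT(a, b) GUARD_CONCAT_(a, b)
#define SCOPE_GUARD(fn) \
    __attribute__((cleanup(run_guard), unused)) guard_fn GUARD_CONCAT(_guard_, __LINE__) = (fn)

static void report(void) { printf("cleanup ran\n"); }

static int work(int fail) {
    SCOPE_GUARD(report);
    if (fail) return -1; /* guard fires on this early return */
    return 0;            /* and on the normal return */
}

int main(void) {
    work(1);
    work(0);
    return 0;
}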

wait_for_condition 500 1000 {
[string match "*slave*,state=wait_bgsave*,type=rdb-channel*" [$primary info replication]] &&
[string match "*slave*,state=bg_transfer*,type=main-channel*" [$primary info replication]] &&
[s -1 rdb_bgsave_in_progress] eq 1
Member:

maybe we should wait to see that the sync was successful?

Contributor Author:

We use rdb-key-save-delay in this test, which intentionally slows down the RDB saving process. Due to potential context switches, the sync time can be unpredictable and might take longer than expected, which could make the test flaky.

Member:

We could reduce it to 0 and wait.

naglera and others added 2 commits October 29, 2024 16:11
Co-authored-by: ranshid <[email protected]>
Signed-off-by: Amit Nagler <[email protected]>
Signed-off-by: naglera <[email protected]>
@@ -2833,6 +2833,8 @@ int readIntoReplDataBlock(connection *conn, replDataBufBlock *data_block, size_t
    }
    if (nread == 0) {
        serverLog(LL_VERBOSE, "Provisional primary closed connection");
        /* Signal ongoing RDB load to terminate gracefully */
        if (server.loading_rio) rioCloseASAP(server.loading_rio);
Member:

Shouldn't this be invoked on line 2832 as well?

Contributor Author:

Right, in case of connection state changes.
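[Editor's note: the rioCloseASAP implementation itself is not shown in this thread. A plausible sketch, mirroring the RIO_FLAG_STOP_ASAP idea discussed earlier; the bit position and helper name are assumptions.]

/* Assumed shape of rioCloseASAP: set an abort bit that rio's read/write
 * paths check, so the ongoing rdbLoadRioWithLoadingCtx fails out instead
 * of touching connection state that is about to be freed. */
#define RIO_FLAG_CLOSE_ASAP (1 << 2) /* illustrative bit position */

typedef struct rio { int flags; /* ... */ } rio;

static inline void rioCloseASAP(rio *r) {
    r->flags |= RIO_FLAG_CLOSE_ASAP;
}

static inline int rioShouldStop(const rio *r) { /* hypothetical helper */
    return (r->flags & RIO_FLAG_CLOSE_ASAP) != 0;
}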

@ranshid (Member) left a comment:

I approve in order to indicate that this generally looks fine to me. We still need to decide on the cleanup attribute use, which I think is mostly supported, with some exceptions. (At least we would probably get a compilation error IMO.)

@madolson added the run-extra-tests label (Run extra tests on this PR: runs all tests from daily except valgrind and RESP) on Nov 15, 2024
@madolson (Member) left a comment:

Mostly looks good to me.


$primary config set repl-diskless-sync yes
$primary config set dual-channel-replication-enabled yes
$primary config set loglevel debug
Member:

Does this need to be set to debug?

Contributor Author:

It helped while writing the test. We don't need it anymore

set primary [srv 0 client]
set primary_host [srv 0 host]
set primary_port [srv 0 port]
set loglines [count_log_lines 0]
Member:

You set this value later, so it has no impact here.

@naglera (Contributor Author) commented Nov 17, 2024:

I use it between the two sets for

wait_for_log_messages 0 {"*Loading RDB produced by Valkey version*"} $loglines 1000 10

@madolson (Member) commented Nov 18, 2024:

Yes, but you set it again on line 1228. You actually set it three times.

…plica while syncing (only expect it to be eventually connected)

Signed-off-by: naglera <[email protected]>
@naglera force-pushed the load-callback-crash branch from 90cde6b to 41ea9e9 on November 26, 2024 08:27
@ranshid (Member) commented Nov 27, 2024

@madolson I reviewed and approved. However, since you were also reviewing and had some comments, I would wait for you to approve as well before we merge.

@ranshid (Member) commented Dec 2, 2024

@naglera we need to rebase and resolve the conflicts

Labels: run-extra-tests (Run extra tests on this PR; runs all tests from daily except valgrind and RESP)
Projects: Status: To be backported
Development: Successfully merging this pull request may close issue: [Test Failure] Engine crash during TLS test with dual channel replication
4 participants