
Add scoped RDB loading context and immediate abort flag #1173

Open · naglera wants to merge 18 commits into unstable from load-callback-crash

Conversation

@naglera (Contributor) commented Oct 15, 2024

This PR introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
RDB_SCOPED_LOADING_RIO macro allows for a scoped change of the
server.loading_rio value, ensuring that it's automatically restored
to its original value when the scope ends.

Introduces a dedicated flag in rio to signal immediate abort, preventing
potential use-after-free scenarios during replication disconnection in
dual-channel load. This ensures proper termination of rdbLoadRioWithLoadingCtx
when replication is cancelled due to connection loss on the main connection.

Fixes #1152
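[Editor's note: for context, a minimal, self-contained sketch of the scoped-restore mechanism the macro relies on, the GCC/Clang cleanup attribute. The names below are illustrative stand-ins for the PR's RDB_SCOPED_LOADING_RIO and server.loading_rio, and the macro body is an assumption based on the snippet quoted later in this thread.]

#include <stdio.h>

typedef struct rio { int dummy; } rio; /* trivial stand-in for the real rio */
static rio *loading_rio = NULL;        /* stand-in for server.loading_rio */

/* Cleanup handler: receives a pointer to the guarded variable. */
static void restore_loading_rio(rio **old) { loading_rio = *old; }

/* Saves the old value, installs the new one, and restores the old value
 * automatically on every scope exit (GCC/Clang cleanup attribute). */
#define SCOPED_LOADING_RIO(new_rio)                                       \
    __attribute__((cleanup(restore_loading_rio), unused)) rio *_old_rio = \
        loading_rio;                                                      \
    loading_rio = (new_rio)

static int load(rio *rdb, int fail_early) {
    SCOPED_LOADING_RIO(rdb);
    if (fail_early) return -1; /* loading_rio is restored here... */
    return 0;                  /* ...and here, with no explicit cleanup code */
}

int main(void) {
    rio r = {0};
    load(&r, 1);
    printf("restored after early return: %s\n", loading_rio == NULL ? "yes" : "no");
    return 0;
}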

@naglera force-pushed the load-callback-crash branch from acedb47 to 4aee158 on October 15, 2024 12:06

codecov bot commented Oct 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.84%. Comparing base (4f61034) to head (55e1eec).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1173      +/-   ##
============================================
- Coverage     70.85%   70.84%   -0.01%     
============================================
  Files           118      118              
  Lines         63591    63598       +7     
============================================
- Hits          45058    45057       -1     
- Misses        18533    18541       +8     
Files with missing lines    Coverage            Δ
src/rdb.c                   76.64% <100.00%>    -0.17% ⬇️
src/replication.c           87.41% <100.00%>    +0.11% ⬆️
src/rio.h                   100.00% <100.00%>   ø
src/server.c                87.39% <100.00%>    +<0.01% ⬆️
src/server.h                100.00% <ø>         ø

... and 10 files with indirect coverage changes

@ranshid (Member) commented Oct 15, 2024

General comment:
Although I agree this fix will work, and at first glance I see no issue with it, I would like to suggest tackling the problem from a more holistic point of view:

Basically, we would like a way to tell the current load process to stop ASAP. This can also be achieved by adding an RIO flag (e.g. #define RIO_FLAG_STOP_ASAP (1 << 2)) and having rio check this flag when it performs its various I/O operations. The only issue is that the rdb RIO is local to rdbLoadRioWithCtx. We can, however, keep a pointer in the server to the currently active loading rio, so that at any point during the load we can set RIO_FLAG_STOP_ASAP on it. IMO this would be cleaner.
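[Editor's note: a minimal sketch of this idea, assuming a flags field on the rio struct and a server-held pointer to the active loading rio; apart from RIO_FLAG_STOP_ASAP, quoted from the comment above, all names and bit positions are illustrative.]

#include <stddef.h>

#define RIO_FLAG_READ_ERROR  (1 << 0) /* assumed existing flags */
#define RIO_FLAG_WRITE_ERROR (1 << 1)
#define RIO_FLAG_STOP_ASAP   (1 << 2) /* the proposed abort bit */

typedef struct rio {
    int flags;
    /* ... read/write function pointers, checksum state, etc. ... */
} rio;

/* Hypothetical server-held pointer to the currently active loading rio. */
static rio *active_loading_rio = NULL;

/* Each rio I/O operation would check the flag before doing any work. */
static size_t rioRead(rio *r, void *buf, size_t len) {
    (void)buf;
    if (r->flags & RIO_FLAG_STOP_ASAP) return 0; /* caller treats 0 as failure */
    /* ... perform the actual read ... */
    return len;
}

/* On main-connection loss, any code path can request an immediate stop. */
static void abortOngoingLoad(void) {
    if (active_loading_rio) active_loading_rio->flags |= RIO_FLAG_STOP_ASAP;
}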

src/server.h Outdated
@@ -2038,6 +2038,7 @@ struct valkeyServer {
long long reploff;
long long read_reploff;
int dbid;
uint64_t close_asap : 1;
Member:

Can we piggyback on the existing state variables to detect when the sync has been aborted / primary connection dropped? Since cancelReplicationHandshake is called when the connection is dropped and updates:

server.repl_rdb_channel_state = REPL_DUAL_CHANNEL_STATE_NONE;

Can't we simply use server.repl_rdb_channel_state in rdbLoadProgressCallback?

Member:

This probably won't work and will create an issue when RDB dual channel isn't used.

Contributor Author:

server.repl_rdb_channel_state will be equal to REPL_DUAL_CHANNEL_STATE_NONE when dual channel is disabled as well, so it cannot distinguish the two cases.

src/rdb.c Outdated
Comment on lines 2936 to 2940

if (server.repl_provisional_primary.close_asap == 1) {
    serverLog(LL_WARNING, "Primary main connection dropped during RDB load callback");
    return -1;
}
return 0;
Member:

I guess I'm not following why we can't null out the connections here and use that instead of a new close_asap flag.

Member:

It would still require adding logic to rioConnRead/Write, right? We could flag the rdb with RIO_FLAG_READ_ERROR.

@naglera (Contributor Author) commented Oct 16, 2024:

Nulling out the fields would look the same as running a sync with dual channel disabled.

> It would still require adding logic to rioConnRead/Write, right?

Right.

Contributor Author:

> We could flag the rdb with RIO_FLAG_READ_ERROR

While using RIO flags could offer a cleaner solution, it presents its own challenges. As you mentioned, we would need to keep a pointer in the server to the currently active loading RIO, which means adding and maintaining a currently_loading_rdb field in the server struct.
That approach, while potentially more flexible for future use cases, introduces additional complexity and state management: multiple parts of the RDB load process would have to properly set, use, and clear the pointer.
The proposed solution, while more specific to this use case, has the advantage of being more localized and requires no global state management.

Member:

I suppose a third option is to remove the event handler for the replication connection before calling process events and then reinstalling it after the fact?

Member:

@madolson I am not sure I understand this proposal. AFAIK we need to process events while we are loading in order to keep feeding the local replication buffer. We could (as a third option) do nothing when we identify that the replication link is broken and complete the load (or let it disconnect as well); however, I do feel that the ability to bail out of a load is something we might find handy in the future.

Member:

This idea made sense to me when I posted it, but reading it back it doesn't; I might have just been missing something. More generally, I want to move away from doing recursive calls for processing events, and in that world we could just skip it, but that is likely a much larger change than what we want to do here.

@naglera force-pushed the load-callback-crash branch 2 times, most recently from a7aac51 to 6f9d737 on October 21, 2024 16:59
naglera and others added 5 commits October 29, 2024 11:48
…onnection handling

Introduces a dedicated flag in provisional primary struct to signal immediate
abort, preventing potential use-after-free scenarios during replication
disconnection in dual-channel load. This ensures proper termination of
rdbLoadRioWithLoadingCtx when replication is cancelled due to connection loss
on main connection.

Fixes valkey-io#1152

Signed-off-by: naglera <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
- Add test to consistently reproduce rdb load callback crash
- Avoid checking close_asap when no data was processed

Signed-off-by: naglera <[email protected]>
…ion disconnection handling"

This reverts commit b873d41.

Signed-off-by: naglera <[email protected]>
This commit introduces a new mechanism for temporarily changing the
server's loading_rio context during RDB loading operations. The new
RDB_SCOPED_LOADING_RIO macro allows for a scoped change of the
server.loading_rio value, ensuring that it's automatically restored
to its original value when the scope ends.

Signed-off-by: naglera <[email protected]>
@naglera force-pushed the load-callback-crash branch from d5e83f6 to 9849350 on October 29, 2024 11:50
@naglera changed the title from "Add ASAP abort flag to provisional primary for safer replication disconnection handling" to "Add scoped RDB loading context and immediate abort flag" on Oct 29, 2024
@naglera force-pushed the load-callback-crash branch from 6cc4f5e to eba00eb on October 29, 2024 12:10
@ranshid (Member) left a comment:

Thank you @naglera, this looks promising! I like scoped actions; I only want to make sure compiler support is not compromised.
BTW, if it is not, we can consider having a generic ScopeGuard macro in Valkey.

src/rdb.h Outdated

/* Macro to temporarily set server.loading_rio within a scope. */
#define RDB_SCOPED_LOADING_RIO(new_rio) \
__attribute__((cleanup(_restore_loading_rio))) rio *_old_rio __attribute__((unused)) = server.loading_rio; \
@ranshid (Member) commented Oct 29, 2024:

Very nice IMO. The only thing is that I do not recall our compiler support for scoped cleanup in Valkey (@zuiderkwast DYK?)... for example, is MSVC supported? If it is a problem, I guess we can just place the cleanup logic at the only 2 places where we return from the function?

Contributor Author:

You raise a valid point about compiler support. Regarding the return points: rdbLoadRioWithLoadingCtx currently has about 5 return points, and that number may grow as the function evolves. If we neither enforce a single return path nor use a scope-based variable, we risk introducing bugs in the future where the cleanup logic is missed on new return paths.

Member:

The other way to handle this is just to split the function, so we move the rest of the logic into another function that we call and can make sure it's correctly restored on return.

Member:

My ask would be that we split the discussion: let's do it manually for now and then open a separate issue about supporting this. I also think we should eventually have a general macro for this, not just one specific to server.loading_rio.
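[Editor's note: for illustration, a generic guard along these lines might look like the following sketch. It relies on the same GCC/Clang cleanup attribute, so it inherits the portability concern raised above, and every name here is hypothetical.]

#include <stdio.h>

/* Generic scope guard: runs `fn` when the guard variable leaves scope. */
typedef void (*guard_fn)(void);
static void run_guard(guard_fn *fn) {
    if (*fn) (*fn)();
}
#define GUARD_CONCAT_(a, b) a##b
#define GUARD_CONCAT(a, b) GUARD_CONCAT_(a, b)
#define SCOPE_GUARD(fn) \
    __attribute__((cleanup(run_guard), unused)) guard_fn GUARD_CONCAT(_guard_, __LINE__) = (fn)

static void report(void) { printf("cleanup ran\n"); }

static int work(int fail) {
    SCOPE_GUARD(report);
    if (fail) return -1; /* guard fires on this early return */
    return 0;            /* and on the normal return */
}

int main(void) {
    work(1);
    work(0);
    return 0;
}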

wait_for_condition 500 1000 {
[string match "*slave*,state=wait_bgsave*,type=rdb-channel*" [$primary info replication]] &&
[string match "*slave*,state=bg_transfer*,type=main-channel*" [$primary info replication]] &&
[s -1 rdb_bgsave_in_progress] eq 1
Member:

maybe we should wait to see that the sync was successful?

Contributor Author:

We use rdb-key-save-delay in this test, which intentionally slows down the RDB saving process. Due to potential context switches, the sync time can be unpredictable and might take longer than expected, which could make the test flaky.

Member:

We could reduce it to 0 and wait.

naglera and others added 2 commits October 29, 2024 16:11
Co-authored-by: ranshid <[email protected]>
Signed-off-by: Amit Nagler <[email protected]>
Signed-off-by: naglera <[email protected]>
@@ -2833,6 +2833,8 @@ int readIntoReplDataBlock(connection *conn, replDataBufBlock *data_block, size_t
    }
    if (nread == 0) {
        serverLog(LL_VERBOSE, "Provisional primary closed connection");
        /* Signal ongoing RDB load to terminate gracefully */
        if (server.loading_rio) rioCloseASAP(server.loading_rio);
Member:

Shouldn't this be invoked on line 2832 as well?

Contributor Author:

Right, in case of connection state changes.
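[Editor's note: the rioCloseASAP implementation itself is not shown in this thread. A plausible sketch, mirroring the RIO_FLAG_STOP_ASAP idea discussed earlier; the bit position and helper name are assumptions.]

/* Assumed shape of rioCloseASAP: set an abort bit that rio's read/write
 * paths check, so the ongoing rdbLoadRioWithLoadingCtx fails out instead
 * of touching connection state that is about to be freed. */
#define RIO_FLAG_CLOSE_ASAP (1 << 2) /* illustrative bit position */

typedef struct rio { int flags; /* ... */ } rio;

static inline void rioCloseASAP(rio *r) {
    r->flags |= RIO_FLAG_CLOSE_ASAP;
}

static inline int rioShouldStop(const rio *r) { /* hypothetical helper */
    return (r->flags & RIO_FLAG_CLOSE_ASAP) != 0;
}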

@ranshid (Member) left a comment:

I approve in order to indicate that this generally looks fine to me. We still need to decide on the cleanup attribute use, which I think is mostly supported, with some exceptions. (At least we would probably get a compilation error IMO.)

@madolson added the run-extra-tests label (Run extra tests on this PR: runs all tests from daily except valgrind and RESP) on Nov 15, 2024
@madolson (Member) left a comment:

Mostly looks good to me.


$primary config set repl-diskless-sync yes
$primary config set dual-channel-replication-enabled yes
$primary config set loglevel debug
Member:

Does this need to be set to debug?

Contributor Author:

It helped while writing the test. We don't need it anymore

set primary [srv 0 client]
set primary_host [srv 0 host]
set primary_port [srv 0 port]
set loglines [count_log_lines 0]
Member:

You set this value later, so it has no impact here.

@naglera (Contributor Author) commented Nov 17, 2024:

I use it between the two sets for

wait_for_log_messages 0 {"*Loading RDB produced by Valkey version*"} $loglines 1000 10

@madolson (Member) commented Nov 18, 2024:

Yes, but you set it again on line 1228. You actually set it three times.

…plica while syncing (only expect it to be eventually connected)

Signed-off-by: naglera <[email protected]>
@naglera force-pushed the load-callback-crash branch from 90cde6b to 41ea9e9 on November 26, 2024 08:27
@ranshid (Member) commented Nov 27, 2024

@madolson I reviewed and approved. However, since you were also reviewing and had some comments, I would wait for you to approve as well before we merge.

@ranshid (Member) commented Dec 2, 2024

@naglera we need to rebase and resolve the conflicts

Labels: run-extra-tests (Run extra tests on this PR; runs all tests from daily except valgrind and RESP)
Projects: Status: To be backported
Development: Successfully merging this pull request may close issue: [Test Failure] Engine crash during TLS test with dual channel replication
4 participants