[BUG] nodes.conf can be corrupted when node is restarted before cluster shard ID stabilizes (for 7.2) #774

bentotten · 2024-07-12T03:02:38Z

Describe the bug

During cluster setup, the shard id gets established through cluster message extension data. For backwards compatibility reasons, this is delayed until it is established that the node can properly receive these extensions, leading to a propagation delay for the shard ID. When an engine crashes or restarts before the shard ID has stabilized, the config file can become corrupted, leading to failure to restart the engine.

To reproduce
Set up a cluster and then immediately restart a node. It will (flakily) fail to restart due to a corrupted nodes.conf file - either because the replicas do not agree on the shard ID, or there is a shard ID mismatch.
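As a rough illustration of the failure described above, here is a minimal sketch of a consistency check that flags replicas whose saved shard ID differs from their primary's. The `NodeRecord` type and `find_shard_id_mismatches` function are hypothetical names for this example; this is not the real nodes.conf format or the actual load-time check.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class NodeRecord:
    """Hypothetical, simplified view of one saved node entry."""
    node_id: str
    primary_id: Optional[str]  # None means this node is a primary
    shard_id: str


def find_shard_id_mismatches(records: List[NodeRecord]) -> List[Tuple[str, str]]:
    """Return (replica_id, primary_id) pairs whose shard IDs disagree."""
    by_id = {r.node_id: r for r in records}
    mismatches = []
    for r in records:
        if r.primary_id is None:
            continue  # primaries have nothing to compare against
        primary = by_id.get(r.primary_id)
        if primary is not None and primary.shard_id != r.shard_id:
            mismatches.append((r.node_id, primary.node_id))
    return mismatches


records = [
    NodeRecord("p1", None, "shard-a"),
    NodeRecord("r1", "p1", "shard-a"),  # agrees with its primary
    NodeRecord("r2", "p1", "shard-b"),  # saved before the shard ID stabilized
]
print(find_shard_id_mismatches(records))
```

In this toy model, any non-empty result corresponds to the state that makes the restart fail.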

Expected behavior

Engine restarts successfully.

Additional information

Related to this PR - #573

PingXie · 2024-07-12T07:19:27Z

> For backwards compatibility reasons, this is delayed until it is established that the node can properly receive these extensions

I think this issue already existed before the backcompat fix, but I can see that the situation is exacerbated by the further delay the backcompat fix introduced.

I think #573 is safe to be backported to 7.2. The replicaof cycle shouldn't occur on 7.2 IMO. That said, I am curious to know if this is something that you have encountered on 7.2 already or this is more of a preventive measure that you are thinking of?

bentotten · 2024-07-12T12:45:03Z

I think we found a way to avoid most of the issue without needing to reject the shard IDs from replicas. If the receiving node saves the extensions-supported flag for the sender when it processes a MEET, I believe it will send its extensions with the PONG response, which removes much of this delay.
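To make the effect of saving the flag on MEET concrete, here is a toy model of the exchange between two nodes. It is a sketch under stated assumptions, not the real cluster bus: the `Message` and `Node` types, the `save_flag_on_meet` switch, and the rule that a receiver only records the flag for a sender it already tracks are all simplifications invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    kind: str                # "MEET", "PING" or "PONG"
    sender: str
    supports_ext: bool       # header flag: the sender understands extensions
    shard_id: Optional[str]  # extension payload, attached only when allowed


class Node:
    def __init__(self, name, shard_id, save_flag_on_meet):
        self.name = name
        self.shard_id = shard_id
        self.save_flag_on_meet = save_flag_on_meet
        self.ext_ok = {}   # peer -> True once we believe it accepts extensions
        self.learned = {}  # peer -> shard ID received through an extension

    def build(self, kind, peer):
        # Attach the shard ID only if the peer is known to accept extensions.
        payload = self.shard_id if self.ext_ok.get(peer, False) else None
        return Message(kind, self.name, True, payload)

    def receive(self, msg):
        first_contact = msg.sender not in self.ext_ok
        if first_contact:
            self.ext_ok[msg.sender] = False
        # Assumed rule: the flag is recorded for already-tracked senders, or
        # on first contact only when the proposed MEET handling is enabled.
        if not first_contact or (msg.kind == "MEET" and self.save_flag_on_meet):
            self.ext_ok[msg.sender] = msg.supports_ext
        if msg.shard_id is not None:
            self.learned[msg.sender] = msg.shard_id


def round_trips_until_learned(save_flag_on_meet, limit=10):
    """Count request/PONG round trips until A learns B's shard ID."""
    a = Node("A", "shard-a", save_flag_on_meet)
    b = Node("B", "shard-b", save_flag_on_meet)
    a.ext_ok["B"] = False  # A initiates the MEET, so it already tracks B
    kind = "MEET"
    for trip in range(1, limit + 1):
        b.receive(a.build(kind, "B"))
        a.receive(b.build("PONG", "A"))
        if "B" in a.learned:
            return trip
        kind = "PING"
    return None


print(round_trips_until_learned(save_flag_on_meet=False))
print(round_trips_until_learned(save_flag_on_meet=True))
```

In this model the shard ID arrives one round trip earlier when the flag is saved during MEET processing. In a real cluster the gap is measured in ping intervals rather than loop iterations, so the numbers are only indicative.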

bentotten · 2024-07-12T13:21:41Z

Thinking more, the proposed approach might not fix the delay for nodes met through gossip, so maybe the replica shard id rejection is still needed

Update - apparently gossip will be ok

bentotten · 2024-07-15T14:48:27Z

> That said, I am curious to know if this is something that you have encountered on 7.2 already or this is more of a preventive measure that you are thinking of?

We are seeing it in test scenarios, yes

bentotten · 2024-07-16T17:56:18Z

Alternatively, we could send two MEETs - one with extensions attached and one without, as any unsupported packets will be thrown out by the receiver
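A minimal sketch of the two-MEET idea, assuming (as the comment states) that a receiver simply discards a packet it cannot handle. The dictionaries and the `accepted_meets` function are hypothetical and do not reflect the real message layout.

```python
def accepted_meets(receiver_supports_ext):
    """Return which of the two MEETs a receiver would accept in this toy model."""
    meets = [
        {"kind": "MEET", "extensions": None},
        {"kind": "MEET", "extensions": {"shard_id": "shard-a"}},
    ]
    accepted = []
    for meet in meets:
        if meet["extensions"] is not None and not receiver_supports_ext:
            continue  # assumed: an older receiver throws this packet out
        accepted.append("with-ext" if meet["extensions"] else "plain")
    return accepted


print(accepted_meets(receiver_supports_ext=False))
print(accepted_meets(receiver_supports_ext=True))
```

Either kind of receiver accepts at least one MEET, and a receiver that supports extensions sees the shard ID in the first exchange, at the cost of one redundant packet.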

jdork0 · 2024-08-09T17:27:51Z

We see this issue on upgrade from Redis 7.0 to Valkey 7.2 (or 8.0.0-rc1). The scenario is:

1. All nodes start at 7.0 (so there is no shard-id in the cluster nodes file).
2. One at a time, stop 7.0 on a node and start 7.2; repeat for each node.
3. If any of the 7.2 nodes restarts again before all the nodes have reached 7.2, its nodes file is corrupt and it won't start.

The cluster nodes file is corrupt because each node is assigned a random shard-id if none exists in the file, and the whole cluster hasn't yet come up at 7.2 for the shard-ids to stabilize.
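The effect described above can be sketched as follows. This is a toy model, not the real loader: `assign_missing_shard_ids` and `random_shard_id` are hypothetical names, and the 40-hex-character ID is an assumption about the format.

```python
import secrets


def random_shard_id():
    # Assumed format: 40 hexadecimal characters.
    return secrets.token_hex(20)


def assign_missing_shard_ids(shard_ids, new_id=random_shard_id):
    """Give every node without a saved shard ID its own fresh one.

    shard_ids maps node ID -> saved shard ID, or None when the file has none.
    """
    return {
        node: (sid if sid is not None else new_id())
        for node, sid in shard_ids.items()
    }


# A file written by 7.0 carries no shard IDs at all.
loaded = {"p1": None, "r1": None}
primary_of = {"r1": "p1"}

assigned = assign_missing_shard_ids(loaded)
for replica, primary in primary_of.items():
    print(replica, "matches its primary:", assigned[replica] == assigned[primary])
```

Because each node draws its ID independently, a primary and its replica end up in different shards until cluster messages correct them. If the file is saved and the node restarts before that happens, the mismatch is what gets loaded.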

PingXie · 2024-08-12T03:26:24Z

Right, #573 is not merged yet. We still have some issues with replicaOf cycles that seem to be a lot more prevalent with this change. We will need to root-cause that first before we can consider the backport.

jdork0 · 2024-08-12T12:47:26Z

I backported #573 on top of 8.0.0-rc1 and I still see cases during upgrade before all the nodes have been upgraded where the 8.0.0 nodes do not have consistent shard-ids for all replicas in their nodes.conf file.

Should I raise a new issue for the upgrade scenario?

bentotten · 2024-08-29T22:20:44Z

@jdork0 do you still see this issue after pulling this? #778

jdork0 · 2024-08-30T14:18:04Z

I will try next week and get back to you.

jdork0 · 2024-09-03T12:06:00Z

@bentotten I do still see issues when pulling #778.

If you have a cluster running Redis 7.0, where the cluster nodes file has no shard-id, then shut down all 7.0 nodes and start just a single 8.0 node, it will start with a corrupted cluster file. Try to restart that node again and it fails.

I think the problem is the way clusterLoadConfig generates different random shard-ids for each node, even those covering the same shards, if the file doesn't contain shard-ids. I don't think this should generate a corrupt file then rely on cluster communication to fix it.
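One way to picture the alternative suggested above is to generate a single shard ID per primary at load time and let replicas inherit it. The sketch below is only an illustration of that idea; `assign_shard_ids_by_primary` is a hypothetical function, not how clusterLoadConfig works, and it does not handle chained replicas.

```python
import secrets


def assign_shard_ids_by_primary(primary_of, existing, new_id=lambda: secrets.token_hex(20)):
    """Assign shard IDs so that every replica shares its primary's ID.

    primary_of maps node ID -> primary node ID (None for primaries).
    existing maps node ID -> shard ID already present in the file, if any.
    """
    assigned = {}
    # First pass: primaries keep a saved ID or receive one fresh ID each.
    for node, primary in primary_of.items():
        if primary is None:
            assigned[node] = existing.get(node) or new_id()
    # Second pass: replicas copy their primary's ID; if the primary is not
    # listed, fall back to the replica's saved ID or a fresh one.
    for node, primary in primary_of.items():
        if primary is not None:
            assigned[node] = assigned.get(primary) or existing.get(node) or new_id()
    return assigned


topology = {"r1": "p1", "p1": None, "r2": "p1", "p2": None}
result = assign_shard_ids_by_primary(topology, existing={})
print(result["p1"] == result["r1"] == result["r2"])
print(result["p1"] != result["p2"])
```

With this approach the file is self-consistent as soon as it is loaded, so a restart before the cluster converges no longer depends on cluster communication to repair it.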

jdork0 · 2024-09-10T11:29:53Z

@madolson, should I open an issue for the upgrade from 7.0 corruption I described above?

…rocessing (valkey-io#778) For backwards compatibility reasons, a node will wait until it receives a cluster message with the extensions flag before sending its own extensions. This leads to a delay in shard ID propagation that can corrupt nodes.conf with inaccurate shard IDs if a node is restarted before this can stabilize. This fixes much of that delay by immediately triggering the extensions-supported flag during the MEET processing and attaching the node to the link, allowing the PONG reply to contain OSS extensions. Partially fixes valkey-io#774 --------- Signed-off-by: Ben Totten <[email protected]> Co-authored-by: Ben Totten <[email protected]> Signed-off-by: Ping Xie <[email protected]>


bentotten · 2024-10-25T20:24:04Z

We are seeing these crashes live, so I am re-opening the issue.

stevelipinski · 2024-10-25T20:36:28Z

I published a fix for this to Redis. See this commit in our local repo for one way to address it:
nokia/redis-redis@cd879cc

pieturin · 2024-10-31T19:02:36Z

The issue seems to come from here: https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L686-L691
As @stevelipinski mentioned, we can trust the primary's ShardId's value in this case instead of simply exiting the process.
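A minimal sketch of the difference between exiting on a mismatch and trusting the primary's value. The record layout and the `reconcile_shard_ids` function are hypothetical; this is not the code at the linked lines, only an illustration of the two behaviours discussed here.

```python
def reconcile_shard_ids(records, strict):
    """Resolve replica/primary shard ID conflicts in a list of node records.

    Each record is a dict with "node_id", "primary_id" (None for primaries)
    and "shard_id". With strict=True a conflict raises, which models the
    process exiting; otherwise the replica adopts its primary's shard ID.
    Returns a list of (node_id, old_shard_id, new_shard_id) corrections.
    """
    by_id = {r["node_id"]: r for r in records}
    corrections = []
    for r in records:
        primary = by_id.get(r["primary_id"])
        if primary is None or primary["shard_id"] == r["shard_id"]:
            continue  # a primary, an unknown primary, or already consistent
        if strict:
            raise ValueError(
                "shard ID mismatch between %s and %s" % (r["node_id"], primary["node_id"])
            )
        corrections.append((r["node_id"], r["shard_id"], primary["shard_id"]))
        r["shard_id"] = primary["shard_id"]
    return corrections


records = [
    {"node_id": "p1", "primary_id": None, "shard_id": "shard-a"},
    {"node_id": "r1", "primary_id": "p1", "shard_id": "shard-b"},
]
print(reconcile_shard_ids(records, strict=False))
```

The lenient path lets the node start and rewrites the replica's entry, while the strict path reproduces the restart failure reported in this issue.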

@stevelipinski, would you be able to open a PR with your fix?

bentotten changed the title ~~[BUG] Corrupted nodes.conf when node is restarted before cluster shard ID stabilizes~~ [BUG] Corrupted nodes.conf when node is restarted before cluster shard ID stabilizes (for 7.2) Jul 12, 2024

bentotten changed the title ~~[BUG] Corrupted nodes.conf when node is restarted before cluster shard ID stabilizes (for 7.2)~~ [BUG] nodes.conf can be corrupted when node is restarted before cluster shard ID stabilizes (for 7.2) Jul 12, 2024

bentotten mentioned this issue Jul 12, 2024

For MEETs, save the extensions support flag immediately during MEET processing #778

Merged

madolson closed this as completed in #778 Sep 10, 2024

madolson closed this as completed in affbea5 Sep 10, 2024

madolson reopened this Oct 25, 2024
