Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] nodes.conf can be corrupted when node is restarted before cluster shard ID stabilizes (for 7.2) #774

Open
bentotten opened this issue Jul 12, 2024 · 15 comments · Fixed by #778

Comments

@bentotten
Copy link
Contributor

bentotten commented Jul 12, 2024

Describe the bug

During cluster setup, the shard id gets established through cluster message extension data. For backwards compatibility reasons, this is delayed until it is established that the node can properly receive these extensions, leading to a propagation delay for the shard ID. When an engine crashes or restarts before the shard ID has stabilized, the config file can become corrupted, leading to failure to restart the engine.

To reproduce
Set up a cluster and then immediately restart a node. It will (flakily) fail to restart due to a corrupted nodes.conf file - either because the replicas do not agree on the shard ID, or there is a shard ID mismatch.

Expected behavior

Engine restarts successfully.

Additional information

Related to this PR - #573

@bentotten bentotten changed the title [BUG] Corrupted nodes.conf when node is restarted before cluster shard ID stabilizes [BUG] Corrupted nodes.conf when node is restarted before cluster shard ID stabilizes (for 7.2) Jul 12, 2024
@bentotten bentotten changed the title [BUG] Corrupted nodes.conf when node is restarted before cluster shard ID stabilizes (for 7.2) [BUG] nodes.conf can be corrupted when node is restarted before cluster shard ID stabilizes (for 7.2) Jul 12, 2024
@PingXie
Copy link
Member

PingXie commented Jul 12, 2024

For backwards compatibility reasons, this is delayed until it is established that the node can properly receive these extensions

I think this issue exists before the backcompat fix already but I can see this situation is exacerbated by the further delay introduced by the backcompat fix.

I think #573 is safe to be backported to 7.2. The replicaof cycle shouldn't occur on 7.2 IMO. That said, I am curious to know if this is something that you have encountered on 7.2 already or this is more of a preventive measure that you are thinking of?

@bentotten
Copy link
Contributor Author

bentotten commented Jul 12, 2024

I think we found a way to avoid most of the issue without needing to reject the shard IDs from replicas - if we have the receiving node save the extensions supported flag on MEET for the sender, I believe it will send its extensions with the response PONG and remove a lot of this delay

@bentotten
Copy link
Contributor Author

bentotten commented Jul 12, 2024

Thinking more, the proposed approach might not fix the delay for nodes met through gossip, so maybe the replica shard id rejection is still needed

Update - apparently gossip will be ok

@bentotten
Copy link
Contributor Author

That said, I am curious to know if this is something that you have encountered on 7.2 already or this is more of a preventive measure that you are thinking of?

We are seeing it in test scenarios, yes

@bentotten
Copy link
Contributor Author

Alternatively, we could send two MEETs - one with extensions attached and one without, as any unsupported packets will be thrown out by the receiver

@jdork0
Copy link

jdork0 commented Aug 9, 2024

We see this issue on upgrade from redis 7.0 to valkey 7.2 (or 8.0.0-rc1). The scenario is:

  • all nodes start at 7.0 (so no shard-id in cluster nodes file)
  • one at a time, stop 7.0 on a node, start 7.2, repeat for each node
  • if any of the 7.2 nodes restart again prior to all the nodes having reached 7.2, their nodes file is corrupt and won't start

The cluster nodes file is corrupt as each node is just assigned a random shard-id if non exists in the file and the whole cluster hasn't come up at 7.2 yet for them to stabilize.

@PingXie
Copy link
Member

PingXie commented Aug 12, 2024

Right, #573 is not merged yet. We still have some issues with replicaOf cycles that seems to be a lot more prevalent with this change. We will need to root cause that first before we can consider the backport.

@jdork0
Copy link

jdork0 commented Aug 12, 2024

I backported #573 on top of 8.0.0-rc1 and I still see cases during upgrade before all the nodes have been upgraded where the 8.0.0 nodes do not have consistent shard-ids for all replicas in their nodes.conf file.

Should I raise a new issue for the upgrade scenario?

@bentotten
Copy link
Contributor Author

@jdork0 do you still see this issue after pulling this? #778

@jdork0
Copy link

jdork0 commented Aug 30, 2024

I will try next week and get back to you.

@jdork0
Copy link

jdork0 commented Sep 3, 2024

@bentotten I do still see issues when pulling #778.

If you have a cluster running redis 7.0, where cluster nodes file has no shard-id, then shutdown all 7.0 nodes and start just a single 8.0 node, then it will start will a corrupted cluster file. Try to restart that node again and it fails.

I think the problem is the way clusterLoadConfig generates different random shard-ids for each node, even those covering the same shards, if the file doesn't contain shard-ids. I don't think this should generate a corrupt file then rely on cluster communication to fix it.

@jdork0
Copy link

jdork0 commented Sep 10, 2024

@madolson, should I open an issue for the upgrade from 7.0 corruption I described above?

PingXie pushed a commit to PingXie/valkey that referenced this issue Sep 14, 2024
…rocessing (valkey-io#778)

For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.

This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.

Partially fixes valkey-io#774

---------

Signed-off-by: Ben Totten <[email protected]>
Co-authored-by: Ben Totten <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
PingXie pushed a commit to PingXie/valkey that referenced this issue Sep 14, 2024
…rocessing (valkey-io#778)

For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.

This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.

Partially fixes valkey-io#774

---------

Signed-off-by: Ben Totten <[email protected]>
Co-authored-by: Ben Totten <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
PingXie pushed a commit to PingXie/valkey that referenced this issue Sep 14, 2024
…rocessing (valkey-io#778)

For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.

This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.

Partially fixes valkey-io#774

---------

Signed-off-by: Ben Totten <[email protected]>
Co-authored-by: Ben Totten <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
enjoy-binbin pushed a commit to enjoy-binbin/valkey that referenced this issue Sep 14, 2024
…rocessing (valkey-io#778)

For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.

This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.

Partially fixes valkey-io#774

---------

Signed-off-by: Ben Totten <[email protected]>
Co-authored-by: Ben Totten <[email protected]>
PingXie pushed a commit to PingXie/valkey that referenced this issue Sep 14, 2024
…rocessing (valkey-io#778)

For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.

This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.

Partially fixes valkey-io#774

---------

Signed-off-by: Ben Totten <[email protected]>
Co-authored-by: Ben Totten <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
PingXie pushed a commit to PingXie/valkey that referenced this issue Sep 14, 2024
…rocessing (valkey-io#778)

For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.

This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.

Partially fixes valkey-io#774

---------

Signed-off-by: Ben Totten <[email protected]>
Co-authored-by: Ben Totten <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
PingXie pushed a commit that referenced this issue Sep 15, 2024
…rocessing (#778)

For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.

This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.

Partially fixes #774

---------

Signed-off-by: Ben Totten <[email protected]>
Co-authored-by: Ben Totten <[email protected]>
Signed-off-by: Ping Xie <[email protected]>
naglera pushed a commit to naglera/placeholderkv that referenced this issue Oct 10, 2024
…rocessing (valkey-io#778)

For backwards compatibility reasons, a node will wait until it receives
a cluster message with the extensions flag before sending its own
extensions. This leads to a delay in shard ID propagation that can
corrupt nodes.conf with inaccurate shard IDs if a node is restarted
before this can stabilize.

This fixes much of that delay by immediately triggering the
extensions-supported flag during the MEET processing and attaching the
node to the link, allowing the PONG reply to contain OSS extensions.

Partially fixes valkey-io#774

---------

Signed-off-by: Ben Totten <[email protected]>
Co-authored-by: Ben Totten <[email protected]>
Signed-off-by: naglera <[email protected]>
@madolson madolson reopened this Oct 25, 2024
@bentotten
Copy link
Contributor Author

We are seeing these crashes live, re-opening issues

@stevelipinski
Copy link
Contributor

I published a fix for this to redis. Ref to our local repo commit for one way to address this:
nokia/redis-redis@cd879cc

@pieturin
Copy link
Contributor

The issue seems to come from here: https://github.com/valkey-io/valkey/blob/unstable/src/cluster_legacy.c#L686-L691
As @stevelipinski mentioned, we can trust the primary's ShardId's value in this case instead of simply exiting the process.

@stevelipinski, would you be able to open a PR with your fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants