Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configurable cluster blacklist TTL #738

Conversation

BCathcart
Copy link
Contributor

@BCathcart BCathcart commented Jul 2, 2024

Allows cluster admins to configure the blacklist TTL as needed to allow sufficient time for CLUSTER FORGET to be executed on every node in the cluster.

Config name cluster-blacklist-ttl; unit seconds; deault 60.

src/cluster_legacy.h Outdated Show resolved Hide resolved
src/cluster_legacy.h Outdated Show resolved Hide resolved
valkey.conf Outdated Show resolved Hide resolved
@hpatro
Copy link
Collaborator

hpatro commented Jul 2, 2024

Also you would need to signoff https://github.com/valkey-io/valkey/runs/26959011274

@hpatro
Copy link
Collaborator

hpatro commented Jul 2, 2024

@zuiderkwast We discussed this briefly during the summit. PTAL.

@hpatro hpatro requested review from zuiderkwast and madolson July 2, 2024 18:51
@BCathcart BCathcart force-pushed the bcathcart-configurable-cluster-blacklist-ttl branch 2 times, most recently from c3d8a9e to a4ce40b Compare July 2, 2024 19:04
@zuiderkwast
Copy link
Contributor

Hi @BCathcart! Have you experienced this problem after we added blacklist propagation in the cluster bus?

@BCathcart
Copy link
Contributor Author

Hi @BCathcart! Have you experienced this problem after we added blacklist propagation in the cluster bus?

Hey @zuiderkwast. Not exactly. We piggy-backed some additional functionality internally on top of the blacklist, which was the real motivation for making it configurable. It sounds like it would be a nice-to-have in OSS as well but maybe the blacklist propagation addresses this enough already.

@BCathcart BCathcart force-pushed the bcathcart-configurable-cluster-blacklist-ttl branch from 3a68fe9 to 0a613f0 Compare July 2, 2024 19:40
Copy link

codecov bot commented Jul 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.20%. Comparing base (9948f07) to head (2d1bbda).
Report is 1 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #738      +/-   ##
============================================
- Coverage     70.23%   70.20%   -0.04%     
============================================
  Files           112      112              
  Lines         60602    60602              
============================================
- Hits          42566    42544      -22     
- Misses        18036    18058      +22     
Files Coverage Δ
src/cluster_legacy.c 85.71% <100.00%> (-0.07%) ⬇️
src/config.c 78.68% <ø> (ø)
src/server.h 100.00% <ø> (ø)

... and 13 files with indirect coverage changes

@zuiderkwast
Copy link
Contributor

We piggy-backed some additional functionality internally on top of the blacklist

Interesting. What was that?

If it's not a problem anyone is experiencing, I don't think we need to solve it. :)

In general, I think we should avoid adding configs. If this is actually a problem in a large cluster, could we let the value depend on parameters such as the size of the cluster and the node timeout instead? It's possible to estimate how long it takes for the blacklist to be propagated to all nodes in a cluster.

@BCathcart
Copy link
Contributor Author

Interesting. What was that?

It was a quick fix to help address a cluster message formatting mismatch between versions. It's very irrelevant to OSS so I won't get into the details.

I'm not feeling a strong need to make this configurable either now that I know about the blacklist propagation. Scaling the ttl based on cluster size and existing parameters seems reasonable to me, but I have no evidence the current 60 seconds is ever not enough time for the blacklist propagation in practice, so I don't think we need to jump into anything for now.

@hpatro @madolson Any more thoughts on this?

@hpatro
Copy link
Collaborator

hpatro commented Jul 2, 2024

@zuiderkwast In case of partition, the propagation might not arrive within a hard coded time period. I think making the parameter flexible for operator makes life easier.

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, we had the same configuration item in ours fork, so i think it is ok in my head.

In some cases, the node may not forget the other node in time (like partition or admin tools bug), and we don't it to add it back (at a configurable time).

@zuiderkwast zuiderkwast added the major-decision-pending Major decision pending by TSC team label Jul 4, 2024
madolson added a commit that referenced this pull request Jul 9, 2024
I noticed in #738 that we don't properly check ULong config boundaries
and made the change there. I'm pulling out that particular commit into
this PR since we don't know if we want to merge the configurable cluster
blacklist TTL yet.

---------

Signed-off-by: Brennan Cathcart <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
@madolson
Copy link
Member

madolson commented Jul 9, 2024

If it's not a problem anyone is experiencing, I don't think we need to solve it. :)

Oh, we've had this problem long before we were doing the piggy-backing :). I think it's especially bad in large clusters. Our internal control plane folks have asked for this feature for awhile.

It's possible to estimate how long it takes for the blacklist to be propagated to all nodes in a cluster.

This would be more ideal, but I'm not sure we'll ever be able to get it right. I suppose my opinion about configs is a bit more nuanced, which is that I am fine with lots of configs that most people will never need to touch, but are nice to have in case they need to touch them. I would also prefer to limit the number of configs on the datapath, but I really don't mind administration configs as much.

@zuiderkwast
Copy link
Contributor

zuiderkwast commented Jul 10, 2024

I'm fine with adding the config. I just wanted to know that it's useful before we add it.

Just to be clear: If a user changes the config to one hour, does CLUSTER FORGET and afterwards changes the config back to one minute, then the timestamp in the blacklist will be one hour. The user can't add the node again during this time.

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation LGTM

src/config.c Show resolved Hide resolved
valkey.conf Outdated Show resolved Hide resolved
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just added some wording updates.

@madolson
Copy link
Member

@valkey-io/core-team Not for valkey 8.0, but please vote if we want to add this config to future versions.

Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @BCathcart 👍

@hpatro
Copy link
Collaborator

hpatro commented Jul 10, 2024

@madolson we won't even add config changes to 8.0 further?

@BCathcart BCathcart force-pushed the bcathcart-configurable-cluster-blacklist-ttl branch from 2be463b to 84d4b95 Compare July 10, 2024 19:21
@BCathcart
Copy link
Contributor Author

The commits from my quick edits on github weren't automatically signed off so I had to do a little rebase. Hopefully it's good to go now.

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code LGTM.

src/config.c Outdated Show resolved Hide resolved
BCathcart and others added 5 commits July 11, 2024 17:18
Signed-off-by: Brennan Cathcart <[email protected]>
Signed-off-by: Brennan Cathcart <[email protected]>
Signed-off-by: Brennan Cathcart <[email protected]>
Update valkey.conf description

Signed-off-by: Brennan Cathcart <[email protected]>
@BCathcart BCathcart force-pushed the bcathcart-configurable-cluster-blacklist-ttl branch from 84d4b95 to 2d1bbda Compare July 11, 2024 17:19
@zuiderkwast
Copy link
Contributor

@valkey-io/core-team Not for valkey 8.0, but please vote if we want to add this config to future versions.

It seems ready to merge. I can see 5 core team members approving or upvoting. If we merge it before we cut the release branch, I see no reason why it can't be included in 8.0.

@madolson
Copy link
Member

Yeah, I guess it's before the cutoff so we can add it.

@madolson madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Jul 12, 2024
@zuiderkwast zuiderkwast merged commit 34649bd into valkey-io:unstable Jul 13, 2024
19 checks passed
@madolson madolson added the release-notes This issue should get a line item in the release notes label Jul 13, 2024
zvi-code pushed a commit to zvi-code/valkey-z that referenced this pull request Jul 14, 2024
I noticed in valkey-io#738 that we don't properly check ULong config boundaries
and made the change there. I'm pulling out that particular commit into
this PR since we don't know if we want to merge the configurable cluster
blacklist TTL yet.

---------

Signed-off-by: Brennan Cathcart <[email protected]>
Signed-off-by: Madelyn Olson <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
zvi-code pushed a commit to zvi-code/valkey-z that referenced this pull request Jul 14, 2024
Allows cluster admins to configure the blacklist TTL as needed to allow
sufficient time for `CLUSTER FORGET` to be executed on every node in the
cluster.

Config name `cluster-blacklist-ttl`; unit seconds; deault 60.

---------

Signed-off-by: Brennan Cathcart <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
major-decision-approved Major decision approved by TSC team release-notes This issue should get a line item in the release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants