
Add integrity checks before the offline scale #18877

Closed
shanicky opened this issue Oct 11, 2024 · 0 comments · Fixed by #18901
@shanicky (Contributor) commented:

After multiple failures, we've found that many scaling issues arise during offline recovery, and the root cause is typically an incorrect graph being passed to the scale controller rather than a logic error in scale itself. We can therefore add an integrity check up front to prevent these issues. Additionally, to avoid being blocked by accidental false positives, we need a switch to forcibly skip this check.

We specifically need to check the following:

  1. When scaling offline, the cluster should not contain any inactive jobs/actors/fragments, as those would have been cleaned up earlier.
  2. For each fragment: a singleton fragment should have exactly one actor, and every fragment listed in upstream_fragment_ids should exist.
  3. For each actor: all downstream actors referenced by its dispatchers should exist; all upstream actors referenced by its merger should exist, and the merger must align with the upstream dispatcher mapping; every actor in upstream_actor_id should exist; and the VnodeBitmap should align with the fragment's distribution.
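To make the existence checks above concrete, here is a minimal sketch of what such an integrity pass could look like. All type names, fields, and the `force_skip` switch are hypothetical simplifications for illustration, not RisingWave's actual metadata model; the real check would operate on the meta store's fragment/actor graph and would also cover the merger/dispatcher alignment and VnodeBitmap checks, which are omitted here.

```rust
use std::collections::HashSet;

// Hypothetical, simplified graph model (illustrative names only).
#[derive(Clone, Copy, PartialEq)]
enum Distribution {
    Singleton,
    Hash,
}

struct Fragment {
    id: u32,
    distribution: Distribution,
    actor_ids: Vec<u32>,
    upstream_fragment_ids: Vec<u32>,
}

struct Actor {
    id: u32,
    // Downstream actor ids referenced by this actor's dispatchers.
    dispatcher_downstream: Vec<u32>,
    // Upstream actor ids referenced by this actor's merger.
    upstream_actor_ids: Vec<u32>,
}

/// Returns the first integrity violation found, or Ok(()) if the graph
/// passes. `force_skip` models the escape hatch for false positives.
fn check_graph_integrity(
    fragments: &[Fragment],
    actors: &[Actor],
    force_skip: bool,
) -> Result<(), String> {
    if force_skip {
        return Ok(());
    }
    let fragment_ids: HashSet<u32> = fragments.iter().map(|f| f.id).collect();
    let actor_ids: HashSet<u32> = actors.iter().map(|a| a.id).collect();

    for f in fragments {
        // Check 2: a singleton fragment must have exactly one actor.
        if f.distribution == Distribution::Singleton && f.actor_ids.len() != 1 {
            return Err(format!(
                "singleton fragment {} has {} actors",
                f.id,
                f.actor_ids.len()
            ));
        }
        // Check 2: every upstream fragment must exist.
        for up in &f.upstream_fragment_ids {
            if !fragment_ids.contains(up) {
                return Err(format!(
                    "fragment {} references missing upstream fragment {}",
                    f.id, up
                ));
            }
        }
    }

    for a in actors {
        // Check 3: every dispatcher downstream actor must exist.
        for d in &a.dispatcher_downstream {
            if !actor_ids.contains(d) {
                return Err(format!("actor {} dispatches to missing actor {}", a.id, d));
            }
        }
        // Check 3: every merger upstream actor must exist.
        for u in &a.upstream_actor_ids {
            if !actor_ids.contains(u) {
                return Err(format!("actor {} merges from missing actor {}", a.id, u));
            }
        }
    }
    Ok(())
}
```

Running the check before handing the graph to the scale controller turns a confusing mid-scale failure into an immediate, descriptive error, while `force_skip` preserves an operator override.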
@github-actions github-actions bot added this to the release-2.1 milestone Oct 11, 2024
@shanicky shanicky self-assigned this Oct 12, 2024
@shanicky shanicky linked a pull request Oct 15, 2024 that will close this issue