After several failures, we've found that many scaling issues arise during offline recovery. The root cause is usually an incorrect graph being passed to the scale controller; logic errors in the scale controller itself are rare. We can therefore add an integrity check to catch these issues early. In case the check ever produces a false positive, we also need a switch to forcibly skip it.
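As a rough illustration, the skip switch could be as simple as an environment variable. This is only a sketch; the variable name `RW_SKIP_GRAPH_INTEGRITY_CHECK` is hypothetical, not an existing RisingWave option:

```rust
use std::env;

/// Returns true unless the (hypothetical) RW_SKIP_GRAPH_INTEGRITY_CHECK
/// variable is set, in which case the integrity check is bypassed.
fn integrity_check_enabled() -> bool {
    !matches!(
        env::var("RW_SKIP_GRAPH_INTEGRITY_CHECK").as_deref(),
        Ok("true") | Ok("1")
    )
}

fn main() {
    if integrity_check_enabled() {
        println!("running graph integrity check before offline scaling");
    } else {
        println!("integrity check skipped by RW_SKIP_GRAPH_INTEGRITY_CHECK");
    }
}
```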
We need to specifically check the following:
- When scaling offline, the cluster should not contain any inactive jobs, actors, or fragments, as these would have been cleaned up earlier.
- For each Fragment: a fragment with Single distribution should have exactly one actor, and every fragment listed in `upstream_fragment_ids` should exist.
- For each Actor: the downstream actors referenced by its Dispatchers should all exist; the upstream actors referenced by its Merger should all exist and align with the upstream dispatchers' mapping; every actor in `upstream_actor_id` should exist; and the `VnodeBitmap` should align with the Fragment's Distribution.
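Below is a minimal sketch of what such a check could look like, written against simplified stand-in types. The structs, field names, and `Distribution` enum are illustrative, not RisingWave's actual meta model, and the "no inactive jobs" check is omitted; the point is only to show the shape of the fragment- and actor-level validations listed above.

```rust
use std::collections::HashMap;

type FragmentId = u32;
type ActorId = u32;

#[derive(PartialEq)]
enum Distribution {
    Single,
    Hash,
}

struct Fragment {
    distribution: Distribution,
    actors: Vec<ActorId>,
    upstream_fragment_ids: Vec<FragmentId>,
}

struct Actor {
    /// Actors this actor dispatches to.
    dispatcher_downstream: Vec<ActorId>,
    /// Actors this actor's merger reads from.
    merger_upstream: Vec<ActorId>,
    /// Whether the actor carries a vnode bitmap (only hash-distributed actors should).
    has_vnode_bitmap: bool,
}

fn check_graph_integrity(
    fragments: &HashMap<FragmentId, Fragment>,
    actors: &HashMap<ActorId, Actor>,
) -> Result<(), String> {
    for (fid, fragment) in fragments {
        // A Single-distribution fragment must have exactly one actor.
        if fragment.distribution == Distribution::Single && fragment.actors.len() != 1 {
            return Err(format!(
                "fragment {fid} is Single but has {} actors",
                fragment.actors.len()
            ));
        }
        // Every upstream fragment must exist.
        for up in &fragment.upstream_fragment_ids {
            if !fragments.contains_key(up) {
                return Err(format!("fragment {fid} references missing upstream fragment {up}"));
            }
        }
        // Every actor of the fragment must exist, and its vnode bitmap must
        // match the fragment's distribution.
        for aid in &fragment.actors {
            let actor = actors
                .get(aid)
                .ok_or_else(|| format!("fragment {fid} references missing actor {aid}"))?;
            let expect_bitmap = fragment.distribution == Distribution::Hash;
            if actor.has_vnode_bitmap != expect_bitmap {
                return Err(format!(
                    "actor {aid} vnode bitmap does not match fragment {fid} distribution"
                ));
            }
        }
    }

    for (aid, actor) in actors {
        // Downstream actors referenced by dispatchers must exist.
        for down in &actor.dispatcher_downstream {
            if !actors.contains_key(down) {
                return Err(format!("actor {aid} dispatches to missing actor {down}"));
            }
        }
        // Upstream actors referenced by the merger must exist, and each one
        // must list this actor as a dispatch target (mapping alignment).
        for up in &actor.merger_upstream {
            match actors.get(up) {
                None => return Err(format!("actor {aid} merges from missing actor {up}")),
                Some(upstream) if !upstream.dispatcher_downstream.contains(aid) => {
                    return Err(format!(
                        "actor {aid} merges from actor {up}, which does not dispatch to it"
                    ));
                }
                _ => {}
            }
        }
    }
    Ok(())
}

fn main() {
    // Tiny usage example: one Single fragment with exactly one actor passes the check.
    let fragments = HashMap::from([(
        1,
        Fragment {
            distribution: Distribution::Single,
            actors: vec![10],
            upstream_fragment_ids: vec![],
        },
    )]);
    let actors = HashMap::from([(
        10,
        Actor {
            dispatcher_downstream: vec![],
            merger_upstream: vec![],
            has_vnode_bitmap: false,
        },
    )]);
    assert!(check_graph_integrity(&fragments, &actors).is_ok());
}
```

Returning a descriptive error for the first violation found should be enough here, since the goal is to fail fast before the broken graph reaches the scale controller rather than to repair it.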