After several failures, we've found that many scaling issues arise during offline recovery. The root cause is usually an incorrect graph being passed to the scale controller; logic errors in the scale controller itself are rare. We can therefore add an integrity check to catch these issues early. In case the check ever produces a false positive, we also need a switch to forcibly skip it.
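As a rough illustration, the skip switch could be as simple as an environment variable. This is only a sketch; the variable name `RW_SKIP_GRAPH_INTEGRITY_CHECK` is hypothetical, not an existing RisingWave option:

```rust
use std::env;

/// Returns true unless the (hypothetical) RW_SKIP_GRAPH_INTEGRITY_CHECK
/// variable is set, in which case the integrity check is bypassed.
fn integrity_check_enabled() -> bool {
    !matches!(
        env::var("RW_SKIP_GRAPH_INTEGRITY_CHECK").as_deref(),
        Ok("true") | Ok("1")
    )
}

fn main() {
    if integrity_check_enabled() {
        println!("running graph integrity check before offline scaling");
    } else {
        println!("integrity check skipped by RW_SKIP_GRAPH_INTEGRITY_CHECK");
    }
}
```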
We need to specifically check the following:
- When scaling offline, the cluster should not contain any inactive jobs, actors, or fragments, as these would have been cleaned up earlier.
- For each Fragment: a fragment with Single distribution should have exactly one actor, and every fragment listed in `upstream_fragment_ids` should exist.
- For each Actor: the downstream actors referenced by its Dispatchers should all exist; the upstream actors referenced by its Merger should all exist and align with the upstream dispatchers' mapping; every actor in `upstream_actor_id` should exist; and the `VnodeBitmap` should align with the Fragment's Distribution.
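Below is a minimal sketch of what such a check could look like, written against simplified stand-in types. The structs, field names, and `Distribution` enum are illustrative, not RisingWave's actual meta model, and the "no inactive jobs" check is omitted; the point is only to show the shape of the fragment- and actor-level validations listed above.

```rust
use std::collections::HashMap;

type FragmentId = u32;
type ActorId = u32;

#[derive(PartialEq)]
enum Distribution {
    Single,
    Hash,
}

struct Fragment {
    distribution: Distribution,
    actors: Vec<ActorId>,
    upstream_fragment_ids: Vec<FragmentId>,
}

struct Actor {
    /// Actors this actor dispatches to.
    dispatcher_downstream: Vec<ActorId>,
    /// Actors this actor's merger reads from.
    merger_upstream: Vec<ActorId>,
    /// Whether the actor carries a vnode bitmap (only hash-distributed actors should).
    has_vnode_bitmap: bool,
}

fn check_graph_integrity(
    fragments: &HashMap<FragmentId, Fragment>,
    actors: &HashMap<ActorId, Actor>,
) -> Result<(), String> {
    for (fid, fragment) in fragments {
        // A Single-distribution fragment must have exactly one actor.
        if fragment.distribution == Distribution::Single && fragment.actors.len() != 1 {
            return Err(format!(
                "fragment {fid} is Single but has {} actors",
                fragment.actors.len()
            ));
        }
        // Every upstream fragment must exist.
        for up in &fragment.upstream_fragment_ids {
            if !fragments.contains_key(up) {
                return Err(format!("fragment {fid} references missing upstream fragment {up}"));
            }
        }
        // Every actor of the fragment must exist, and its vnode bitmap must
        // match the fragment's distribution.
        for aid in &fragment.actors {
            let actor = actors
                .get(aid)
                .ok_or_else(|| format!("fragment {fid} references missing actor {aid}"))?;
            let expect_bitmap = fragment.distribution == Distribution::Hash;
            if actor.has_vnode_bitmap != expect_bitmap {
                return Err(format!(
                    "actor {aid} vnode bitmap does not match fragment {fid} distribution"
                ));
            }
        }
    }

    for (aid, actor) in actors {
        // Downstream actors referenced by dispatchers must exist.
        for down in &actor.dispatcher_downstream {
            if !actors.contains_key(down) {
                return Err(format!("actor {aid} dispatches to missing actor {down}"));
            }
        }
        // Upstream actors referenced by the merger must exist, and each one
        // must list this actor as a dispatch target (mapping alignment).
        for up in &actor.merger_upstream {
            match actors.get(up) {
                None => return Err(format!("actor {aid} merges from missing actor {up}")),
                Some(upstream) if !upstream.dispatcher_downstream.contains(aid) => {
                    return Err(format!(
                        "actor {aid} merges from actor {up}, which does not dispatch to it"
                    ));
                }
                _ => {}
            }
        }
    }
    Ok(())
}

fn main() {
    // Tiny usage example: one Single fragment with exactly one actor passes the check.
    let fragments = HashMap::from([(
        1,
        Fragment {
            distribution: Distribution::Single,
            actors: vec![10],
            upstream_fragment_ids: vec![],
        },
    )]);
    let actors = HashMap::from([(
        10,
        Actor {
            dispatcher_downstream: vec![],
            merger_upstream: vec![],
            has_vnode_bitmap: false,
        },
    )]);
    assert!(check_graph_integrity(&fragments, &actors).is_ok());
}
```

Returning a descriptive error for the first violation found should be enough here, since the goal is to fail fast before the broken graph reaches the scale controller rather than to repair it.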