Jsonnet: fix KEDA autoscaling metric errors during rollouts #10013

duricanikolic · 2024-11-25T09:54:00Z

What this PR does

During an incident, distributor memory usage indicated that we should scale the distributors up, but the HPA held the replica count steady because KEDA was returning errors.

We traced this to a clause in the scaling metric queries, added in #7691, which ignores resource metrics that don't have all expected datapoints over the last 15 minutes. This was done to protect against situations where the critical-prometheus feeding the metrics goes down and the resource usage is artificially seen as 0, leading to an unintended scaledown.

However, if all distributor pods are restarted during a rollout, the resource metrics for the new pods won't have all expected datapoints in the last 15 minutes, and neither will those of the pods that were shut down. This leaves no valid series for the metric. Since the above PR also introduced ignoreNullValues=false, KEDA now reports an error instead of treating the lack of data as a 0. The result is a period of KEDA failures after every rollout.
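To make the failure mode concrete, here is a toy Python model of the behaviour described above. It is not KEDA's or Prometheus's code: `PodSeries`, `evaluate_metric`, `ScalerError` and the 1-minute scrape interval are hypothetical names and values chosen for illustration. Only the 15-minute window and the `ignoreNullValues` semantics come from this description.

```python
from dataclasses import dataclass

WINDOW_MINUTES = 15          # lookback window named in the PR description
SCRAPE_INTERVAL_MINUTES = 1  # assumed scrape interval, for illustration only
EXPECTED_SAMPLES = WINDOW_MINUTES // SCRAPE_INTERVAL_MINUTES


class ScalerError(Exception):
    """Stands in for the error reported when the query returns no data."""


@dataclass
class PodSeries:
    pod: str
    samples: list  # memory samples (arbitrary units) seen inside the window


def evaluate_metric(series, require_complete, ignore_null_values):
    """Sum the latest sample of every series that survives the filter."""
    if require_complete:
        # Models the clause from #7691: drop series with missing datapoints.
        series = [s for s in series if len(s.samples) >= EXPECTED_SAMPLES]
    values = [s.samples[-1] for s in series if s.samples]
    if not values:
        if ignore_null_values:
            return 0.0  # lack of data is treated as 0
        raise ScalerError("query returned no data")
    return sum(values)


if __name__ == "__main__":
    # Five minutes after a full rollout: every new pod has only 5 samples.
    # Shut-down pods are omitted; they fail the completeness check as well.
    rollout = [
        PodSeries("distributor-new-0", [2.0] * 5),
        PodSeries("distributor-new-1", [2.0] * 5),
    ]
    try:
        evaluate_metric(rollout, require_complete=True, ignore_null_values=False)
    except ScalerError as err:
        print("with completeness clause:", err)
    print("without clause:", evaluate_metric(rollout, False, False))
```

In this model the completeness clause and `ignore_null_values=False` only produce an error together: the clause empties the result set, and the null handling turns the empty result into a failure.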

In the short term, we're guarding against a potential failure (critical-prom being down) while suffering real ill effects today (unavailable scaling during rollouts), so we should revert this change and find an alternative way to mitigate the critical-prom unavailability.

We think that ignoreNullValues=false should be enough to guard against a critical-prom outage. If the data is unavailable, rather than using values of 0, KEDA should now report errors, which should halt all scaling until the datasource is back.
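The reasoning can be sketched as a simplified reconcile step. This is an illustrative model only: `reconcile`, its parameters and the replica formula (total usage divided by a per-replica target, clamped to min/max) are assumptions for the example, not the actual HPA or KEDA implementation. A query result of `None` stands for the datasource returning no data.

```python
import math


def reconcile(current_replicas, query_result, ignore_null_values,
              target_per_replica, min_replicas, max_replicas):
    """Return the replica count after one simplified scaling decision."""
    if query_result is None:
        if not ignore_null_values:
            # Error path: scaling halts and the replica count is held.
            return current_replicas
        # Null-ignoring path: missing data is read as zero usage.
        query_result = 0.0
    desired = math.ceil(query_result / target_per_replica)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Datasource outage with 10 replicas running.
    print("ignoreNullValues=false:", reconcile(10, None, False, 2.0, 3, 50))
    print("ignoreNullValues=true: ", reconcile(10, None, True, 2.0, 3, 50))
```

Under these assumptions, an outage with null values ignored drives the deployment down to its minimum, while reporting an error keeps the current replica count until data returns.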

This has already been tested on all our prod cells by @pr00se.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

Tests updated.
Documentation added.
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
about-versioning.md updated with experimental features.

Signed-off-by: Yuri Nikolic <[email protected]>

duricanikolic added the jsonnet label Nov 25, 2024

duricanikolic self-assigned this Nov 25, 2024

duricanikolic requested a review from a team as a code owner November 25, 2024 09:54

Jsonnet: fix KEDA autoscaling metric errors during rollouts

7af2e2e

Signed-off-by: Yuri Nikolic <[email protected]>

duricanikolic force-pushed the yuri/autoscaling branch from e7e1eb9 to 7af2e2e Compare November 25, 2024 10:01

pracucci approved these changes Nov 25, 2024

duricanikolic merged commit ca89adb into main Nov 25, 2024
29 checks passed

duricanikolic deleted the yuri/autoscaling branch November 25, 2024 10:42
