Jsonnet: fix KEDA autoscaling metric errors during rollouts #10013
What this PR does
During an incident, distributor memory usage indicated that we should scale up, but the HPA held the replica count steady because KEDA was returning errors.

We traced this to a clause in the scaling metric queries, added in #7691, which ignores resource metrics that don't have all expected datapoints over the last 15 minutes. This was done to protect against situations where the critical-prometheus feeding the metrics goes down and resource usage is artificially seen as 0, leading to an unintended scaledown.
However, if all distributor pods are restarted during a rollout, the resource metrics for the new pods won't have all expected datapoints in the last 15 minutes, and neither will the shut-down pods. This leaves no valid series for the metric, and since #7691 also introduced `ignoreNullValues=false`, KEDA now reports an error instead of treating the lack of data as a 0. The result is a period of KEDA failures after every rollout.

In the short term, we're guarding against a potential failure (critical-prometheus being down) while also suffering current ill effects (unavailable scaling during rollouts), so we should revert the datapoint-completeness clause and find an alternate way to mitigate critical-prometheus unavailability.
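For illustration, the guard clause being reverted follows a pattern along these lines (a hedged sketch, not the exact query from the Mimir jsonnet; the metric name, container label, and sample count are assumptions):

```promql
# Keep a per-pod memory series only if it has a full 15 minutes of
# datapoints (assuming a 15s scrape interval, 60 samples over 15m).
# Pods started or stopped mid-window fail the right-hand condition,
# so the `and` drops them, which can leave no series at all after a
# rollout that replaces every pod.
max_over_time(container_memory_working_set_bytes{container="distributor"}[15m])
  and
count_over_time(container_memory_working_set_bytes{container="distributor"}[15m]) >= 60
```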
We think that `ignoreNullValues=false` on its own should be enough to guard against a critical-prometheus outage: if the data is unavailable, rather than using values of 0, KEDA now reports errors, which halts all scaling until the datasource is back.

This has already been tested on all our prod cells by @pr00se.
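For context, `ignoreNullValues` is a setting on KEDA's Prometheus scaler. A minimal sketch of a trigger using it follows (object names, query, and threshold are illustrative, not the actual output of the Mimir jsonnet):

```yaml
# Hypothetical KEDA ScaledObject for the distributor.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: distributor
spec:
  scaleTargetRef:
    name: distributor
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://critical-prometheus.monitoring.svc:9090
        query: sum(container_memory_working_set_bytes{container="distributor"})
        threshold: "25000000000"  # illustrative value
        # With ignoreNullValues set to "false", an empty query result makes
        # KEDA report an error (freezing scaling at the current replica
        # count) instead of treating the missing data as 0.
        ignoreNullValues: "false"
```

The design trade-off: errors pause scaling at the last known replica count, which is safer during a datasource outage than scaling to zero on phantom 0-valued metrics.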
Which issue(s) this PR fixes or relates to
Fixes #
Checklist
- [ ] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`
- [ ] `about-versioning.md` updated with experimental features.