Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jsonnet: fix KEDA autoscaling metric errors during rollouts #10013

Merged
merged 1 commit into from
Nov 25, 2024

Conversation

duricanikolic
Copy link
Contributor

@duricanikolic duricanikolic commented Nov 25, 2024

What this PR does

During an incident distributor memory usage indicated that we should scale them up, but the HPA held the replica count steady because KEDA was returning errors.

We traced this to a clause in the scaling metric queries added in #7691 which ignores resource metrics that don't have all expected datapoints over the last 15 minutes. This was done to protect against situations where the critical-prometheus feeding the metrics goes down and the resource usage is artifically seen as 0, leading to an unintended scaledown.

However, if all distributor pods are restarted during a rollout, the resource metrics for the new pods won't have all expected datapoints in the last 15 minutes, and neither will the shut down pods. This leads to no valid series for the metric, and since the above PR also introduced ingoreNullValues=false, instead of treating the lack of data as a 0, KEDA now reports an error. This leads to a period of KEDA failures after every rollout.

In the short term, we're guarding against a potential failure (critical-prom being down) but also suffering current ill effects (unavailable scaling during rollouts), so we should revert this change and find an alternate way to mitigate the critical-prom unavailability.

We think that ignoreNullValues=false should be enough to guard against a critical-prom outage. If the data is unavailable, rather than using values of 0, KEDA should now report errors, which should halt all scaling until the datasource is back.

This has already been tested on all our prod cells by @pr00se.

Which issue(s) this PR fixes or relates to

Fixes #

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@duricanikolic duricanikolic self-assigned this Nov 25, 2024
@duricanikolic duricanikolic requested a review from a team as a code owner November 25, 2024 09:54
@duricanikolic duricanikolic merged commit ca89adb into main Nov 25, 2024
29 checks passed
@duricanikolic duricanikolic deleted the yuri/autoscaling branch November 25, 2024 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants