Smoothing factor for upper band following 50x spike #7

iaco86 opened this issue Dec 14, 2024 · 6 comments

@iaco86 commented Dec 14, 2024

Hi all and thanks for the framework!
As I'm experimenting with it on real data, I noticed the following behavior.

On a very stable API, a big spike (e.g. a 50-100x increase in p95) immediately followed by another stable period skews the upper band for a long time.

I think this is caused by the 26h smoothing factor applied to the stddev_1h, together with the high-pass filter on anomaly:stddev_1h:filtered, which filters the stddev back out immediately after the spike.

If my understanding is correct, this can result in the framework masking anomalies for up to 26h following a big spike - see the attached screenshot.

[screenshot]

What could be a good solution to avoid such situations?

The only thing I could think of was to reduce the 26h smoothing period, but that would still hide anomalies for that amount of time.
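
For reference, my understanding is that the relevant rules look roughly like this (sketched from what I could piece together, so the exact expressions may differ):

      # my reconstruction of the filter + 26h smoothing, for context only
      - record: anomaly:stddev_1h:filtered
        expr: |-
          anomaly:stddev_1h > anomaly:avg_1h * on() group_left anomaly:threshold_by_covar
      - record: anomaly:stddev_st
        expr: |-
          avg_over_time(anomaly:stddev_1h:filtered[26h])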

Thank you!

@jcreixell (Collaborator)

Hi @iaco86, you are correct in your assessment.

The moving average acts as a smoothing function, which helps prevent some undesirable artifacts we encountered when using a raw stddev. However, it has a water-down effect on the bands for very stable metrics, causing them to take too long to expand when anomalies occur.

The high-pass filter helps reduce the water-down effect. In addition, it provides a performance optimization, as it makes calculating the moving average over 26h very cheap. However, as you have observed, for very stable metrics it won't have any data points (everything has been filtered out), so the first time an anomaly occurs it will adjust the short-term bands based on very few data points.

Paradoxically, after a large outlier, smaller anomalies (not as extreme) that are significant enough to pass the filter will actually reduce the band size, as the large spike will be averaged out with the new data points (as you can see in your example).

In practice, whether subsequent anomalies are missed or not depends on the difference in magnitude between them. In your example, another large spike is likely to trigger the alert a second time, as even with expanded bands, the bands are defined around the middle line, which is lagging behind the actual metric.

Coming back to your question, you could try playing with the threshold_by_covar constant so that the high-pass filter is less aggressive. It can be hard to get right, though, as different metrics present different patterns, and if the threshold is too low it will lead to the water-down effect described above and increased CPU usage. We decided to use a relatively conservative threshold to prevent this.

Another thing you could try is to tune the stddev_multiplier so that the bands are narrower and therefore less prone to false negatives, but it will make the alerts more sensitive overall.
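
As far as the mechanics go, both knobs are just constant series, so overriding them is a matter of redefining the corresponding rules; something like the following (the values below are made up for illustration, not the actual defaults):

      # illustration only: made-up values. A lower threshold_by_covar means a
      # less aggressive high-pass filter; a higher stddev_multiplier means
      # wider short-term bands.
      - record: anomaly:threshold_by_covar
        expr: vector(0.5)
      - record: anomaly:stddev_multiplier
        expr: vector(2)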

We are also open to new ideas, so please let us know if you find better ways to deal with this problem. We have found the current solution to be good enough for our use cases and highly efficient, but I am sure there are myriad ways in which we could improve it.

@iaco86 (Author) commented Dec 27, 2024

Thanks @jcreixell for the explanation!
I had a couple of ideas and wanted to see if you had already experimented with something like them:

  • an exponential decay function for the 26h smoothing, to let it decay based on some parameter and multiplier. In this case, though, I'm not sure how computationally expensive the metric would become, nor could I find a good formula to express it so far
  • a clamp_max on the metric spikes, maybe based on a multiplier of the metric average, to prevent the stddev from growing too high. I am afraid, though, that this would hide otherwise important information, in terms of anomalies that greatly diverge from the metric average

@jcreixell (Collaborator)

I believe clamp_max would not be possible to implement using a variable max parameter, because as far as I know the result of a query cannot be used as a scalar parameter in a function (I will double-check this).

Exponential smoothing is more interesting and something that has been in the back of my mind for a while. The idea would be to replace the 26h average function with an exponential smoothing function. Prometheus has a built-in holt_winters function (recently renamed to double_exponential_smoothing and hidden behind a feature flag in 3.x; I believe the idea is to eventually get rid of it, but if there are strong use cases maybe there is a way to challenge that). The function calculates a double exponential smoothing (taking trends into account), but it can also be used to approximate a single exponential smoothing. While the function is computationally expensive, if it is applied to the data points that pass the high-pass filter, it should be relatively fast and cheap (since most data points are filtered out).
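
Roughly, the shape I have in mind is something like this (the smoothing and trend factors are placeholders that would need tuning):

      # sketch only: replace the 26h moving average with exponential smoothing
      # applied to the filtered stddev; 0.3 / 0.01 are placeholder factors
      - record: anomaly:stddev_st
        expr: |-
          holt_winters(anomaly:stddev_1h:filtered[26h:], 0.3, 0.01)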

I am currently on parental leave and won't have time to experiment with it for a while, but if you want to give it a try I am more than happy to help with that.

@iaco86 (Author) commented Dec 29, 2024

Thanks for responding during your parental 👶 leave! :)

I had found out about the holt_winters function through Google Gemini, of all things, but I drove it crazy and it admitted it doesn't understand it very well 🙄

Anyway, I played with it a bit and came up with the following, which seems to behave OK in the case I was debugging - see screenshot.
It grows more slowly, but decays after 1h and gets filtered out.

[screenshot]

I tried applying the holt_winters function to the already filtered anomaly:stddev_1h:filtered, but that didn't have the effect I was hoping for, since the series is filtered out too quickly in the case of sudden spikes:

      - record: anomaly:stddev_1h:holt_filtered
        expr: |-
          holt_winters(anomaly:stddev_1h[26h:], 0.1, 0.01) > anomaly:avg_1h * on() group_left anomaly:threshold_by_covar
...
      - record: anomaly:upper_band
        expr: |-
          max without(prediction_type)
          (
            label_replace(
                last_over_time(anomaly:avg_1h[2m]) + last_over_time(anomaly:stddev_1h:holt_filtered[2m]) * on() group_left anomaly:stddev_multiplier,
                "prediction_type", "short_term", "", ""
            )
            # the rest is unchanged
            or
            label_replace(
                last_over_time(anomaly:avg_1h[2m]) + last_over_time(anomaly:avg_1h[2m]) * on() group_left anomaly:margin_multiplier,
                "prediction_type", "margin", "", ""
            )
            or
            last_over_time(anomaly:upper_band_lt[10m])
          )

Let me know what you think - no rush!

@iaco86 (Author) commented Jan 3, 2025

As a follow-up, it looks like using sf = 0.8, tf = 0.01 as factors for the holt_winters function allows the smoothing to react faster to changes (since a higher sf gives less weight to older data).
See the screenshot, comparing the result to the bands from the presentation.
[screenshot]
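
In other words, the same rule as above, just with the smoothing factor bumped:

      - record: anomaly:stddev_1h:holt_filtered
        expr: |-
          holt_winters(anomaly:stddev_1h[26h:], 0.8, 0.01) > anomaly:avg_1h * on() group_left anomaly:threshold_by_covar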

Another (simpler) formula that gives OK results is something along the lines of

      - record: anomaly:stddev_st
        expr: |-
          avg_over_time(anomaly:stddev_1h:filtered[26h]) * (1 - exp(-(anomaly:stddev_1h > 1)))

This allows the function to decay promptly once some stability (stddev < 1) has been reached after a peak.
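
(One caveat with this formula, if I read the PromQL semantics right: anomaly:stddev_1h > 1 is a filter rather than a boolean, so once the stddev drops back below 1 the whole product returns no samples at all, and - if I read the upper_band rule right - the short_term contribution simply drops out of the max. If an explicit zero were preferable, a sketch like the following should do it:)

      # sketch only: same decay idea, but emitting an explicit 0 once the
      # metric is stable again, instead of dropping the series entirely
      - record: anomaly:stddev_st
        expr: |-
          (
            avg_over_time(anomaly:stddev_1h:filtered[26h]) * (1 - exp(-(anomaly:stddev_1h > 1)))
          )
          or
          (anomaly:stddev_1h * 0)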

All these solutions, though, seem to be pretty tied to the range of the stddev calculation, which makes me think that (at least for the specific spike-followed-by-stability case) the result would be similar to having

      - record: anomaly:stddev_st
        expr: |-
          avg_over_time(anomaly:stddev_1h:filtered[1h])

Let me know what you think! :)

@jcreixell (Collaborator)

Hi, thank you for giving it a try! The first solution is more or less what I had in mind: minimizing the trend parameter (since it is not relevant here) and tuning the smoothing factor to an acceptable value (mostly empirically), exactly as you did. I see the point you make about it not working as expected when applied to the moving average, due to the filter.

The second one is quite interesting.

Now that we have a decay function, it would be interesting to explore how it could be combined with the current approach, for example using a weighted combination or some other function such as avg.
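
Purely as a sketch (the weights and holt_winters factors below are placeholders, and this assumes both series carry the same labels), a weighted combination could look roughly like:

      # sketch only: blend the existing 26h moving average with the
      # exponentially smoothed variant; the 0.5 weights and 0.8/0.01 factors
      # are placeholders. If either side has no samples (e.g. everything has
      # been filtered out), the sum drops out and would need an "or" fallback.
      - record: anomaly:stddev_st
        expr: |-
          0.5 * avg_over_time(anomaly:stddev_1h:filtered[26h])
            + 0.5 * holt_winters(anomaly:stddev_1h[26h:], 0.8, 0.01)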

Zooming out, the short-term band is designed to look back 24h and decide what's normal and what's not based on past observations. The filter helps attenuate large spikes when enough variability is present, but for very stable metrics, or when encountering the first spike, it will distort the band. If the spike keeps recurring, then the expanded band is accurate, but if it is a one-off, the distortion is undesirable. A decay function would help bring the band back down to acceptable values when the spike does not happen again. Nevertheless, it should not go back to 0, as the increased variability caused by the spike should have an effect for the 24h period. At the end of the day, we want to detect when a metric changes too fast, unless the pattern has been seen recently within a 24h period.

Note that if we remove the filter and the moving average altogether and just do a stddev over 26h, there will be some natural attenuation (you can test this with the demo dashboard; you can also use it to add your approaches and compare them side by side!). The catch is that sustained spikes (rather than short ones) cause much higher distortion of the bands.

I hope this helps. I don't have a lot of time to do extensive testing right now, but I am pinging @xujiaxj, @jradhakrishnan, @nandadev and @manoja1 in case they have some suggestions.

Once again thank you for helping improve this framework, and feel free to open a PR adding your ideas to the demo rules and dashboard!
