Smoothing factor for upper band following 50x spike #7
Hi @iaco86, you are correct in your assessment.

The moving average acts as a smoothing function, which helps prevent some undesirable artifacts we encountered when using a raw stddev. However, it has a water-down effect on the bands for very stable metrics, causing them to take too long to expand when anomalies occur. The high pass filter helps reduce the water-down effect. In addition, it provides a performance optimization, as it makes calculating the moving average over 26h very cheap. However, as you have observed, for very stable metrics it won't have any data points (everything has been filtered out), so the first time an anomaly occurs it will adjust the short term bands based on very few data points.

Paradoxically, after a large outlier, smaller anomalies (not as extreme) that are significant enough to pass the filter will actually reduce the band size, as the large spike will be averaged out with the new data points (as you can see in your example). In practice, whether subsequent anomalies are missed or not depends on the difference in magnitude between them. In your example, another large spike is likely to trigger the alert a second time: even with expanded bands, the bands are defined around the middle line, which is lagging behind the actual metric.

Coming back to your question, you could try playing with the […]. Another thing you could try is to tune the […].

We are also open to new ideas, so please let us know if you find better ways to deal with this problem. We have found the current solution to be good enough for our use cases and highly efficient, but I am sure there are myriad ways in which we could improve it.
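The water-down effect described above can be sketched numerically. This is a hedged toy example, not the framework's actual recording rules; the threshold and sample values are made up for illustration:

```python
# Toy sketch (NOT the framework's real rules): how averaging a filtered
# short-term stddev can shrink the band after a single large spike.

def moving_avg(values):
    """Plain average of whatever data points passed the filter."""
    return sum(values) / len(values)

# Hypothetical high pass filter: only stddev samples above a threshold survive.
threshold = 5.0
stddev_samples = [0.1, 0.2, 50.0, 0.2, 0.1, 8.0]  # one huge spike, one smaller anomaly

filtered = [s for s in stddev_samples if s > threshold]
# Only the spikes survive the filter: [50.0, 8.0]
band_width = moving_avg(filtered)

# The smaller anomaly (8.0) pulls the averaged band DOWN from 50.0 to 29.0:
# new anomalous data arrived, yet the band got narrower.
print(band_width)  # 29.0
```

This mirrors the paradox described above: once a very large outlier is in the filtered window, any later anomaly that is big enough to pass the filter but smaller than the outlier drags the average (and hence the band) down.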
Thanks @jcreixell for the explanation!
I believe exponential smoothing is more interesting, and something that has been in the back of my mind for a while. The idea would be to replace the 26h average function with an exponential smoothing function. Prometheus has a built-in […].

I am currently on parental leave and won't have time to experiment with it for a while, but if you want to give it a try I am more than happy to help with that.
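As a rough illustration of the idea (a sketch only; `alpha` and the sample values are hypothetical, not settings from this framework), simple exponential smoothing lets a spike's influence decay gradually instead of carrying full weight for the entire 26h window:

```python
# Toy sketch of simple exponential smoothing (EWMA) as a possible
# replacement for a flat 26h moving average. `alpha` is a hypothetical
# tuning parameter in (0, 1]: higher alpha = faster forgetting.

def exp_smooth(values, alpha):
    """Classic EWMA recurrence: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    s = values[0]
    for x in values[1:]:
        s = alpha * x + (1 - alpha) * s
    return s

samples = [0.2, 0.2, 50.0, 0.2, 0.2, 0.2]
# With a flat average, the 50.0 spike would weigh 1/6 of the window for
# its whole lifetime; with EWMA its weight shrinks every step.
print(exp_smooth(samples, alpha=0.3))
```

The appeal over a plain moving average is that old spikes fade continuously rather than dropping out of the window all at once 26h later.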
Hi, thank you for giving it a try!

The first solution is more or less what I had in mind: minimizing the trend parameter, since it is not relevant here, and tuning the smoothing factor to an acceptable value (mostly empirically), exactly as you did. I see the point you make about it not working as expected when applied to the moving average, due to the filter.

The second one is quite interesting. Now that we have a decay function, it would be interesting to explore how it could be combined with the current approach, for example using a weighted approach or some other function like avg or similar.

Zooming out, the short term band is designed to look back 24h and decide what's normal and what's not based on past observations. The filter helps attenuate large spikes when enough variability is present, but for very stable metrics, or when encountering the first spike, it will distort the band. If the spike keeps recurring, then the expanded band is accurate, but if it is a one-off, the distortion is undesirable. A decay function would help bring the band down to acceptable values when the spike does not happen again. Nevertheless, it should not go back to 0, as the increased variability caused by the spike should have an effect for the 24h period. At the end of the day, we want to detect when a metric changes too fast, unless the pattern has been seen recently within a 24h period.

Note that if we remove the filter and moving avg altogether and just do a stddev over 26h, there will be some natural attenuation happening (you can test this with the demo dashboard; you can also use it to add your approaches and compare them side by side!). The catch is that sustained spikes (rather than short ones) cause much higher distortion of the bands.

I hope this helps. I don't have a lot of time to do extensive testing right now, but pinging @xujiaxj, @jradhakrishnan, @nandadev and @manoja1 in case they have some suggestions.
Once again, thank you for helping improve this framework, and feel free to open a PR adding your ideas to the demo rules and dashboard!
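One way the decay-with-a-floor idea discussed above could be sketched (all names and parameters here, such as `half_life_h` and `floor_fraction`, are hypothetical, not part of the framework): the band decays exponentially after a one-off spike, but never drops below a floor, so the spike's variability still counts within the 24h window.

```python
import math

# Toy sketch: combine the current band estimate with an exponentially
# decaying memory of the last big spike. A one-off spike stops dominating
# the band as it decays, but a floor keeps the band widened for a while.

def decayed_band(current_band, spike_width, hours_since_spike,
                 half_life_h=6.0, floor_fraction=0.1):
    lam = math.log(2) / half_life_h
    decayed = spike_width * math.exp(-lam * hours_since_spike)
    # Floor: the spike's variability should still have SOME effect
    # within the 24h lookback, i.e. the band must not return to ~0.
    floor = floor_fraction * spike_width
    return max(current_band, decayed, floor)

# Right after the spike the band is wide; it decays over the following
# hours, but never below the floor.
print(round(decayed_band(0.5, 50.0, 0.0), 2))   # 50.0
print(round(decayed_band(0.5, 50.0, 12.0), 2))  # 12.5
print(round(decayed_band(0.5, 50.0, 24.0), 2))  # 5.0 (floor)
```

Taking the `max` against the current band estimate is one simple form of the "weighted or avg-like combination" mentioned above; a weighted sum would be another.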
Hi all and thanks for the framework!
As I'm experimenting with it on real data, I noticed the following behavior.
On a very stable API, a big spike (e.g. a 50–100x p95 increase) immediately followed by another stable period skews the upper band for a long time.
I think this is caused by the 26h smoothing factor applied to `stddev_1h`, together with the high pass filter applied to `anomaly:stddev_1h:filtered`, which immediately filters out the stddev after the spike. If my understanding is correct, this might result in the framework masking anomalies for up to 26h following a big spike (see attached screenshot).
What could be a good solution to avoid such situations?
The only thing I could think of was to reduce the 26h smoothing period, but that would still hide anomalies for that (shorter) amount of time.
Thank you!