Opensearch 2.16.0 breaks alerts #20119

Closed
fjl82 opened this issue Aug 8, 2024 · 15 comments

@fjl82

fjl82 commented Aug 8, 2024

Last night, OpenSearch was upgraded from 2.15.0 to 2.16.0. Nothing else was changed. After this, alerts started coming in with a high message count but no messages listed in the email (normally max 3 are included). Using the search replay shows no messages either. It seems to apply to all configured alerts. Normal message searches still work fine.
Trying to downgrade OpenSearch fails. On startup it stops with an error:
java.lang.IllegalStateException: cannot downgrade a node from version [2.16.0] to version [2.15.0]
If you need me to check anything, or need more info, let me know.

Expected Behavior

Alerts should behave as before on 2.15.0.

Current Behavior

Alerts keep triggering with an ever-rising message count.

Possible Solution

Support OpenSearch 2.16.0

Steps to Reproduce (for bugs)

  1. Upgrade an existing installation with OpenSearch 2.15.0 to 2.16.0

Context

Alerts are currently unusable. This is the feature in Graylog we use most (alerting us to application issues).

Your Environment

  • Graylog Version: 6.0.5
  • Java Version: 17.0.12+7-1ubuntu2~22.04
  • OpenSearch Version: 2.16.0
  • MongoDB Version: 6.0.16
  • Operating System: Ubuntu 22.04
  • Browser version: n/a
@fjl82 fjl82 added the bug label Aug 8, 2024
@drewmiranda-gl
Member

Greetings! Can you expand a bit on this?

alerts started coming in with a high message count but no messages listed in the email (normally max 3 are included)

Can you clarify what "high message count" means? Does this mean the search query for the event definition returned an unusually high number of messages? Or is this specifically about the event notification? Can you provide a screenshot to help clarify?

Using the search replay shows no messages either

Can you provide a screenshot? To clarify, is this when you click the replay search URL in the notification, or via the alerts screen in Graylog?

Trying to downgrade OpenSearch fails

This is expected, unfortunately. Typically, new OpenSearch versions introduce a new Lucene version. OpenSearch 2.16 updated Lucene from 9.10 to 9.11.1. Unfortunately, it is impossible to downgrade the Lucene version of an OpenSearch cluster.

Can you share your server.log, an excerpt of it, or any applicable error messages?

Thanks!

@fjl82
Author

fjl82 commented Aug 9, 2024

Hi!
With "high message count" I mean this:
image
The search query seemed to return nothing, but the count is high even though it should actually be zero, and it therefore triggers an alert (via email). The alert is configured to check every minute and alert every hour.
Search replay:
image
image
This is also reflected in the emails. Normally there are the first 3 messages included, but this is empty for these "false" emails.

The server.log does not seem to have much useful information regarding this, as far as I can tell. Just a lot of lines like:
2024-08-09T06:56:08.922+02:00 WARN [PivotAggregationSearch] Removing non-existing streams <[582b0b0561432f0403f4d8d0]> from event definition <5f4612866ece0c18ab09e6e4>/<...>
Not sure what they mean, but it sounds like some legacy events that no one cares about anymore and are probably broken due to invalid configuration. And this is only for a subset of the events, not all of them. The ones I care about do not have such log messages.

[edit:] Oh yeah, forgot to mention, the streams themselves still work fine, and viewing them inside Graylog as well. If I open them from the Streams overview I can view messages like usual. There were a few real events logged yesterday. This problem only applies to alerting.

[edit2:] I deleted the old unused alerts and streams and the server.log is now silent.

@drewmiranda-gl
Member

the streams themselves still work fine, and viewing them inside Graylog as well. If I open them from the Streams overview I can view messages like usual

Can you confirm the stream IDs match between the id in server.log and the id in the browser address bar for that stream?
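
For anyone who wants to do that comparison without reading the log by eye, here is a minimal sketch that pulls the stream IDs and event definition IDs out of the "Removing non-existing streams" warnings. The pattern is modelled only on the single log line quoted in this thread; the comma-separated form for several stream IDs and the function name `missing_streams` are assumptions, not anything taken from Graylog.

```python
import re

# Pattern modelled on the one WARN line quoted in this thread.
# Assumption: several stream IDs would appear comma-separated inside <[...]>.
LINE_RE = re.compile(
    r"Removing non-existing streams <\[(?P<streams>[^\]]*)\]> "
    r"from event definition <(?P<definition>[0-9a-f]+)>"
)


def missing_streams(log_lines):
    """Map each event definition ID to the stream IDs reported as missing."""
    result = {}
    for line in log_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not one of the warnings we are looking for
        ids = {s.strip() for s in match.group("streams").split(",") if s.strip()}
        result.setdefault(match.group("definition"), set()).update(ids)
    return result


# The log line quoted earlier in this thread, used as sample input.
sample = [
    "2024-08-09T06:56:08.922+02:00 WARN [PivotAggregationSearch] "
    "Removing non-existing streams <[582b0b0561432f0403f4d8d0]> "
    "from event definition <5f4612866ece0c18ab09e6e4>/<...>",
]
print(missing_streams(sample))
```

The stream IDs in the result can then be compared with the ID shown in the browser address bar for each stream.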

I deleted the old unused alerts and streams and the server.log is now silent.

Good to hear. Are you confident OpenSearch 2.16.0 is working as expected and Graylog is working as expected?

@dhedberg

dhedberg commented Aug 9, 2024

I think we're seeing the same issue after an upgrade. I tried a quick grep for "Removing non-existing" in server.log, but didn't see anything.

For example, one alert has a condition of "count > 600", and a replay of the search shows a count of 14. But the alert is triggered due to the claimed count of "950013".

Some filter that's not set correctly anymore?

We seem to have upgraded both OpenSearch from 2.14.0 to 2.16.0 and Graylog from 6.0.2 to 6.0.5 today.

@fjl82
Author

fjl82 commented Aug 9, 2024

the streams themselves still work fine, and viewing them inside Graylog as well. If I open them from the Streams overview I can view messages like usual

Can you confirm the stream IDs match between the id in server.log and the id in the browser address bar for that stream?

I deleted the old unused alerts and streams and the server.log is now silent.

Good to hear. Are you confident OpenSearch 2.16.0 is working as expected and Graylog is working as expected?

I think you misunderstood. The lines in the server.log were related to old streams I was no longer interested in. So I deleted those streams, and those specific lines in server.log have stopped. But those had no relation to the problem that I was seeing. I posted those lines because you asked for things in server.log, not because they had a connection to the issue. Those "removing non-existing" lines were caused by an alert that referred to a stream that had already been deleted.

What @dhedberg is saying sounds the same as the problem we're having. What I didn't mention before is that I also had this problem on Graylog 6.0.4. This version was running on the day I found the issue. I upgraded to 6.0.5 to see if it would resolve the problem, but it didn't.

@drewmiranda-gl
Member

@fjl82 thank you for clarifying.

Do you feel comfortable sharing the event definition that is causing this? If possible, by exporting it to a content pack? My goal is to try to understand how to reproduce the issue.

If I may: a summary of the issue is that your event criteria are not behaving as expected, and the resulting query returns much more data than you expect (you expect 0). The replay search (or running the search query directly) shows different results than the event.

Is this correct?

@dhedberg

Without having made any effort to understand the code and queries involved, I took a quick look at the OpenSearch issue tracker.

Might opensearch-project/OpenSearch#15169 be related? Just based on the fact that it apparently broke in 2.16.0 and involves a query being ignored.

@kingzacko1
Contributor

@dhedberg potentially, yes. We were looking into that very issue to see if it could be the root cause here, but we were unable to recreate the issue in our test environments running OpenSearch 2.16.0. We were hoping to get more information about any event definitions that were causing the problem so we can reliably reproduce the issue and figure out if it is on our end or due to that (or another) OpenSearch issue.

@fjl82
Author

fjl82 commented Aug 12, 2024

All our events are causing this. I don't know anything about content packs (never used or created them), but our event definition looks like this:
image
It is using a stream with very simple rules:
image

I've read that OpenSearch issue report and it does sound like it can cause this problem. So it might not be a Graylog issue after all; it's likely that there's nothing to fix on the Graylog side. Maybe it's just a simple OpenSearch bug that will be fixed soon in a minor update. I just wish they would make it easy to roll back to a previous version; the current version lock is very problematic.

@dennisoelkers
Member

dennisoelkers commented Aug 12, 2024

I can confirm that the OS queries we are generating in alerting do not return proper results against 2.16.0. All filters are ignored due to the usage of the date_range aggregation. This aggregation is not used in search, i.e. when the generated event is replayed, which explains why results are as expected then.
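
To illustrate the difference described above, here is a rough sketch of the two request shapes: a plain search whose count comes from `hits.total`, and an aggregation request whose count comes from a `date_range` bucket. This is not the query Graylog actually generates; the field names `streams` and `timestamp`, the aggregation name `window`, and the helper names are assumptions made for the example. The fake responses reuse the numbers reported earlier in this thread (14 on replay, 950013 in the alert).

```python
def stream_window_filter(stream_id, start, end):
    """Bool query restricting results to one stream and one time window."""
    return {
        "bool": {
            "filter": [
                {"term": {"streams": stream_id}},
                {"range": {"timestamp": {"gte": start, "lt": end}}},
            ]
        }
    }


def search_style_body(stream_id, start, end):
    """Plain search: the count is read from hits.total, no aggregation."""
    return {
        "query": stream_window_filter(stream_id, start, end),
        "size": 3,
        "track_total_hits": True,
    }


def alert_style_body(stream_id, start, end):
    """Aggregation request: the count is read from a date_range bucket."""
    return {
        "query": stream_window_filter(stream_id, start, end),
        "size": 0,
        "aggs": {
            "window": {
                "date_range": {
                    "field": "timestamp",
                    "ranges": [{"from": start, "to": end}],
                }
            }
        },
    }


def reported_counts(search_response, agg_response):
    """Return (search count, aggregation count) from two _search responses."""
    total = search_response["hits"]["total"]["value"]
    bucket = agg_response["aggregations"]["window"]["buckets"][0]["doc_count"]
    return total, bucket


# Hand-written responses mirroring the numbers reported in this thread.
fake_search = {"hits": {"total": {"value": 14, "relation": "eq"}}}
fake_agg = {"aggregations": {"window": {"buckets": [{"doc_count": 950013}]}}}
print(reported_counts(fake_search, fake_agg))
```

Both bodies carry the same `query`, so on an unaffected version the two counts should agree. If the filters are ignored for the aggregation, the bucket counts every document in the time range while the plain search still looks correct, which matches what was reported here.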

@bjoerntw

I can confirm that after upgrading graylog-server from 6.0.4 to 6.0.5 and OpenSearch from 2.15.0 to 2.16.0 (OS = AlmaLinux 9.4), the alerts are not working anymore. The count function matches on numbers that aren't shown when replaying the search.
Example:
Event definition:
image

Event match:
image

Replaying search:
image

@bernd
Member

bernd commented Aug 15, 2024

There seems to be a workaround: opensearch-project/OpenSearch#15169 (comment)
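
The setting itself is not quoted anywhere in this thread. As far as I understand the linked comment, the workaround is to switch off the aggregation filter-rewrite optimization by setting the dynamic cluster setting `search.max_aggregation_rewrite_filters` to 0. Treat that setting name and value as an assumption and check the linked comment before applying anything. Under that assumption, a minimal sketch of the request could look like this (the base URL and the function name are placeholders; authentication and TLS are left out):

```python
import json
import urllib.request


def build_workaround_request(base_url="http://localhost:9200"):
    """Build (but do not send) a PUT to the cluster settings API.

    Assumption: the workaround is the cluster setting
    search.max_aggregation_rewrite_filters = 0, stored persistently.
    """
    body = {"persistent": {"search.max_aggregation_rewrite_filters": 0}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_workaround_request()
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually apply it against a reachable cluster, send the request:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to review exactly what would be changed on the cluster before running it.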

@fjl82
Author

fjl82 commented Aug 15, 2024

Thanks @bernd, I applied this setting and alerts seem to work ok again now.

@bjoerntw

I can confirm that the workaround mentioned by @bernd is working for me, thanks!

@sethgraylog
Collaborator

We have published an advisory regarding this issue that includes the workaround: https://graylog.org/post/alert-notice-opensearch-v2-16/
