[BUG] Search thread pool exceeded by security analytics detectors #629
Comments
I had the same issue when enabling Windows event log detectors on AWS OpenSearch.
@sbcd90 what do you think about creating multiple doc-level monitors in one detector? Reason: we can limit the number of rules in one doc-level monitor, which would take care of not causing search rejections and search queueing. Since we use the alerting workflow (composite monitor), which executes monitors in sequence, we can ensure that not all of the rules in one detector are executed at the same time in the percolate search. The number of rules per doc-level monitor needs to be decided from benchmarking.
hi @eirsep, that makes sense. but do we need to
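As an aside, here is a minimal sketch of the chunking idea proposed above: split a detector's rules into fixed-size groups, one group per doc-level monitor, so the composite workflow percolates them sequentially rather than all at once. The class, method, and chunk size below are hypothetical illustrations rather than the plugin's actual API; the real per-monitor limit would have to come from benchmarking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposal: partition a detector's rule IDs into
// groups of at most maxRulesPerMonitor, where each group would back one
// doc-level monitor inside the detector's composite workflow.
final class RuleChunker {
    static List<List<String>> chunkRuleIds(List<String> ruleIds, int maxRulesPerMonitor) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < ruleIds.size(); i += maxRulesPerMonitor) {
            // subList returns a view; copy it so each chunk stands on its own.
            chunks.add(new ArrayList<>(ruleIds.subList(i, Math.min(i + maxRulesPerMonitor, ruleIds.size()))));
        }
        return chunks;
    }
}
```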
We are seeing a large number of rejected search threads after enabling a detector for Windows logs with many Sigma rules enabled. After enabling the detector, we see the search thread queue jump to 1,000 every couple of seconds. Over the last hour we have seen several million rejected searches (see attachment).
We noticed this while on the Discover page querying our data. The searches started to fail frequently with
Steps to reproduce the behavior:
GET _cat/thread_pool?v&s=rejected:desc
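For reference, the same check can be run with curl against the cluster endpoint (the endpoint below is a placeholder); limiting it to the search pool and a few columns keeps the output focused on the rejections:

curl -s "https://<cluster-endpoint>/_cat/thread_pool/search?v&s=rejected:desc&h=node_name,name,active,queue,rejected,completed"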
What is the expected behavior?
The analytics detectors should moderate their activity against the cluster so that they do not adversely affect other search activity, and the detection searches should be able to complete without failing (I think that's what the "rejected" count is indicating; I'm not sure whether they retry or not).
What is your host/environment?
Do you have any additional context?
Just before submitting this report, I noticed that two of our data nodes seemed to stop responding. They both heap-dumped with a 30 GB heap and 96 GB total RAM. There is only one Windows log index, with about 50 GB of data. After restarting the nodes, the detector had timed out, so we stopped it and restarted it. Then the search thread problem started up again. I suspect the analytics detector might be causing our nodes to crash as well. Data ingest and searching had been solid for a week without any issue; these symptoms seem to have started today with the detector enabled in Security Analytics.
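While reproducing, it may also help to watch heap and CPU pressure on the data nodes alongside the rejection counts; a minimal check using the standard cat nodes API (the column selection here is just a suggestion):

GET _cat/nodes?v&h=name,heap.percent,ram.percent,cpu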