[BUG] Search thread pool exceeded by security analytics detectors #629

tallyoh · 2023-10-03T18:43:56Z

We are seeing a large number of rejected search threads after enabling a detector for Windows logs with many Sigma rules enabled. After enabling the detector, the search thread pool queue climbs to 1,000 every couple of seconds. Over the last hour we have seen several million rejected searches (see attachment).

We noticed this on the Discover page while querying our data. Searches started to fail frequently with:

rejected_execution_exception
rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@14a1906d on QueueResizingOpenSearchThreadPoolExecutor[name = os-01-data-01/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 13.4ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@791a2b55[Running, pool size = 49, active threads = 49, queued tasks = 1000, completed tasks = 13607928]]
Error: Too Many Requests
    at fetch_Fetch.fetchResponse (http://192.168.60.30:5601/6614/bundles/core/core.entry.js:15:177222)
    at async interceptResponse (http://192.168.60.30:5601/6614/bundles/core/core.entry.js:15:172640)
    at async http://192.168.60.30:5601/6614/bundles/core/core.entry.js:15:175120

Steps to reproduce the behavior:

Add a detector for Windows logs (map fields, etc.).
Start the detector, then run the following command to see the thread pool queue:
GET _cat/thread_pool?v&s=rejected:desc
Watch for the rejected count to start incrementing.
Watch for search errors on the Discover page (you may have to try several searches before one fails).
If you disable the detector, the issue seems to go away: the thread queue returns to more normal levels and searches in Discover no longer seem to fail.
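For anyone repeating these steps, a small script can make the "rejected count is incrementing" check less manual. The following is a minimal sketch, not part of the original report: it assumes the default column layout of `GET _cat/thread_pool?v` (node_name, name, active, queue, rejected), and the sample snapshots, the second node name, and all counts are made up for illustration.

```python
# Sketch: compare two snapshots of `GET _cat/thread_pool?v` output and report
# how many search requests each node rejected in between.
# The snapshot text below is hypothetical sample data, not real cluster output.

def parse_cat_thread_pool(text):
    """Parse the whitespace-separated table returned with the `v` (header) flag."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        row = dict(zip(header, ln.split()))
        for key in ("active", "queue", "rejected"):
            if key in row:
                row[key] = int(row[key])
        rows.append(row)
    return rows


def rejected_delta(before, after, pool="search"):
    """Return {node_name: newly rejected count} for one thread pool."""
    prev = {r["node_name"]: r["rejected"] for r in before if r["name"] == pool}
    return {
        r["node_name"]: r["rejected"] - prev.get(r["node_name"], 0)
        for r in after
        if r["name"] == pool
    }


# Hypothetical snapshots taken a few seconds apart.
BEFORE = """\
node_name     name   active queue rejected
os-01-data-01 search     49  1000     1200
os-01-data-02 search     12     0        0
os-01-data-01 write       1     0        0
"""

AFTER = """\
node_name     name   active queue rejected
os-01-data-01 search     49  1000     5200
os-01-data-02 search     10     0        0
os-01-data-01 write       0     0        0
"""

if __name__ == "__main__":
    delta = rejected_delta(parse_cat_thread_pool(BEFORE), parse_cat_thread_pool(AFTER))
    for node, count in delta.items():
        print(f"{node}: {count} new search rejections")
```

In practice the two snapshots would come from running the `_cat/thread_pool` request above twice, once before and once after starting the detector.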

What is the expected behavior?
The analytics detectors should moderate their activity against the cluster so that they do not adversely affect other search activity, and the detection searches themselves should complete without failing (I think that is what "rejected" indicates; I am not sure whether they are retried).

What is your host/environment?

OS: 7-node cluster on Oracle Linux 8.8 with local SSD storage
OpenSearch 2.10 - fresh install
Plugins - default installed/enabled

Do you have any additional context?
Just before submitting this report, I noticed that two of our data nodes seemed to stop responding. Both produced heap dumps; each has a 30 GB heap and 96 GB of total RAM. There is only one Windows log index, with about 50 GB of data. After we restarted the nodes, the detector had timed out, so we stopped it and restarted it. The search thread problem then started up again. I suspect the analytics detector might be causing our nodes to crash as well. Data ingest and searching had been solid for a week without any issue. These symptoms seem to have started today, with the detector enabled in Security Analytics.

dcowan-e-courier · 2023-10-17T18:23:53Z

I had the same issue when enabling windows event log detectors on AWS OpenSearch

eirsep · 2023-11-01T15:49:42Z

@sbcd90 what do you think about creating multiple doc level monitors in one detector?

Reason: we can limit the number of rules in one doc level monitor, which would avoid the search rejections and search queueing. Since we use an alerting workflow (composite monitor), which executes monitors in sequence, we can ensure that the rules in one detector are not all executed at the same time in the percolate search.

The number of rules per doc level monitor needs to be decided from benchmarking.
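To make the proposal concrete, here is a rough sketch of how a detector's rules could be partitioned into several doc level monitors. It is only an illustration: `split_rules`, the rule IDs, and the limit of 250 are hypothetical, and the real limit would come from the benchmarking mentioned above.

```python
# Sketch: partition a detector's rule IDs into fixed-size groups, one group per
# doc level monitor. The function name and the limit are illustrative assumptions.

def split_rules(rule_ids, max_rules_per_monitor):
    """Return consecutive chunks of rule_ids, each at most max_rules_per_monitor long."""
    if max_rules_per_monitor < 1:
        raise ValueError("max_rules_per_monitor must be at least 1")
    return [
        rule_ids[i:i + max_rules_per_monitor]
        for i in range(0, len(rule_ids), max_rules_per_monitor)
    ]


if __name__ == "__main__":
    rules = [f"sigma-rule-{n}" for n in range(1000)]  # hypothetical rule IDs
    monitors = split_rules(rules, max_rules_per_monitor=250)  # 250 is a placeholder
    print(len(monitors), "monitors;", len(monitors[0]), "rules in the first one")
```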

sbcd90 · 2023-11-01T18:21:20Z

Hi @eirsep, that makes sense. But do we need to execute monitors in sequence? Once we create multiple doc level monitors, they can run in parallel asynchronously and generate findings.

eirsep · 2023-11-01T19:12:27Z

> Hi @eirsep, that makes sense. But do we need to execute monitors in sequence? Once we create multiple doc level monitors, they can run in parallel asynchronously and generate findings.

We currently have the chained findings feature for bucket level monitors, which requires a few monitors to execute before another monitor, but we can certainly enhance it.
Executing in parallel, however, would defeat the purpose of splitting rules into different monitors: the number of concurrent searches would not go down, and all 1000 rules/queries would still be queried at the same time.
Another possible idea would be to improve percolate queries' control over how the underlying search requests are executed, but I don't know how feasible that is. You can advise better, @sbcd90.
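The difference between the two execution modes can be illustrated with a toy simulation. This is a sketch under a strong simplifying assumption, namely that each rule costs exactly one in-flight query while its monitor runs; it does not model how percolate searches actually fan out, and all names and numbers are illustrative.

```python
# Toy simulation: peak number of in-flight queries when monitors run
# sequentially versus in parallel. Assumes one in-flight query per rule,
# which is a simplification made only for this illustration.
import asyncio


async def run_monitor(rule_count, state):
    """Pretend to run one monitor: its queries are in flight until it finishes."""
    state["in_flight"] += rule_count
    state["peak"] = max(state["peak"], state["in_flight"])
    await asyncio.sleep(0)  # yield control so other monitors can start
    state["in_flight"] -= rule_count


async def peak_queries(monitor_sizes, sequential):
    """Return the highest number of simultaneous in-flight queries observed."""
    state = {"in_flight": 0, "peak": 0}
    if sequential:
        for size in monitor_sizes:
            await run_monitor(size, state)
    else:
        await asyncio.gather(*(run_monitor(size, state) for size in monitor_sizes))
    return state["peak"]


if __name__ == "__main__":
    sizes = [250, 250, 250, 250]  # 1000 rules split across four monitors
    print("sequential peak:", asyncio.run(peak_queries(sizes, sequential=True)))
    print("parallel peak:  ", asyncio.run(peak_queries(sizes, sequential=False)))
```

Under this model, sequential execution caps the peak at the size of the largest monitor, while parallel execution brings it back to the total rule count, which is the concern raised above.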

tallyoh added the bug and untriaged labels Oct 3, 2023

riysaxen-amzn removed the untriaged label Mar 22, 2024

github-project-automation bot added this to Security Analytics Roadmap Aug 30, 2024

github-project-automation bot moved this to Bugs in Security Analytics Roadmap Aug 30, 2024
