[BUG] Search thread pool exceeded by security analytics detectors #629

Open
tallyoh opened this issue Oct 3, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@tallyoh

tallyoh commented Oct 3, 2023

We are seeing a large number of rejected search threads after enabling a detector for Windows logs with many Sigma rules enabled. After enabling the detector, the search thread queue climbs to 1,000 every couple of seconds. Over the last hour we have seen several million rejected searches (see attachment).

We noticed this while in the Discover page and querying our data. The searches started to fail frequently with

rejected_execution_exception
rejected execution of org.opensearch.common.util.concurrent.TimedRunnable@14a1906d on QueueResizingOpenSearchThreadPoolExecutor[name = os-01-data-01/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 13.4ms, adjustment amount = 50, org.opensearch.common.util.concurrent.QueueResizingOpenSearchThreadPoolExecutor@791a2b55[Running, pool size = 49, active threads = 49, queued tasks = 1000, completed tasks = 13607928]]
Error: Too Many Requests
    at fetch_Fetch.fetchResponse (http://192.168.60.30:5601/6614/bundles/core/core.entry.js:15:177222)
    at async interceptResponse (http://192.168.60.30:5601/6614/bundles/core/core.entry.js:15:172640)
    at async http://192.168.60.30:5601/6614/bundles/core/core.entry.js:15:175120

Steps to reproduce the behavior:

  1. Add a detector for Windows logs (map the fields, etc.).
  2. Start the detector, then run the following command to see the thread pool queue:
    GET _cat/thread_pool?v&s=rejected:desc
  3. Watch for the rejected count to start incrementing.
  4. Watch for search errors on the Discover page (you may have to try several searches before one fails).
  5. If you disable the detector, the issue seems to go away: the thread queue returns to normal and searches in Discover no longer seem to fail.
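
To make steps 2 and 3 easier to watch, the rejected counter can be polled and diffed between snapshots. The following is only a minimal sketch: it assumes the cluster is reachable without authentication at a hypothetical `http://localhost:9200`, and `fetch_search_pool`, `rejected_deltas` and `watch` are illustrative helper names, not part of any OpenSearch client.

```python
import json
import time
import urllib.request

# Hypothetical endpoint; adjust host, port, TLS and auth for your cluster.
BASE_URL = "http://localhost:9200"
# Same _cat/thread_pool API as in step 2, limited to the search pool, as JSON.
CAT_PATH = "/_cat/thread_pool/search?format=json&h=node_name,name,active,queue,rejected"


def fetch_search_pool(base_url=BASE_URL):
    """Return the search thread pool rows for every node as a list of dicts."""
    with urllib.request.urlopen(base_url + CAT_PATH) as resp:
        return json.load(resp)


def rejected_deltas(previous, current):
    """Map node name -> increase in rejected searches between two snapshots."""
    # The cat API returns numbers as strings, so convert before subtracting.
    before = {row["node_name"]: int(row["rejected"]) for row in previous}
    return {
        row["node_name"]: int(row["rejected"]) - before.get(row["node_name"], 0)
        for row in current
    }


def watch(interval_seconds=5):
    """Print the per-node growth in rejections until interrupted."""
    previous = fetch_search_pool()
    while True:
        time.sleep(interval_seconds)
        current = fetch_search_pool()
        print(rejected_deltas(previous, current))
        previous = current
```

Calling `watch()` before and after starting the detector should show whether the rejections grow only while the detector is running.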

What is the expected behavior?
The analytics detectors should moderate their activity against the cluster so that they do not adversely affect other search activity, and the detection searches themselves should complete without failing. (I think that is what "rejected" indicates; I am not sure whether they are retried.)

What is your host/environment?

  • OS: 7-node cluster on Oracle Linux 8.8 with local SSD storage
  • OpenSearch 2.10 - fresh install
  • Plugins - default installed/enabled

image

Do you have any additional context?
Just before submitting this report, I noticed that two of our data nodes had stopped responding. Both produced heap dumps; each has a 30 GB heap and 96 GB of total RAM. There is only one Windows log index, with about 50 GB of data. After we restarted the nodes, the detector had timed out, so we stopped it and restarted it. Then the search thread problem started up again. I suspect the analytics detector might be causing our nodes to crash as well. Data ingest and searching had been solid for a week without any issue. These symptoms seem to have started today, with the detector enabled in Security Analytics.

tallyoh added the bug and untriaged labels Oct 3, 2023
@dcowan-e-courier

I had the same issue when enabling Windows event log detectors on AWS OpenSearch.

@eirsep
Member

eirsep commented Nov 1, 2023

@sbcd90 what do you think about creating multiple doc level monitors in one detector?

Reason: limiting the number of rules in each doc level monitor would keep a single detector from causing search rejections and search queueing. Since we use the alerting workflow (composite monitor), which executes monitors in sequence, we can ensure that not all the rules in one detector are executed at the same time in the percolate search.

The number of rules per doc level monitor needs to be decided through benchmarking.
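
The splitting idea above can be sketched roughly as follows. This is not the plugin's implementation: `chunk_rules`, `run_detector` and the `run_monitor` callback are hypothetical names, and the batch size is a placeholder that would come from the benchmarking mentioned above.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk_rules(rule_ids: Sequence[str], max_rules_per_monitor: int) -> Iterator[List[str]]:
    """Split a detector's rules into batches, one batch per doc level monitor."""
    if max_rules_per_monitor < 1:
        raise ValueError("max_rules_per_monitor must be at least 1")
    for start in range(0, len(rule_ids), max_rules_per_monitor):
        yield list(rule_ids[start:start + max_rules_per_monitor])


def run_detector(
    rule_ids: Sequence[str],
    max_rules_per_monitor: int,
    run_monitor: Callable[[List[str]], T],
) -> List[T]:
    """Run one (hypothetical) monitor per batch, strictly one after another."""
    results = []
    for batch in chunk_rules(rule_ids, max_rules_per_monitor):
        # Only this batch's rules are being queried while run_monitor executes.
        results.append(run_monitor(batch))
    return results
```

Because the loop waits for each batch to finish, the number of rules queried at any moment is capped at `max_rules_per_monitor`, at the cost of a longer total run.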

@sbcd90
Collaborator

sbcd90 commented Nov 1, 2023

Hi @eirsep, that makes sense. But do we need to execute the monitors in sequence? Once we create multiple doc level monitors, they can run in parallel asynchronously and generate findings.

@eirsep
Member

eirsep commented Nov 1, 2023

> Hi @eirsep, that makes sense. But do we need to execute the monitors in sequence? Once we create multiple doc level monitors, they can run in parallel asynchronously and generate findings.

  • We currently have the chained findings feature for bucket level monitors, which requires a few monitors to execute before another monitor, but we can certainly enhance it.
  • Executing in parallel would defeat the purpose of splitting rules into different monitors, because the number of concurrent searches would not drop: all 1000 rules/queries would still be queried at the same time.
  • Another possible idea would be to improve the percolate queries' control over how the underlying search requests are executed, but I don't know how feasible that is. You can advise better, @sbcd90.
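
The second point can be illustrated with a toy simulation. It makes loud assumptions: each monitor is modeled as holding one in-flight query per rule for the duration of its run, `asyncio.sleep(0)` stands in for the percolate search, and `run_batches` and `peak_in_flight` are hypothetical names that say nothing about how the alerting plugin actually schedules work.

```python
import asyncio
from typing import Sequence


async def run_batches(batch_sizes: Sequence[int], max_parallel_monitors: int) -> int:
    """Simulate the monitors and return the peak number of rule queries in flight."""
    gate = asyncio.Semaphore(max_parallel_monitors)
    in_flight = 0
    peak = 0

    async def run_monitor(size: int) -> None:
        nonlocal in_flight, peak
        async with gate:  # at most max_parallel_monitors monitors run at once
            in_flight += size
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # stand-in for the percolate search
            in_flight -= size

    await asyncio.gather(*(run_monitor(size) for size in batch_sizes))
    return peak


def peak_in_flight(batch_sizes: Sequence[int], max_parallel_monitors: int) -> int:
    """Synchronous wrapper around the simulation."""
    return asyncio.run(run_batches(batch_sizes, max_parallel_monitors))
```

In this model, four batches of 250 rules peak at 1000 in-flight queries when all four monitors run in parallel and at 250 when they run one at a time, which is the trade-off described above. A limit between those two extremes would be a middle ground between throughput and search pool pressure.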
