enhancement: the regex consumes an excessive amount of memory. #16664
Comments
Each executor within the same operator creates its own expressions, which are usually identical yet consume duplicated resources. Moreover, if the same expression node is present in different operators or streaming jobs, it can also be reused. We may consider performing caching at different granularities:
Please note that in this cluster I've raised backtrack_limit from 1_000_000 to 1_000_000_000 to work around Error::BacktrackLimitExceeded. So there is indeed very large regex state that requires sizable runtime memory. Since the memory pool never shrinks, this may explain the large memory footprint.

Update: after limiting the regex's input to 100_000 and reverting backtrack_limit to the default 1_000_000, the regex still takes 50 GB of memory.
+1 for this conclusion. The comments here sound helpful. FYI: https://github.com/rust-lang/regex/blob/8856fe36ac7dc37989e6ffb26b5fc57189bae626/regex-automata/src/meta/regex.rs#L92 It seems the quickest solution is to follow this usage, that is, sharing the same Regex object across multiple threads/tasks.
But since we are using fancy-regex, I'm not sure whether it's doable.
So it appears that it's the dynamic part used at match time that consumes too much memory, rather than the statically compiled part.
Do you mean that we intentionally let all regexes (with the same pattern) reuse the dynamic part to reduce the pool size, at the cost of potential synchronization? Sounds reasonable as long as we find it does not affect performance too much.
For this cluster, we may need all regexes of all patterns to reuse the dynamic part.
Hmmm, in that way there will be much more lock contention. I tend to try "reusing all regexes (with the same pattern)" first. As mentioned before, we are using fancy-regex.
BTW, if there's backtracking in the regex, …
The pool is sharded to alleviate contention; however, as a side effect, this increases memory usage, since the pool is accessed by thread_id % MAX_POOL_STACKS. The memory usage can be amplified at most MAX_POOL_STACKS = 8 times. Specific to this cluster, given that concurrent access to a regex's pool won't occur (parallelism of one for each regex on each compute node), setting the shard count MAX_POOL_STACKS to 1 should be adequate. I will give it a try later when the environment is available.
Describe the bug
The cluster has 1000+ MVs, each containing a regex. The regex library consistently holds 70+ GB of memory.
manual.heap.collapsed.zip
Error message/log
No response
To Reproduce
No response
Expected behavior
No response
How did you deploy RisingWave?
No response
The version of RisingWave
v1.8
backtrack_limit increased from 1_000_000 to 1_000_000_000
Additional context
No response