
enhancement: the regex consumes an excessive amount of memory. #16664

Closed
zwang28 opened this issue May 9, 2024 · 9 comments
Labels
component/func-expr Support a SQL function or operator type/enhancement Improvements to existing implementation.

Comments

@zwang28
Contributor

zwang28 commented May 9, 2024

Describe the bug

The cluster has 1000+ MVs, each containing a regex. The regex library consistently holds 70+ GB of memory.

manual.heap.collapsed.zip

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

v1.8

backtrack_limit increased from 1_000_000 to 1_000_000_000

Additional context

No response

@zwang28 zwang28 added the type/bug Something isn't working label May 9, 2024
@github-actions github-actions bot added this to the release-1.10 milestone May 9, 2024
@BugenZhao
Member

Each executor within the same operator creates its own expressions, which are usually identical yet each consume resources separately. Moreover, if the same expression node is present in different operators or streaming jobs, it could also be reused.

We may consider performing caching at different granularities:

  • cache the results of build_expr based on their protobuf input
  • cache the "context"s that are believed to be memory-consuming based on their own inputs, like RegexpContext in this case
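A minimal std-only sketch of the first option. `BuiltExpr`, `build_expr_uncached`, and `ExprCache` are illustrative names, not RisingWave's actual API; the cache key is the serialized protobuf of the expression node:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Stand-in for a built expression; in practice this would hold the
// potentially memory-heavy compiled state, e.g. a RegexpContext.
struct BuiltExpr {
    desc: String,
}

// Hypothetical expensive build step, keyed by the serialized protobuf node.
fn build_expr_uncached(proto_bytes: &[u8]) -> BuiltExpr {
    BuiltExpr {
        desc: format!("expr built from {} proto bytes", proto_bytes.len()),
    }
}

// Cache built expressions by their protobuf input so identical expression
// nodes across executors share one instance instead of each rebuilding it.
struct ExprCache {
    inner: Mutex<HashMap<Vec<u8>, Arc<BuiltExpr>>>,
}

impl ExprCache {
    fn new() -> Self {
        ExprCache { inner: Mutex::new(HashMap::new()) }
    }

    fn build_expr(&self, proto_bytes: &[u8]) -> Arc<BuiltExpr> {
        let mut map = self.inner.lock().unwrap();
        map.entry(proto_bytes.to_vec())
            .or_insert_with(|| Arc::new(build_expr_uncached(proto_bytes)))
            .clone()
    }
}

fn main() {
    let cache = ExprCache::new();
    let a = cache.build_expr(b"regexp_match(col, 'a+b')");
    let b = cache.build_expr(b"regexp_match(col, 'a+b')");
    // Identical protobuf input yields the same shared instance.
    assert!(Arc::ptr_eq(&a, &b));
    println!("{}", a.desc);
}
```

A real implementation would also need an eviction policy (e.g. weak references or dropping entries when streaming jobs finish), otherwise the cache itself becomes a leak.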

@BugenZhao BugenZhao added type/enhancement Improvements to existing implementation. component/func-expr Support a SQL function or operator and removed type/bug Something isn't working labels May 10, 2024
@zwang28 zwang28 changed the title bug: the regex consumes an excessive amount of memory. enhancement: the regex consumes an excessive amount of memory. May 10, 2024
@zwang28
Contributor Author

zwang28 commented May 10, 2024

Please note that in this cluster I've raised backtrack_limit from 1_000_000 to 1_000_000_000 to work around Error::BacktrackLimitExceeded. So there's indeed a very large regex state which requires sizable runtime memory.

Because the memory pool never contracts in size, it may explain the large memory footprint.

update:
I've reverted the backtrack_limit to 1_000_000, but the memory usage remains high.

@zwang28
Contributor Author

zwang28 commented Jul 1, 2024

After limiting the regex's input length to 100_000 and reverting backtrack_limit to the default 1_000_000, the regex still takes 50 GB of memory:

2024-07-01-09-00-17.manual.heap.collapsed.zip

@fuyufjh
Member

fuyufjh commented Jul 1, 2024

Because the memory pool never contracts in size, it may explain the large memory footprint.

+1 for this conclusion. regex_automata::util::pool::inner::Pool<T,F>::get_slow takes ~47 GB memory in the memory flamegraph.

The comments here sound helpful, FYI: https://github.com/rust-lang/regex/blob/8856fe36ac7dc37989e6ffb26b5fc57189bae626/regex-automata/src/meta/regex.rs#L92

It seems the quickest solution is to follow this usage, that is, sharing the same Regex object across multiple threads/tasks:

/// * The same `Regex` is shared across multiple threads simultaneously,
/// usually via a [`util::lazy::Lazy`](crate::util::lazy::Lazy) or something
/// similar from the `once_cell` or `lazy_static` crates.

But since we are using fancy-regex, I'm not sure whether it's doable.
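A std-only sketch of that sharing pattern, using `std::sync::OnceLock` in place of `once_cell`/`lazy_static`, with a stand-in type for the compiled regex (real code would store a `fancy_regex::Regex` or `regex::Regex` here):

```rust
use std::sync::OnceLock;
use std::thread;

// Stand-in for an expensive-to-build compiled regex. Sharing one static
// instance means its lazily grown scratch pool is shared too, instead of
// one pool per cloned/rebuilt regex.
struct CompiledPattern {
    pattern: String,
}

fn shared_pattern() -> &'static CompiledPattern {
    static PATTERN: OnceLock<CompiledPattern> = OnceLock::new();
    PATTERN.get_or_init(|| CompiledPattern {
        pattern: String::from("a+(b|c)*"),
    })
}

fn main() {
    // All threads observe the exact same instance (same address).
    let handles: Vec<_> = (0..4)
        .map(|_| thread::spawn(|| shared_pattern() as *const CompiledPattern as usize))
        .collect();
    let addrs: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert!(addrs.windows(2).all(|w| w[0] == w[1]));
    println!("pattern shared across threads: {}", shared_pattern().pattern);
}
```

This matches the usage the regex-automata doc comment describes: one `Regex` behind a lazily initialized static, used concurrently from many threads.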

@BugenZhao
Member

Currently, a pool never contracts in size. Its size is proportional to the maximum number of simultaneous uses.

So it appears that it's the dynamic part used when matching that consumes too much memory, rather than the statically compiled part.

It seems the quickest solution to follow this usage, that is, sharing the same Regex object across multiple threads/tasks

Do you mean that we intentionally let all regexes (with the same pattern) reuse the dynamic part to reduce pool size, at the cost of potential synchronization? Sounds reasonable as long as we find it does not affect performance too much.
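One way to read that proposal, as an illustrative std-only sketch: a process-wide map from pattern to one shared compiled regex, so every expression with the same pattern points at a single instance (and hence a single scratch pool). The type names and the compile counter are made up for illustration; the synchronization cost here is the map lock on build, plus whatever contention the shared pool adds at match time:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

// Counts how many times we actually pay the compile cost.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

// Stand-in for a compiled regex whose "dynamic part" we want to share.
struct SharedRegex {
    pattern: String,
}

fn compile(pattern: &str) -> SharedRegex {
    COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
    SharedRegex { pattern: pattern.to_string() }
}

// Process-wide pattern -> regex cache; same pattern, same instance.
fn regex_for(pattern: &str) -> Arc<SharedRegex> {
    static CACHE: OnceLock<Mutex<HashMap<String, Arc<SharedRegex>>>> = OnceLock::new();
    let cache = CACHE.get_or_init(|| Mutex::new(HashMap::new()));
    let mut map = cache.lock().unwrap();
    map.entry(pattern.to_string())
        .or_insert_with(|| Arc::new(compile(pattern)))
        .clone()
}

fn main() {
    let r1 = regex_for("foo[0-9]+");
    let r2 = regex_for("foo[0-9]+");
    let r3 = regex_for("bar.*");
    assert!(Arc::ptr_eq(&r1, &r2)); // same pattern, shared instance
    assert!(!Arc::ptr_eq(&r1, &r3)); // different pattern, distinct instance
    assert_eq!(COMPILE_COUNT.load(Ordering::SeqCst), 2);
    println!("compiled {} distinct patterns", COMPILE_COUNT.load(Ordering::SeqCst));
}
```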

@zwang28
Contributor Author

zwang28 commented Jul 2, 2024

intentionally let all regexs (with the same pattern) reuse the dynamic part to reduce pool size

For this cluster, we may need all regexes of all patterns to reuse the dynamic part, because in this cluster a regex of the same pattern already occurs exactly once on each compute node, i.e. the parallelism of each regex equals the number of compute nodes.

@fuyufjh
Member

fuyufjh commented Jul 2, 2024

intentionally let all regexs (with the same pattern) reuse the dynamic part to reduce pool size

For this cluster, we may need all regexes of all patterns to reuse the dynamic part, because in this cluster a regex of the same pattern already occurs exactly once on each compute node, i.e. the parallelism of each regex equals the number of compute nodes.

Hmmm, in that way there will be much more lock contention. I tend to try "reusing all regexes (with the same pattern)" first. As mentioned before, we are using fancy-regex on top of regex_automata, and doing this requires refactoring the code to use regex_automata directly. Not sure how difficult it is.

@BugenZhao
Member

I've reverted the backtrack_limit to 1_000_000, but the memory usage remains high.

BTW, if there's backtracking in the regex, fancy-regex won't dispatch the implementation to regex_automata but uses its own implementation instead, which does not seem to involve a Cache. So I tend to believe it's a different issue.

@zwang28
Contributor Author

zwang28 commented Jul 2, 2024

The pool is sharded to alleviate contention; however, as a side effect, this increases memory usage, since shards are selected by thread_id % MAX_POOL_STACKS. The memory usage can be amplified by at most MAX_POOL_STACKS = 8 times.
See comments:
https://github.com/rust-lang/regex/blob/8856fe36ac7dc37989e6ffb26b5fc57189bae626/regex-automata/src/util/pool.rs#L287C1-L288C1

Specifically for this cluster, given that concurrent access to a regex's pool won't occur (one parallelism per regex on each compute node), setting the shard count MAX_POOL_STACKS to 1 should be adequate. I'll give it a try later when the environment is available.
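The amplification bound can be illustrated with a toy model (std-only; the `thread_id % MAX_POOL_STACKS` indexing mirrors the linked regex-automata comment, while the per-cache byte size is a made-up number):

```rust
use std::collections::HashSet;

const MAX_POOL_STACKS: usize = 8;

// Each shard that gets touched lazily allocates its own scratch cache, so
// total pool memory is (distinct shards touched) * (per-cache size),
// bounded by MAX_POOL_STACKS * per-cache size.
fn shards_touched(thread_ids: &[usize]) -> usize {
    thread_ids
        .iter()
        .map(|tid| tid % MAX_POOL_STACKS)
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    let per_cache_bytes = 64 * 1024; // illustrative size of one scratch cache

    // A single thread (parallelism 1, as in this cluster): one shard, 1x memory.
    assert_eq!(shards_touched(&[42]), 1);

    // Many threads: at most MAX_POOL_STACKS shards, i.e. at most 8x memory.
    let tids: Vec<usize> = (0..100).collect();
    let worst = shards_touched(&tids);
    assert_eq!(worst, MAX_POOL_STACKS);
    println!("worst-case pool memory: {} bytes", worst * per_cache_bytes);
}
```

With MAX_POOL_STACKS forced to 1, every thread maps to the same shard, so the amplification factor collapses to 1 at the cost of contention if the regex ever were used concurrently.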

@zwang28 zwang28 removed this from the release-2.0 milestone Aug 8, 2024
@zwang28 zwang28 closed this as not planned Won't fix, can't repro, duplicate, stale Oct 25, 2024
Projects
None yet
Development

No branches or pull requests

3 participants