
discussion: backfill subqueries/CTEs "one by one" in a big complex query automatically #17404

Closed
lmatz opened this issue Jun 23, 2024 · 5 comments

Comments

@lmatz
Contributor

lmatz commented Jun 23, 2024

One observation is that users often write a large, complex query with many CTEs/subqueries and then build an MV from this query.

This is arguably not the best way to organize queries, but it does appear often.
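For concreteness, such an MV might look like the sketch below (all table and column names are hypothetical, made up purely for illustration):

```sql
-- A hypothetical MV of the shape described above: one large query with
-- several chained CTEs, materialized in a single statement.
CREATE MATERIALIZED VIEW sales_report AS
WITH enriched AS (
    SELECT o.order_id, o.customer_id, i.quantity * i.unit_price AS amount
    FROM orders o
    JOIN order_items i ON o.order_id = i.order_id
), per_customer AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM enriched
    GROUP BY customer_id
)
-- ... imagine many more CTEs and subqueries stacked here ...
SELECT c.region, SUM(pc.total_amount) AS region_total
FROM per_customer pc
JOIN customers c ON pc.customer_id = c.customer_id
GROUP BY c.region;
```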

Maybe it is just the way users like to organize their queries,
or maybe the queries were migrated from existing OLAP systems in their tech stacks,
or perhaps users are under the impression that this is the most efficient way of building an MV (e.g., in terms of latency, or of avoiding the extra storage cost of intermediate MVs).

Anyway, the reasons vary and are generally unknown to us. And without users approaching us, we may not have the chance to offer better suggestions.

However, it is not an efficient way of building an MV.

The reason is intuitive: running a query with 10 CTEs/subqueries is similar to running 10 different queries at the same time.

There can be serious competition for resources:

  1. Memory. With many queries sharing memory, cache hit rates drop; we learned from many recent PoC cases that high remote IOPS induced by cache misses is one of the most important factors leading to bad performance.
  2. CPU. The block cache/meta cache will be very busy bringing in and evicting data, and data needs to be deserialized before it can be used by operators, which consumes more CPU.
  3. Remote I/O. Due to (1), RW may hit the object storage's rate limit more frequently.

We remark that this competition for resources does not simply mean that each of the 10 queries gets 1/10 of the total resources and is therefore expected to deliver 1/10 of the performance; the extra overhead leads to performance even worse than 1/10.

One immediate thought is to automatically decompose the big complex query into groups of operators (respecting the dependencies between them) and backfill each group one by one. (Of course, if resources are more than sufficient, we can parallelize the execution of multiple groups.)

It can also be seen as inserting implicit MVs into this big complex query, except that these MVs are transparent to users.

Also, aggregations can produce many intermediate changes; it is simply better if we can build the state of downstream operators and (implicit) MVs on top of the final results of backfilling, instead of processing unnecessary intermediate changes.
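For comparison, the manual counterpart of this idea today would be to split the same hypothetical query into explicit intermediate MVs and create them sequentially, letting each one finish backfilling before creating the next; the automatic approach would do roughly this, but keep the intermediate MVs transparent to users. A sketch, reusing the made-up schema from above:

```sql
-- Manually decomposed version of the hypothetical query above: each group
-- of operators becomes an explicit intermediate MV, and backfilling
-- proceeds one MV at a time instead of all at once.
CREATE MATERIALIZED VIEW mv_enriched AS
SELECT o.order_id, o.customer_id, i.quantity * i.unit_price AS amount
FROM orders o
JOIN order_items i ON o.order_id = i.order_id;

CREATE MATERIALIZED VIEW mv_per_customer AS
SELECT customer_id, SUM(amount) AS total_amount
FROM mv_enriched
GROUP BY customer_id;

-- Created only after the previous MVs finish backfilling, so its state is
-- built from their final results rather than their intermediate changes.
CREATE MATERIALIZED VIEW sales_report AS
SELECT c.region, SUM(pc.total_amount) AS region_total
FROM mv_per_customer pc
JOIN customers c ON pc.customer_id = c.customer_id
GROUP BY c.region;
```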

@github-actions github-actions bot added this to the release-1.10 milestone Jun 23, 2024
@BugenZhao
Member

One immediate thought is to automatically decompose the big complex query into groups of operators (respecting the dependencies between them) and backfill each group one by one

One point to note is that "backfilling" here means the result sets of these groups have to be materialized, so extra overhead is also introduced.

Also linking to #12771, since it sounds like the opposite of the idea proposed here: it encourages users to create multiple materialized views together based on change logs and avoid backfilling as much as possible.

@lmatz lmatz removed this from the release-1.10 milestone Jul 10, 2024
@lmatz
Contributor Author

lmatz commented Jul 15, 2024

Anyway, let me put it into the documentation as a suggestion first.

Regarding the extra materialization, I remember there was an issue that discussed the idea of keeping the query but not storing its results in the MV.

#12771 can also be reasonable, depending on the cluster utilization and workload.


github-actions bot commented Oct 6, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@lmatz
Contributor Author

lmatz commented Oct 14, 2024

With serverless backfilling enabled on RW Cloud, this issue becomes less significant.

@lmatz lmatz closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 14, 2024
@lmatz
Contributor Author

lmatz commented Nov 20, 2024

It may be related to #14559; just tagging it here for visibility.
