
discussion: backfill subqueries/CTEs "one by one" in a big complex query automatically #17404

Closed
lmatz opened this issue Jun 23, 2024 · 5 comments

Comments

@lmatz
Contributor

lmatz commented Jun 23, 2024

One observation is that users often write a large, complex query with many CTEs/subqueries and then build an MV from this query.

This is arguably not the best way to organize queries, but it does appear often.
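For concreteness, such an MV might look like the sketch below (all table and column names are hypothetical, made up purely for illustration):

```sql
-- A hypothetical MV of the shape described above: one large query with
-- several chained CTEs, materialized in a single statement.
CREATE MATERIALIZED VIEW sales_report AS
WITH enriched AS (
    SELECT o.order_id, o.customer_id, i.quantity * i.unit_price AS amount
    FROM orders o
    JOIN order_items i ON o.order_id = i.order_id
), per_customer AS (
    SELECT customer_id, SUM(amount) AS total_amount
    FROM enriched
    GROUP BY customer_id
)
-- ... imagine many more CTEs and subqueries stacked here ...
SELECT c.region, SUM(pc.total_amount) AS region_total
FROM per_customer pc
JOIN customers c ON pc.customer_id = c.customer_id
GROUP BY c.region;
```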

Maybe it is just the way users like to organize their queries,
or maybe the queries were migrated from existing OLAP systems in their tech stacks,
or perhaps users are under the impression that this is the most efficient way of building an MV (e.g., in terms of latency, or of avoiding the extra storage cost of intermediate MVs).

Anyway, the reasons vary and are generally unknown to us. And without users approaching us, we may not have the chance to offer better suggestions.

However, it is not an efficient way of building an MV.

The reason is intuitive: running a query with 10 CTEs/subqueries is similar to running 10 different queries at the same time.

There can be serious competition for resources:

  1. Memory. With many queries sharing memory, cache hit rates drop; we learned from many recent PoC cases that high remote IOPS induced by cache misses is one of the most important factors leading to bad performance.
  2. CPU. The block cache/meta cache will be very busy bringing in and evicting data, and data needs to be deserialized before it can be used by operators, which consumes more CPU.
  3. Remote I/O. Due to (1), RW may hit the object storage's rate limit more frequently.

We remark that this competition for resources does not simply mean that each of the 10 queries gets 1/10 of the total resources and is therefore expected to deliver 1/10 of the performance; the extra overhead leads to performance even worse than 1/10.

One immediate thought is to automatically decompose the big complex query into groups of operators (respecting the dependencies between them) and backfill each group one by one. (Of course, if resources are more than sufficient, we can parallelize the execution of multiple groups.)

It can also be seen as inserting implicit MVs into this big complex query, except that these MVs are transparent to users.

Also, aggregations can produce many intermediate changes; it is simply better if we can build the state of downstream operators and (implicit) MVs on top of the final results of backfilling, instead of processing unnecessary intermediate changes.
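For comparison, the manual counterpart of this idea today would be to split the same hypothetical query into explicit intermediate MVs and create them sequentially, letting each one finish backfilling before creating the next; the automatic approach would do roughly this, but keep the intermediate MVs transparent to users. A sketch, reusing the made-up schema from above:

```sql
-- Manually decomposed version of the hypothetical query above: each group
-- of operators becomes an explicit intermediate MV, and backfilling
-- proceeds one MV at a time instead of all at once.
CREATE MATERIALIZED VIEW mv_enriched AS
SELECT o.order_id, o.customer_id, i.quantity * i.unit_price AS amount
FROM orders o
JOIN order_items i ON o.order_id = i.order_id;

CREATE MATERIALIZED VIEW mv_per_customer AS
SELECT customer_id, SUM(amount) AS total_amount
FROM mv_enriched
GROUP BY customer_id;

-- Created only after the previous MVs finish backfilling, so its state is
-- built from their final results rather than their intermediate changes.
CREATE MATERIALIZED VIEW sales_report AS
SELECT c.region, SUM(pc.total_amount) AS region_total
FROM mv_per_customer pc
JOIN customers c ON pc.customer_id = c.customer_id
GROUP BY c.region;
```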

@github-actions github-actions bot added this to the release-1.10 milestone Jun 23, 2024
@BugenZhao
Member

One immediate thought is to automatically decompose the big complex query into groups of operators (respecting the dependencies between them) and backfill each group one by one

One point to note is that "backfilling" here means the result sets of these groups have to be materialized, so extra overhead is also introduced.

Also linking to #12771, since it sounds like the opposite of the idea proposed here: it encourages users to create multiple materialized views together based on change logs and avoid backfilling as much as possible.

@lmatz lmatz removed this from the release-1.10 milestone Jul 10, 2024
@lmatz
Contributor Author

lmatz commented Jul 15, 2024

Anyway, let me put it into the documentation as a suggestion first.

Regarding the extra materialization, I remember there was an issue that discussed the idea of keeping the query but not storing its results in the MV.

#12771 can also be reasonable, depending on the cluster utilization and workload.


github-actions bot commented Oct 6, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@lmatz
Contributor Author

lmatz commented Oct 14, 2024

With serverless backfilling enabled on RW Cloud, this issue becomes less significant.

@lmatz lmatz closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 14, 2024
@lmatz
Contributor Author

lmatz commented Nov 20, 2024

It may be related to #14559; just tagging it here for visibility.
