Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discussion: allow stream query on creating mv? #12771

Open
st1page opened this issue Oct 11, 2023 · 14 comments
Open

discussion: allow stream query on creating mv? #12771

st1page opened this issue Oct 11, 2023 · 14 comments
Assignees
Milestone

Comments

@st1page
Copy link
Contributor

st1page commented Oct 11, 2023

offline discussed with @wyhyhyhyh
Currently, the creating MV is invisible and the user can not do batch or streaming query on it.

dev=> SET BACKGROUND_DDL=true;
SET_VARIABLE
dev=> CREATE TABLE t (v1 int);
CREATE_TABLE
dev=> INSERT INTO t select * from generate_series(1, 300000);
INSERT 0 300000
dev=> FLUSH;
FLUSH
dev=> CREATE MATERIALIZED VIEW m1 as SELECT * FROM t;
CREATE_MATERIALIZED_VIEW
dev=> CREATE MATERIALIZED VIEW m2 as SELECT * FROM m1;
ERROR:  QueryError: Catalog error: table or source not found: m1

Complex data processing pipelines usually are layered which include many materialized views depending on each other. While creating a stream query on the existing MV, RW will backfill all the historical data in the upstream MV and union it with the coming changes.
Considering there are mvA, a mvB on mvA, and a mvC on mvB(S->MvA->MvB->MvC). Under the current design, the user must create the materialized views one by one. He should create the MvA and wait for all historical data to be backfilled in MvB, and only after that, he can create the mvB. But if the pipeline can be constructed at a time, we can see the backfilling is unnecessary.
We can achieve that in two ways.

  1. allow users to create multiple materialized views using syntax like "transactional DDL"
  2. allow users to create stream query on the creating mv, so that the downstream mv will consume the historical data faster(because there are less data than the base table) and catch up with the upstream.
@github-actions github-actions bot added this to the release-1.4 milestone Oct 11, 2023
@fuyufjh

This comment was marked as outdated.

@st1page
Copy link
Contributor Author

st1page commented Oct 11, 2023

Also for user experience, I think we should allow stream query on creating mv, or he must wait for the mv backfill all the historical data.

@fuyufjh

This comment was marked as outdated.

@chenzl25
Copy link
Contributor

allow users to create stream query on the creating mv, so that the downstream mv will consume the historical data faster(because there are less data than the base table) and catch up with the upstream.

Personally I prefer the second one, because it is more practical and the concept of transactional ddl is too big? BTW, recoverable backfill is also needed in this case. We can return the ddl immediately to make creating mv visible to streaming query and batch query. But for batch query consistency, we need to block it until all its upstream backfilling mvs are finished.

@kwannoel
Copy link
Contributor

kwannoel commented May 12, 2024

allow users to create stream query on the creating mv, so that the downstream mv will consume the historical data faster(because there are less data than the base table) and catch up with the upstream.

Personally I prefer the second one, because it is more practical and the concept of transactional ddl is too big? BTW, recoverable backfill is also needed in this case. We can return the ddl immediately to make creating mv visible to streaming query and batch query. But for batch query consistency, we need to block it until all its upstream backfilling mvs are finished.

I think this sounds reasonable, with a session variable to configure it. For implementation, we just need to sync the catalog back to the frontend.
Note that we must differentiate Finished and Creating States of the MV, within the catalog, so we know whether to expose an MV to the batch side.

I'm already planning to work on this part (syncing catalog to fe), so we can unify Drop and Cancel, so the work should have some overlap.

@kwannoel kwannoel self-assigned this May 12, 2024
@xxchan
Copy link
Member

xxchan commented May 12, 2024

it looks wierd to me that a creating MV can't be SELECT-ed but can be referred to create new MV. From user's perspective, the MV is still unavailable.

BTW, if we choose this approach, maybe we can return a notice on the second CREATE MV (maybe also for SET BACKGROUND_DDL=true, maybe link to a doc page) to explain the behavior.

@kwannoel
Copy link
Contributor

Btw we also need a synchronising mechanism, like wait to indicate when the MV is finished with backfill. The current implementation of it polls meta. We can let it subscribe to the observer instead.

@fuyufjh
Copy link
Member

fuyufjh commented May 13, 2024

allow users to create stream query on the creating mv, so that the downstream mv will consume the historical data faster(because there are less data than the base table) and catch up with the upstream.

Personally I prefer the second one, because it is more practical and the concept of transactional ddl is too big? BTW, recoverable backfill is also needed in this case. We can return the ddl immediately to make creating mv visible to streaming query and batch query. But for batch query consistency, we need to block it until all its upstream backfilling mvs are finished.

+1 for this idea.

Think a step further: (assuming MV2 depends on MV1 and both are creating)

  1. If the MV2 completes backfilling first, it doesn't mean MV2 has been created but only means it catches the progress with MV1. Thus, it's still invisible.
  2. If recovery happens while MV1 is creating, both MV1 and MV2 fail to create and needs to be removed from metadata. More accurately, an MV must not be considered as completed before its dependencies completes - If MV1 completes and MV2 is on-going, it's okay to mark MV1 as succeeded.

Furthermore, MV2->MV1 is the simplest form. The actual implementation need to handle a DAG of creating MVs, which seems to impose lots of complexity in stream manager.

@kwannoel
Copy link
Contributor

We should only provide this feature on background ddl I suppose. Because in many cases, users are using DBT to handle creation of stream job DAG.

For a normal stream job, we only return a response once it is done. If using background ddl, we will immediately return a response on firing the command. Then DBT can immediately continue to create the next MVs.

@kwannoel
Copy link
Contributor

Furthermore, MV2->MV1 is the simplest form. The actual implementation need to handle a DAG of creating MVs, which seems to impose lots of complexity in stream manager.

For each MV we create, we now also need to watch its upstream MVs, and only mark its state as Finished, once the upstream MVs are also Finished.
This should not affect foreground jobs, since this feature should only be provided for background ddl.

In terms of cancelling / dropping the streaming DAG, once we unify cancel / drop, we can reuse the cascade logic of drop to handle that.

@st1page
Copy link
Contributor Author

st1page commented May 15, 2024

We should only provide this feature on background ddl I suppose. Because in many cases, users are using DBT to handle creation of stream job DAG.

For a normal stream job, we only return a response once it is done. If using background ddl, we will immediately return a response on firing the command. Then DBT can immediately continue to create the next MVs.

IIUC, DBT driver has not do any special for it, so it does not use background ddl. Could and should it using background DDL by default? cc @chenzl25

@chenzl25
Copy link
Contributor

I think we should not enable background DDL by default for DBT. DBT has different models, e.g. materialized_view and table. Imaging models have the following dependency mv1 -> table2 -> mv3. If the backfilling is blocking, then table2 would contain all data from mv1 which I think is the most straightforward. If we use background DDL, the mv1 will be created immediately, but table2 could just read a small portion of the data of the backfilling mv1. Finally, mv3 will consume a small portion of data as well.

Copy link
Contributor

github-actions bot commented Aug 1, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@kwannoel
Copy link
Contributor

kwannoel commented Oct 8, 2024

Wait for snapshot backfill implemented, then we discuss the feasibility. There could be a lot of complexity in the stream manager to maintain stream job status.

See:

Furthermore, MV2->MV1 is the simplest form. The actual implementation need to handle a DAG of creating MVs, which seems to impose lots of complexity in stream manager.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants