
Use blocking (foreground) DDL for source and sinks #15587

Closed
fuyufjh opened this issue Mar 11, 2024 · 16 comments

@fuyufjh
Member

fuyufjh commented Mar 11, 2024

Background

We know that create mv on table/mv must wait until backfilling is completed by default; otherwise, the newly created MV cannot be consistent with its upstream, which breaks RisingWave's consistency guarantee as well as PG compatibility.

However, for sinks and sources, we still have to define the behavior. The acceptable options include:

  1. create mv on table/mv blocks until backfilling is completed. create mv on source and create sink return immediately.
  2. create mv on table/mv/source blocks until backfilling is completed. create sink returns immediately.
  3. create mv on table/mv/source and create sink all block until backfilling is completed. 🥇

According to a poll in the Slack channel, option 3 wins.

Another advantage of blocking DDL is that errors can be explicitly returned to users. In contrast, a failure during background DDL happens after the object has already become available to users, so it can only be printed to the logs, which is much harder for users to notice.
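For illustration, here is a hedged sketch of how that difference surfaces to users (connector properties are abbreviated and hypothetical, and any required FORMAT/ENCODE clause is omitted):

-- With blocking (foreground) DDL, a broken configuration fails the statement itself:
CREATE SINK bad_sink FROM my_table
WITH (connector = 'kafka', properties.bootstrap.server = 'unreachable:9092', topic = 't');
-- The connector error is returned directly to the session.

-- With background DDL, the same statement returns immediately,
-- and the failure can only be observed later in the logs.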

Initial motivation: https://risingwave-labs.slack.com/archives/C06B3SEQQQ1/p1708499966310609?thread_ts=1708489235.143729&cid=C06B3SEQQQ1

Tasks

We already report errors during DDL to the frontend session (#13108).

The remaining task is to adjust the default behavior: use blocking (foreground) DDL for both create mv on table/mv/source and create sink.

Note that this task only changes the default behavior. If users manually specify set background_ddl = true, we should follow it as usual.
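A minimal sketch of the intended default, with hypothetical object names and abbreviated DDL bodies:

-- Default after this change: the statement blocks until backfill completes.
CREATE SINK blocking_sink FROM my_table WITH (connector = '...');

-- Opting back into the old behavior per session: the statement returns immediately
-- and backfill continues in the background.
SET background_ddl = true;
CREATE SINK background_sink FROM my_table WITH (connector = '...');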

@github-actions github-actions bot added this to the release-1.8 milestone Mar 11, 2024
@BugenZhao
Member

Is #14172 a prerequisite for this task?

@lmatz
Contributor

lmatz commented Mar 15, 2024

will background ddl support create index?

@kwannoel
Contributor

will background ddl support create index?

Yes eventually.

@kwannoel
Contributor

kwannoel commented Mar 21, 2024

I think #13035 is a prerequisite for this.

To me this is the main benefit:

Another advantage of blocking DDL is that errors can be explicitly returned to users. In contrast, a failure during background DDL happens after the object has already become available to users, so it can only be printed to the logs, which is much harder for users to notice.


Is #14172 a prerequisite for this task?

I suppose so. I think only sources with backfill will support it. For sources created w/o backfill, we can just return immediately.
So for now it will only be supported for CDC backfill.
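A hedged sketch of that distinction (object names hypothetical, connector options elided):

-- A plain source defines no backfill, so CREATE SOURCE can keep returning immediately:
CREATE SOURCE kafka_src (id INT, v INT)
WITH (connector = 'kafka', ...) FORMAT PLAIN ENCODE JSON;

-- A CDC table backfills historical data, so it would block until backfill completes:
CREATE TABLE orders (id INT PRIMARY KEY, v INT)
WITH (connector = 'mysql-cdc', ...);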

@xxchan
Member

xxchan commented Mar 26, 2024

+1 that "errors can be explicitly returned to users" is the largest benefit.

I think #13035 is a prerequisite for this.

Similarly for sources, we also tolerate errors extensively, e.g.:

let Ok(msg) = msg else {
    // Tolerate the error: wait a second, rebuild the source reader, and keep going
    // instead of failing the streaming job.
    tokio::time::sleep(Duration::from_millis(1000)).await;
    self.rebuild_stream_reader_from_error(&source_desc, &mut stream, msg.unwrap_err())
        .await?;
    continue;
};

Note that sometimes the errors are tolerated very deep down, e.g., I previously tested that when the Kafka broker is killed, rebuild_stream_reader_from_error will not even be called. I guess this is because rdkafka retries internally. This makes it very hard to report errors in a timely manner.

For sinks I'm not sure; maybe they can raise some errors.

@BugenZhao
Member

Similarly for sources, we also tolerate errors extensively.

This reminds me of the consensus we reached earlier: sources should try their best to retry on connection failures (even indefinitely) instead of causing the entire streaming graph to enter a recovery loop, at least before risingwavelabs/rfcs#54 or risingwavelabs/rfcs#84 is implemented.

So does it mean that we will now introduce a different behavior to let the source error fail the job specifically when it's being created?

@xxchan
Member

xxchan commented Mar 26, 2024

So does it mean that we will now introduce a different behavior to let the source error fail the job specifically when it's being created?

I think if the goal is better UX (knowing errors earlier, with no need to check logs), maybe yes. Alternatively, maybe we can do more extensive validation when creating a source (maybe even attempt to consume some events? not always doable, though). But the challenge I mentioned above (errors are hidden very deep) still applies.

Do you have any better ideas for this goal?

@fuyufjh
Member Author

fuyufjh commented Apr 3, 2024

Ping. Shall we start this?

@kwannoel
Contributor

kwannoel commented Apr 3, 2024

Ping. Shall we start this?

Okay I will take a look at it tomorrow.

@kwannoel
Contributor

kwannoel commented Apr 3, 2024

Fyi, this task can add complexity to the etcd backend, since we now also need to support background DDL for sources and sinks, for users who are already used to or reliant on the current behaviour (immediate creation, no need to wait for backfill).

I will start on it tomorrow, but I'm unlikely to finish it that soon.

@kwannoel
Contributor

kwannoel commented Apr 4, 2024

Working on blocking sink first + background_ddl on sink.

Source is not as urgent IMO.

@kwannoel
Contributor

For Kafka backfill, as mentioned by @xxchan offline, it won't be the default in 1.9.
So the create mv on source part is not urgent for the release.
Only when it is made the default do we need blocking Kafka backfill.

Assigning this issue to @xxchan for the blocking create mv on source part.
I will still work on the blocking sink part.

@xiangjinwu
Contributor

Another minor benefit of blocking create-table-with-connector is that we could clean up the sleep statements in the e2e sqllogictests. Not big enough to affect the plan, but if the feature is eventually made possible, we could leverage it.

@kwannoel
Contributor

Remaining work is low priority.

@kwannoel kwannoel removed this from the release-1.9 milestone Apr 19, 2024
@kwannoel kwannoel removed their assignment Apr 19, 2024
@kwannoel kwannoel added this to the release-1.9 milestone Apr 19, 2024
Contributor

github-actions bot commented Jul 3, 2024

This issue has been open for 60 days with no activity.
If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.
You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

@xxchan
Member

xxchan commented Sep 11, 2024

Remaining tasks are tracked in #18338.
