
Support sink decoupling during backfill for CREATE SINK INTO TABLE #19285

Open
hzxa21 opened this issue Nov 7, 2024 · 6 comments

@hzxa21
Collaborator

hzxa21 commented Nov 7, 2024

Is your feature request related to a problem? Please describe.

Backfilling can backpressure the upstream, causing existing streaming jobs to slow down or even get stuck. There are three cases where backfilling can happen:

  1. CREATE MV
  2. CREATE SINK with connector
  3. CREATE SINK INTO TABLE

The current ways to mitigate backfilling's effect on the upstream are:

  • SET BACKFILL_RATE_LIMIT to xxx. Supported for 1, 2, and 3.
  • SET sink_decouple to true (default on). Supported for 2.
  • SET streaming_use_snapshot_backfill to true (default off, currently experimental). Supported for 1.

The only effective option for 3 is the rate limit, which requires manual operation and an understanding of the workload before a good value can be determined. Therefore, I think we should support sink decoupling for sink into table as well. This is also a prerequisite for doing serverless backfill for sink into table.
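
For reference, the existing mitigations are set through session variables along these lines (a rough sketch based on the list above; the rate-limit value is an arbitrary placeholder and the exact syntax or accepted values may vary across versions):

```sql
-- Throttle backfill throughput; applies to cases 1, 2, and 3.
-- The value below is an arbitrary example, not a recommendation.
SET BACKFILL_RATE_LIMIT TO 1000;

-- Decouple the sink from the upstream via the log store; applies to case 2 only today.
SET sink_decouple = true;

-- Experimental snapshot backfill; applies to case 1.
SET streaming_use_snapshot_backfill = true;
```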

Describe the solution you'd like

There are two ways to implement sink decoupling for sink into table:

  1. Use kv log store for SINK INTO TABLE, similar to what we did for sink with connector.
  2. Record L0 changelog and support snapshot backfilling for SINK INTO TABLE, similar to what we did for MV.
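
Either way, from the user's point of view this could end up looking like the existing sink_decouple flag simply also taking effect for CREATE SINK ... INTO TABLE. A hypothetical usage sketch (all table, sink, and column names are made up, and the issue does not specify the final user-facing syntax):

```sql
-- Hypothetical: sink_decouple would also apply to sinks into tables once this feature lands.
SET sink_decouple = true;

-- Illustrative sink into a table; object names here are examples only.
CREATE SINK enriched_orders_sink INTO enriched_orders AS
SELECT o.order_id, o.amount, c.region
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id;
```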

Describe alternatives you've considered

No response

Additional context

No response

@github-actions github-actions bot added this to the release-2.2 milestone Nov 7, 2024
@st1page
Contributor

st1page commented Nov 7, 2024

The general idea LGTM. I think we need a more detailed design to ensure that the data in the log store can converge to zero.

@kwannoel kwannoel self-assigned this Nov 20, 2024
@kwannoel
Contributor

This work overlaps with unaligned join (the log store executor can be used for both). Will write a design doc for this.

@kwannoel
Contributor

kwannoel commented Nov 25, 2024

Actually why just during backfill? Shouldn't sink_decouple always let the downstream sink be decoupled from upstream?

@st1page
Contributor

st1page commented Nov 25, 2024

> Actually why just during backfill? Shouldn't sink_decouple always let the downstream sink be decoupled from upstream?

Outside of the backfilling period, the downstream MV will wait for the upstream barrier to align, and there is no way to make the downstream progress faster.

@kwannoel
Contributor

> > Actually why just during backfill? Shouldn't sink_decouple always let the downstream sink be decoupled from upstream?
>
> Outside of the backfilling period, the downstream MV will wait for the upstream barrier to align, and there is no way to make the downstream progress faster.

Why? If we use kv_log_store, it will just buffer the changes, and the barrier can pass once these changes have been written to the log store.

@st1page
Contributor

st1page commented Nov 27, 2024

> > > Actually why just during backfill? Shouldn't sink_decouple always let the downstream sink be decoupled from upstream?
> >
> > Outside of the backfilling period, the downstream MV will wait for the upstream barrier to align, and there is no way to make the downstream progress faster.
>
> Why? If we use kv_log_store, it will just buffer the changes, and the barrier can pass once these changes have been written to the log store.

  1. In the current design of the other sinks' log store, barriers are also persisted in the log store.
  2. If you are considering this approach: if there is a crash in the upstream sink, the changes on the downstream table cannot be applied exactly once.
