Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(backfill): Persist backfill operator state #9752

Merged
merged 108 commits into from
May 24, 2023
Merged

Conversation

kwannoel
Copy link
Contributor

@kwannoel kwannoel commented May 11, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Many thanks to @chenzl25 who has provided lots of advice, discussion and review around it.


Changes

Backfill state is persisted. There are still further changes before #8051 can be fully supported, specifically allowing meta to persist partially complete state, implementing the logic to recover from partially complete state, and defining valid partially complete states.

This also means pretty much all queries become stateful. This is due to StreamTableScan becoming stateful (if backfill is used).

Implementation notes

Persist backfill operator state. This is encoded in the following form:

| vnode | pos (upstream_pk) | backfill_finished |

Progress is partitioned by vnode and updated for all pos.
If backfill is finished, backfill_finished column should be true.

This PR involves changes to:

  • Meta service, so that it knows to drop and update internal tables for the indexes.
  • Frontend plan for StreamTableScan, to create a table catalog for it.
  • The backfill executor itself which contains persistance logic.

Closes #6275

Checklist:

  • Persist backfill end state.
  • Fix errors from ./risedev test.
  • Fix e2e test.
  • Ensure state table is partitioned by vnode, if scan is UpstreamHashShard.
  • Refine PR.
  • Remove dist_in_pk_indices, since we will write to all vnodes.
  • Check compute performance change. Benchmarking in progress.
  • Mention that progress update only happens on second barrier.

Checklist For Contributors

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • I have demonstrated that backward compatibility is not broken by breaking changes and created issues to track deprecated features to be removed in the future. (Please refer to the issue)
  • All checks passed in ./risedev check (or alias, ./risedev c)

Checklist For Reviewers

  • I have requested macro/micro-benchmarks as this PR can affect performance substantially, and the results are shown.

Documentation

  • My PR DOES NOT contain user-facing changes.
Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

@kwannoel kwannoel requested a review from chenzl25 May 11, 2023 08:19
@chenzl25 chenzl25 requested a review from BugenZhao May 11, 2023 09:17
@kwannoel kwannoel marked this pull request as ready for review May 12, 2023 18:10
@kwannoel

This comment was marked as outdated.

@kwannoel

This comment was marked as resolved.

@kwannoel kwannoel marked this pull request as draft May 12, 2023 19:02
@kwannoel kwannoel changed the title feat(backfill): Persist backfill operator state feat(backfill): Persist backfill operator state [IGNORE ME] May 13, 2023
@kwannoel kwannoel marked this pull request as ready for review May 13, 2023 04:49
@kwannoel kwannoel marked this pull request as draft May 13, 2023 04:58
@kwannoel kwannoel marked this pull request as ready for review May 13, 2023 05:12
@kwannoel kwannoel marked this pull request as draft May 13, 2023 05:17
@kwannoel kwannoel force-pushed the kwannoel/persist-backfill branch from 0fe7198 to 521d683 Compare May 15, 2023 15:17
@kwannoel kwannoel marked this pull request as ready for review May 15, 2023 15:18
@kwannoel kwannoel force-pushed the kwannoel/persist-backfill branch from 521d683 to 6302d34 Compare May 16, 2023 05:37
@codecov
Copy link

codecov bot commented May 17, 2023

Codecov Report

Merging #9752 (2cbd14a) into main (126782a) will decrease coverage by 0.07%.
The diff coverage is 20.91%.

@@            Coverage Diff             @@
##             main    #9752      +/-   ##
==========================================
- Coverage   71.17%   71.11%   -0.07%     
==========================================
  Files        1250     1249       -1     
  Lines      209551   209727     +176     
==========================================
- Hits       149155   149145      -10     
- Misses      60396    60582     +186     
Flag Coverage Δ
rust 71.11% <20.91%> (-0.07%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/common/src/hash/consistent_hash/bitmap.rs 70.00% <0.00%> (-30.00%) ⬇️
src/common/src/hash/consistent_hash/vnode.rs 93.85% <ø> (ø)
src/common/src/util/stream_graph_visitor.rs 53.89% <0.00%> (-0.71%) ⬇️
...rc/frontend/src/optimizer/property/distribution.rs 73.11% <ø> (ø)
src/meta/src/rpc/ddl_controller.rs 0.00% <0.00%> (ø)
src/stream/src/executor/backfill.rs 0.00% <0.00%> (ø)
src/stream/src/from_proto/chain.rs 0.00% <0.00%> (ø)
src/meta/src/manager/catalog/mod.rs 24.74% <12.19%> (-0.43%) ⬇️
src/frontend/src/optimizer/plan_node/mod.rs 93.18% <100.00%> (ø)
...ntend/src/optimizer/plan_node/stream_table_scan.rs 98.61% <100.00%> (+1.95%) ⬆️
... and 2 more

... and 13 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@chenzl25 chenzl25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, we think there are no internal states for creating indexes previously, but after this PR, every mv on mv will have states. We need to modify the meta to persist those internal states for creating indexes, otherwise, we can't see the backfill internal states for indexes.

src/stream/src/executor/backfill.rs Outdated Show resolved Hide resolved
src/stream/src/executor/backfill.rs Outdated Show resolved Hide resolved
src/stream/src/executor/backfill.rs Outdated Show resolved Hide resolved
@kwannoel kwannoel force-pushed the kwannoel/persist-backfill branch 2 times, most recently from 424daa3 to c33d8d5 Compare May 19, 2023 11:03
@kwannoel kwannoel force-pushed the kwannoel/persist-backfill branch from f1d2b5a to 76b5206 Compare May 19, 2023 14:06
@kwannoel kwannoel force-pushed the kwannoel/persist-backfill branch from ecb6ba3 to 2bd05bb Compare May 19, 2023 18:10
@kwannoel

This comment was marked as outdated.

@kwannoel kwannoel force-pushed the kwannoel/persist-backfill branch 2 times, most recently from 451cbe4 to 330d5af Compare May 19, 2023 19:09
Copy link
Member

@BugenZhao BugenZhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we decided to store the state per-vnode, are we going to support scaling a materialized view that is being created? If so, I guess there's some further stuff to resolve:

  • We should revisit the logic of the scaling on the meta service.
    state @ table_fragments::State::Initial
    | state @ table_fragments::State::Creating => {
    bail!(
    "the materialized view of fragment {fragment_id} is in state {}",
    state.as_str_name()
    )
    }
  • We need to refine the progress report to vnode granularity, instead of actor in the current design.

src/stream/src/executor/backfill.rs Outdated Show resolved Hide resolved
@BugenZhao
Copy link
Member

BTW, do we have some tests for checking whether the recovery works? IMO, we should check the fail-over occurred

  • during backfill
  • after the whole progress finishes
  • after some actors finish their progress while others not

A convenient way for testing this could be writing some unit tests, where we can drop the executor instance and rebuild it with the same state store backend to simulate the fail-over progress. For example:

// push the 3nd barrier
tx.push_barrier(3, false);
executor.next().await.unwrap().unwrap();
// Drop executor
drop(executor);
// Build new executor
let (executor, mut tx) = create_watermark_filter_executor(mem_state.clone()).await;
let mut executor = executor.execute();
// push the 1st barrier after failover
tx.push_barrier(4, false);
executor.next().await.unwrap().unwrap();
// Init watermark after failover
let watermark = executor.next().await.unwrap().unwrap();
assert_eq!(
watermark.into_watermark().unwrap(),
watermark!(ScalarImpl::Timestamp(
Date::from_ymd_uncheck(2022, 11, 9).and_hms_uncheck(0, 0, 0)
))
);

@kwannoel
Copy link
Contributor Author

kwannoel commented May 23, 2023

BTW, do we have some tests for checking whether the recovery works? IMO, we should check the fail-over occurred

Recovery is not fully supported in this PR. Agreed on adding the tests in future PRs, when we support recovery for partial backfill state.

This PR just adds state persistence, there still needs to be changes to meta to scale created mview.
Backfill executor itself also needs to decide what to do in the future with a mix of finished and unfinished progress.

For now I think adding some simple recovery tests to check internal state pre and post recovery should suffice + materialized view contents. What do you think?

@BugenZhao
Copy link
Member

For now I think adding some simple recovery tests to check internal state pre and post recovery should suffice + materialized view contents. What do you think?

Recovery is not fully supported in this PR.

Then I think it's okay to leave it to next PRs. 🫡

src/stream/src/executor/backfill.rs Outdated Show resolved Hide resolved
src/meta/src/manager/catalog/mod.rs Outdated Show resolved Hide resolved
@kwannoel kwannoel requested review from BugenZhao, hzxa21 and chenzl25 May 23, 2023 14:55
@kwannoel

This comment was marked as resolved.

Copy link
Contributor

@chenzl25 chenzl25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

src/meta/src/manager/catalog/mod.rs Outdated Show resolved Hide resolved
src/meta/src/manager/catalog/mod.rs Outdated Show resolved Hide resolved
@kwannoel kwannoel added this pull request to the merge queue May 24, 2023
Merged via the queue into main with commit 5e80fd7 May 24, 2023
@kwannoel kwannoel deleted the kwannoel/persist-backfill branch May 24, 2023 05:58
kwannoel added a commit that referenced this pull request May 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support backfill operator for mv on mv
4 participants