
fix(backfill): make tombstone iteration progress across all vnodes per epoch #17266

Closed
wants to merge 25 commits into from

Conversation


@kwannoel kwannoel commented Jun 14, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Thanks @hzxa21 for sharing his diagnosis and ideas to solve it.

Resolve #17267

This is required because if we only yield the progress of a single vnode, the tombstone iteration progress of the other vnodes is dropped.
This is not an issue for no-shuffle backfill, because the current position is maintained at the parallelism level for all vnodes within it, rather than at the vnode level as in arrangement backfill.

However, we cannot simply yield all the scanned records downstream; otherwise, in each epoch we impose an invariant of up to 256 records per epoch (one per vnode) for backfill. If the processing latency of those 256 records is high, barrier latency will be high as well.
So what we do instead is add a yielded flag. When there is no snapshot read in an epoch, we set the yielded flag to false and still bump the current progress. Since we do not yield the records that were read from the snapshot, in the subsequent epoch we scan these same records again, but using an inclusive bound, so they are included in the newly scanned snapshot.

For normal snapshot read, yielded will always be true.
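The per-vnode progress logic described above can be sketched as follows. This is a minimal illustration with hypothetical names (`VnodeProgress`, `bump_progress`, `resume_bound` are not the actual executor types), showing how a `yielded = false` epoch leads to an inclusive resume bound in the next epoch:

```rust
// Illustrative sketch of the yielded-flag bookkeeping (names are
// hypothetical, not the actual arrangement backfill executor code).
#[derive(Clone, Debug, PartialEq)]
struct VnodeProgress {
    current_pos: Option<i64>, // last scanned key for this vnode
    yielded: bool,            // were rows at current_pos sent downstream?
}

// Advance one vnode after an epoch's snapshot read. `rows_sent` says
// whether any row was actually yielded downstream this epoch.
fn bump_progress(p: &mut VnodeProgress, scanned: i64, rows_sent: bool) {
    p.current_pos = Some(scanned);
    // A tombstone-only epoch still bumps progress but records
    // yielded = false, so the rows are re-read next epoch.
    p.yielded = rows_sent;
}

// Bound to resume from in the next epoch: (position, inclusive?).
// If nothing was yielded, resume with an inclusive bound.
fn resume_bound(p: &VnodeProgress) -> Option<(i64, bool)> {
    p.current_pos.map(|pos| (pos, !p.yielded))
}

fn main() {
    let mut p = VnodeProgress { current_pos: None, yielded: true };
    bump_progress(&mut p, 100, false); // tombstone-only epoch
    assert_eq!(resume_bound(&p), Some((100, true))); // inclusive resume
    bump_progress(&mut p, 200, true); // normal snapshot read
    assert_eq!(resume_bound(&p), Some((200, false))); // exclusive resume
    println!("ok");
}
```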

Want to add a separate test later.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@github-actions github-actions bot added the type/fix Bug fix label Jun 14, 2024
@kwannoel kwannoel requested review from hzxa21 and chenzl25 June 14, 2024 16:28
@kwannoel kwannoel changed the title fix(backfill): make progress across all vnodes per epoch fix(backfill): make tombstone iteration progress across all vnodes per epoch Jun 14, 2024
// `yielded` column
// Only present for arrangement backfill.
if is_arrangement_backfill {
catalog_builder.add_column(&Field::with_name(DataType::Boolean, "yielded"));
Collaborator

Do we need to ensure backward compatibility after adding this new column? In other words, is it possible that an arrangement backfill is triggered in an old version and resumed in a new version?

Contributor Author

True, I will handle it later.

Contributor Author

@kwannoel kwannoel Jun 15, 2024

Handled in 66f44cb.

We only need to handle the deserialization path, because backfill jobs fall into one of 3 categories:

  1. Finished -> They won't write to the state table again, so there's no need to handle the encoding path. They still read from the state table, so we must handle the decoding path.
  2. Not finished -> They will be cleaned up after upgrade, so there's no need to handle backwards compatibility at all.
  3. Background ddl -> They won't be cleaned up on upgrade, and will panic. We can let users recreate these jobs. I think it's unlikely to be encountered, since stream job creation should not take a long time. But if users do hit it, recreating the stream job is an easy fix.
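The decode-path handling in category 1 can be sketched roughly like this. This is a hypothetical simplification (the real state table rows hold typed datums, not strings, and the actual compat logic lives in the commit referenced above): a row written by an old version is one column short, so it is padded with a default `yielded = true`, since old versions always yielded whatever they scanned.

```rust
// Hypothetical sketch of decode-path backward compatibility: pad an
// old-version backfill state row that lacks the trailing `yielded`
// column with a default of `true`.
fn pad_old_state_row(
    mut row: Vec<Option<String>>,
    new_width: usize,
) -> Vec<Option<String>> {
    assert!(row.len() <= new_width);
    while row.len() < new_width {
        // Old versions always yielded the rows they scanned, so the
        // safe default for the missing `yielded` column is `true`.
        row.push(Some("true".to_string()));
    }
    row
}

fn main() {
    // Old schema: just the current_pos column.
    let old = vec![Some("pos=42".to_string())];
    let padded = pad_old_state_row(old, 2);
    assert_eq!(padded[1].as_deref(), Some("true"));
    println!("ok");
}
```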

@chenzl25
Contributor

// TODO(kwannoel): We can provide an option between snapshot read in parallel vs serial.
let vnode_row_iter = select_all(iterators);

Can we try to provide an option to allow arrangement backfill snapshot read to read in serial?

@kwannoel kwannoel requested a review from hzxa21 June 15, 2024 03:29
@kwannoel
Contributor Author

kwannoel commented Jun 15, 2024

// TODO(kwannoel): We can provide an option between snapshot read in parallel vs serial.
let vnode_row_iter = select_all(iterators);

Can we try to provide an option to allow arrangement backfill snapshot read to read in serial?

I tried it before, but performance is worse. We can provide it as a session variable I suppose.

In this case, I think snapshot read in serial may worsen the problem.
Because previously, we can do tombstone iter concurrently (just missing a "merge" step), but now we only do tombstone iter in serial across all vnodes.

Comment on lines 411 to 436
remaining_vnodes.set_bit(vnode_idx);
if remaining_vnodes.is_empty() {
break;
}
Contributor

After setting a bit, can the bitmap ever be empty?

Contributor Author

Right, the bit should be cleared (set to 0) instead. Let me fix it later.

Contributor Author

b29adfd Fixed.

Since Bitmap is designed to be an immutable data structure (as per its docs), it is hard to update correctly.
I have opted to convert it to a BitSet instead.
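The corrected bookkeeping amounts to tracking the unfinished vnodes in a mutable bitset, clearing a vnode's bit when its iterator is exhausted, and stopping once no bits remain. A self-contained sketch (illustrative only, not the actual `Bitmap`/`BitSet` types in the codebase):

```rust
const BITS: usize = 64;

// Minimal mutable bitset for tracking which vnodes still have rows.
struct BitSet {
    words: Vec<u64>,
    num_bits: usize,
    count_ones: usize,
}

impl BitSet {
    // Start with every vnode's bit set.
    fn all_set(num_bits: usize) -> Self {
        let mut words = vec![!0u64; (num_bits + BITS - 1) / BITS];
        // Mask off bits beyond num_bits in the last word.
        if num_bits % BITS != 0 {
            let last = words.len() - 1;
            words[last] &= (1u64 << (num_bits % BITS)) - 1;
        }
        BitSet { words, num_bits, count_ones: num_bits }
    }

    // Clear a vnode's bit when its iterator is exhausted (idempotent).
    fn clear_bit(&mut self, idx: usize) {
        assert!(idx < self.num_bits);
        let mask = 1u64 << (idx % BITS);
        if self.words[idx / BITS] & mask != 0 {
            self.words[idx / BITS] &= !mask;
            self.count_ones -= 1;
        }
    }

    // All vnodes done: the snapshot read loop can break.
    fn is_empty(&self) -> bool {
        self.count_ones == 0
    }
}

fn main() {
    let mut remaining = BitSet::all_set(4);
    for vnode in 0..4 {
        remaining.clear_bit(vnode);
    }
    assert!(remaining.is_empty());
    println!("ok");
}
```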

Comment on lines 464 to 469
pub fn set_bit(&mut self, idx: usize) {
assert!(idx < self.len());
if let Some(bits) = &mut self.bits {
bits[idx / BITS] |= 1 << (idx % BITS);
} else {
self.count_ones += 1;
if self.count_ones == self.num_bits {
self.bits = None;
}
}
}
Contributor

It seems that count_ones should be updated as well.
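A hedged sketch of a `set_bit` that keeps `count_ones` consistent, per the comment above. The struct here is simplified (the real `Bitmap` has a richer representation); it only illustrates the two invariants the reviewer points at: increment `count_ones` when a zero bit actually flips, and collapse to the all-ones representation (`bits = None`) when the count reaches `num_bits`:

```rust
const BITS: usize = 64;

// Simplified stand-in for the real Bitmap: `bits == None` means all set.
struct Bitmap {
    bits: Option<Vec<u64>>,
    num_bits: usize,
    count_ones: usize,
}

impl Bitmap {
    fn set_bit(&mut self, idx: usize) {
        assert!(idx < self.num_bits);
        if let Some(bits) = &mut self.bits {
            let mask = 1u64 << (idx % BITS);
            // Only a 0 -> 1 transition changes the count.
            if bits[idx / BITS] & mask == 0 {
                bits[idx / BITS] |= mask;
                self.count_ones += 1; // keep the cached count in sync
                if self.count_ones == self.num_bits {
                    self.bits = None; // collapse to all-ones representation
                }
            }
        }
        // else: already all ones, nothing to do
    }
}

fn main() {
    let mut bm = Bitmap { bits: Some(vec![0u64]), num_bits: 2, count_ones: 0 };
    bm.set_bit(0);
    assert_eq!(bm.count_ones, 1);
    bm.set_bit(1);
    assert_eq!(bm.count_ones, 2);
    assert!(bm.bits.is_none()); // collapsed to all-ones
    println!("ok");
}
```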

@chenzl25
Contributor

// TODO(kwannoel): We can provide an option between snapshot read in parallel vs serial.
let vnode_row_iter = select_all(iterators);

Can we try to provide an option to allow arrangement backfill snapshot read to read in serial?

I tried it before, but performance is worse. We can provide it as a session variable I suppose.

In this case, I think snapshot read in serial may worsen the problem. Because previously, we can do tombstone iter concurrently (just missing a "merge" step), but now we only do tombstone iter in serial across all vnodes.

IIUC, in this tombstone case the no-shuffle backfill snapshot read is well aligned with the tombstones of all vnodes, so it can make progress in parallel, which is faster than reading each vnode's snapshot sequentially.

@kwannoel
Contributor Author

kwannoel commented Jun 15, 2024

// TODO(kwannoel): We can provide an option between snapshot read in parallel vs serial.
let vnode_row_iter = select_all(iterators);

Can we try to provide an option to allow arrangement backfill snapshot read to read in serial?

I tried it before, but performance is worse. We can provide it as a session variable I suppose.

In this case, I think snapshot read in serial may worsen the problem. Because previously, we can do tombstone iter concurrently (just missing a "merge" step), but now we only do tombstone iter in serial across all vnodes.

IIUC, in this tombstone case, the no-shuffle backfill snapshot read is aligned with tombstones of all vnodes so well, hence, it can make progress in parallel which is faster than snapshot read per vnode sequentially.

Yes, exactly. So we should keep snapshot read in parallel for arrangement backfill as well. The only difference is that no-shuffle backfill maintains a current pos across all vnodes, so the tombstone iteration progress isn't lost. We adapt that logic in this PR as well: ensure that per epoch, the current pos across all vnodes bumps up.

@kwannoel
Contributor Author

kwannoel commented Jun 15, 2024

When you mention iter in serial, do you mean running each iterator sequentially rather than concurrently? Or do you mean iterating concurrently but doing a merge sort like we do in no-shuffle backfill, maintaining a current pos per partition rather than per vnode?

@chenzl25
Contributor

chenzl25 commented Jun 15, 2024

When you mention iter in serial, do you mean running each iterator sequentially rather than concurrently? Or do you mean iterating concurrently but doing a merge sort like we do in no-shuffle backfill, maintaining a current pos per partition rather than per vnode?

I mean running each iterator sequentially. However, in that case it would be 256 times slower (if we have 256 vnodes in an executor) than the no-shuffle backfill's concurrent read.
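The trade-off in this thread can be illustrated with a small synchronous analogue (illustrative only; the real code polls async per-vnode iterators via `select_all`): interleaved reading lets every vnode make progress each step, while serial reading drains one vnode completely before touching the next, which is why tombstone progress stalls on the untouched vnodes.

```rust
// Interleave all vnode iterators round-robin, so each vnode makes
// progress on every pass (a synchronous stand-in for select_all).
fn interleaved(mut iters: Vec<std::vec::IntoIter<i32>>) -> Vec<i32> {
    let mut out = Vec::new();
    loop {
        let mut progressed = false;
        for it in iters.iter_mut() {
            if let Some(x) = it.next() {
                out.push(x);
                progressed = true;
            }
        }
        if !progressed {
            break; // every vnode iterator is exhausted
        }
    }
    out
}

// Drain each vnode iterator to completion before starting the next.
fn serial(iters: Vec<std::vec::IntoIter<i32>>) -> Vec<i32> {
    iters.into_iter().flatten().collect()
}

fn main() {
    let vnodes = vec![vec![1, 2].into_iter(), vec![10, 20].into_iter()];
    assert_eq!(interleaved(vnodes), vec![1, 10, 2, 20]);
    let vnodes = vec![vec![1, 2].into_iter(), vec![10, 20].into_iter()];
    assert_eq!(serial(vnodes), vec![1, 2, 10, 20]);
    println!("ok");
}
```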

@kwannoel kwannoel force-pushed the kwannoel/one-per-vnode branch from 7e6e5a4 to 9ca1762 Compare June 15, 2024 16:47
kwannoel added 12 commits June 17, 2024 02:56
- In the prev epoch, the current_pos may be bumped when iterating tombstone ranges.
- In such a scenario, the current pos may be an exclusive bound.
- In the subsequent epoch, the current pos may be deleted, so when we scan from it, we don't actually get anything back.
- Then we may finish the snapshot read in that epoch with the state showing that the current pos was not yielded. But that is fine.
@kwannoel kwannoel force-pushed the kwannoel/one-per-vnode branch from e55509c to f51e190 Compare June 16, 2024 18:56
@kwannoel
Contributor Author

kwannoel commented Jun 17, 2024

It passes the tombstone test case now, but its tombstone iteration is still slower than no-shuffle backfill:
8m20s vs 6m9s.

https://buildkite.com/risingwavelabs/main-cron/builds/2649#0190244b-06f3-4872-a0f5-c2ef01c8a4cd


In main-cron, the runtime for the arrangement backfill tombstone scan test is around 6m as well, so there's a regression: https://buildkite.com/risingwavelabs/main-cron/builds/2644#019021c8-381b-41ba-90df-e7b66cd77834

Trying to optimize it further.

Contributor

This PR has been open for 60 days with no activity.

If it's blocked by code review, feel free to ping a reviewer or ask someone else to review it.

If you think it is still relevant today, and have time to work on it in the near future, you can comment to update the status, or just manually remove the no-pr-activity label.

You can also confidently close this PR to keep our backlog clean. (If no further action taken, the PR will be automatically closed after 7 days. Sorry! 🙏)
Don't worry if you think the PR is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

Contributor

Close this PR as there's no further actions taken after it is marked as stale for 7 days. Sorry! 🙏

You can reopen it when you have time to continue working on it.

@github-actions github-actions bot closed this Aug 24, 2024
Successfully merging this pull request may close these issues.

arrangement backfill can be slow when there are consecutive tombstones in upstream table