
fix(backfill): make tombstone iteration progress across all vnodes per epoch #17266

Closed
wants to merge 25 commits into from

Conversation


@kwannoel kwannoel commented Jun 14, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Thanks @hzxa21 for sharing his diagnosis and ideas to solve it.

Resolve #17267

This is required because if we only yield the progress of a single vnode, the tombstone iteration progress of the other vnodes is dropped.
This is not an issue for no-shuffle backfill, because the current position is maintained at the parallelism level for all vnodes within it, rather than at the vnode level as in arrangement backfill.

However, we cannot simply yield all the scanned records downstream; otherwise, in each epoch we impose an invariant of up to 256 records per epoch (one per vnode) for backfill. If the processing latency of those 256 records is high, barrier latency will be high as well.
So what we do instead is add a yielded flag. When there is no snapshot read in an epoch, we set the yielded flag to false and still bump the current progress. Since we do not yield the records that were read from the snapshot, in the subsequent epoch we scan these same records again, but using an inclusive bound, so they are included in the newly scanned snapshot.

For normal snapshot read, yielded will always be true.
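The per-vnode progress logic described above can be sketched as follows. This is a minimal illustration with hypothetical names (`VnodeProgress`, `bump_progress`, `resume_bound` are not the actual executor types), showing how a `yielded = false` epoch leads to an inclusive resume bound in the next epoch:

```rust
// Illustrative sketch of the yielded-flag bookkeeping (names are
// hypothetical, not the actual arrangement backfill executor code).
#[derive(Clone, Debug, PartialEq)]
struct VnodeProgress {
    current_pos: Option<i64>, // last scanned key for this vnode
    yielded: bool,            // were rows at current_pos sent downstream?
}

// Advance one vnode after an epoch's snapshot read. `rows_sent` says
// whether any row was actually yielded downstream this epoch.
fn bump_progress(p: &mut VnodeProgress, scanned: i64, rows_sent: bool) {
    p.current_pos = Some(scanned);
    // A tombstone-only epoch still bumps progress but records
    // yielded = false, so the rows are re-read next epoch.
    p.yielded = rows_sent;
}

// Bound to resume from in the next epoch: (position, inclusive?).
// If nothing was yielded, resume with an inclusive bound.
fn resume_bound(p: &VnodeProgress) -> Option<(i64, bool)> {
    p.current_pos.map(|pos| (pos, !p.yielded))
}

fn main() {
    let mut p = VnodeProgress { current_pos: None, yielded: true };
    bump_progress(&mut p, 100, false); // tombstone-only epoch
    assert_eq!(resume_bound(&p), Some((100, true))); // inclusive resume
    bump_progress(&mut p, 200, true); // normal snapshot read
    assert_eq!(resume_bound(&p), Some((200, false))); // exclusive resume
    println!("ok");
}
```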

Want to add a separate test later.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@github-actions github-actions bot added the type/fix Bug fix label Jun 14, 2024
@kwannoel kwannoel requested review from hzxa21 and chenzl25 June 14, 2024 16:28
@kwannoel kwannoel changed the title fix(backfill): make progress across all vnodes per epoch fix(backfill): make tombstone iteration progress across all vnodes per epoch Jun 14, 2024
// `yielded` column
// Only present for arrangement backfill.
if is_arrangement_backfill {
catalog_builder.add_column(&Field::with_name(DataType::Boolean, "yielded"));
Collaborator

Do we need to ensure backward compatibility after adding this new column? In other words, is it possible that an arrangement backfill is triggered in an old version and resumed in a new version?

Contributor Author

True, I will handle it later.

Contributor Author

@kwannoel kwannoel Jun 15, 2024

Handled in 66f44cb.

We only need to handle the deserialization path, because backfill jobs fall into one of 3 categories:

  1. Finished -> They won't write to the state table again, so there's no need to handle the encoding path. They still read from the state table, so we must handle the decoding path.
  2. Not finished -> They will be cleaned up after upgrade, so there's no need to handle backwards compatibility at all.
  3. Background ddl -> They won't be cleaned up on upgrade, and will panic. We can let users recreate these jobs. I think it's unlikely to be encountered, since stream job creation should not take a long time. But if users do hit it, recreating the stream job is an easy fix.
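The decode-path handling in category 1 can be sketched roughly like this. This is a hypothetical simplification (the real state table rows hold typed datums, not strings, and the actual compat logic lives in the commit referenced above): a row written by an old version is one column short, so it is padded with a default `yielded = true`, since old versions always yielded whatever they scanned.

```rust
// Hypothetical sketch of decode-path backward compatibility: pad an
// old-version backfill state row that lacks the trailing `yielded`
// column with a default of `true`.
fn pad_old_state_row(
    mut row: Vec<Option<String>>,
    new_width: usize,
) -> Vec<Option<String>> {
    assert!(row.len() <= new_width);
    while row.len() < new_width {
        // Old versions always yielded the rows they scanned, so the
        // safe default for the missing `yielded` column is `true`.
        row.push(Some("true".to_string()));
    }
    row
}

fn main() {
    // Old schema: just the current_pos column.
    let old = vec![Some("pos=42".to_string())];
    let padded = pad_old_state_row(old, 2);
    assert_eq!(padded[1].as_deref(), Some("true"));
    println!("ok");
}
```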

@chenzl25
Contributor

// TODO(kwannoel): We can provide an option between snapshot read in parallel vs serial.
let vnode_row_iter = select_all(iterators);

Can we try to provide an option to allow arrangement backfill snapshot read to read in serial?

@kwannoel kwannoel requested a review from hzxa21 June 15, 2024 03:29
@kwannoel
Contributor Author

kwannoel commented Jun 15, 2024

// TODO(kwannoel): We can provide an option between snapshot read in parallel vs serial.
let vnode_row_iter = select_all(iterators);

Can we try to provide an option to allow arrangement backfill snapshot read to read in serial?

I tried it before, but performance is worse. We can provide it as a session variable I suppose.

In this case, I think snapshot read in serial may worsen the problem.
Because previously, we can do tombstone iter concurrently (just missing a "merge" step), but now we only do tombstone iter in serial across all vnodes.

Comment on lines 411 to 436
remaining_vnodes.set_bit(vnode_idx);
if remaining_vnodes.is_empty() {
break;
}
Contributor

After setting a bit, can the bitmap ever be empty?

Contributor Author

Right, the bit should be cleared (set to 0) instead. Let me fix it later.

Contributor Author

b29adfd Fixed.

Since Bitmap is designed to be an immutable data structure (as per its docs), it is hard to update correctly.
I have opted to convert it to a BitSet instead.
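The corrected bookkeeping amounts to tracking the unfinished vnodes in a mutable bitset, clearing a vnode's bit when its iterator is exhausted, and stopping once no bits remain. A self-contained sketch (illustrative only, not the actual `Bitmap`/`BitSet` types in the codebase):

```rust
const BITS: usize = 64;

// Minimal mutable bitset for tracking which vnodes still have rows.
struct BitSet {
    words: Vec<u64>,
    num_bits: usize,
    count_ones: usize,
}

impl BitSet {
    // Start with every vnode's bit set.
    fn all_set(num_bits: usize) -> Self {
        let mut words = vec![!0u64; (num_bits + BITS - 1) / BITS];
        // Mask off bits beyond num_bits in the last word.
        if num_bits % BITS != 0 {
            let last = words.len() - 1;
            words[last] &= (1u64 << (num_bits % BITS)) - 1;
        }
        BitSet { words, num_bits, count_ones: num_bits }
    }

    // Clear a vnode's bit when its iterator is exhausted (idempotent).
    fn clear_bit(&mut self, idx: usize) {
        assert!(idx < self.num_bits);
        let mask = 1u64 << (idx % BITS);
        if self.words[idx / BITS] & mask != 0 {
            self.words[idx / BITS] &= !mask;
            self.count_ones -= 1;
        }
    }

    // All vnodes done: the snapshot read loop can break.
    fn is_empty(&self) -> bool {
        self.count_ones == 0
    }
}

fn main() {
    let mut remaining = BitSet::all_set(4);
    for vnode in 0..4 {
        remaining.clear_bit(vnode);
    }
    assert!(remaining.is_empty());
    println!("ok");
}
```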

Comment on lines 464 to 469
pub fn set_bit(&mut self, idx: usize) {
assert!(idx < self.len());
if let Some(bits) = &mut self.bits {
bits[idx / BITS] |= 1 << (idx % BITS);
} else {
self.count_ones += 1;
if self.count_ones == self.num_bits {
self.bits = None;
}
}
}
Contributor

It seems that count_ones should be updated as well.
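A hedged sketch of a `set_bit` that keeps `count_ones` consistent, per the comment above. The struct here is simplified (the real `Bitmap` has a richer representation); it only illustrates the two invariants the reviewer points at: increment `count_ones` when a zero bit actually flips, and collapse to the all-ones representation (`bits = None`) when the count reaches `num_bits`:

```rust
const BITS: usize = 64;

// Simplified stand-in for the real Bitmap: `bits == None` means all set.
struct Bitmap {
    bits: Option<Vec<u64>>,
    num_bits: usize,
    count_ones: usize,
}

impl Bitmap {
    fn set_bit(&mut self, idx: usize) {
        assert!(idx < self.num_bits);
        if let Some(bits) = &mut self.bits {
            let mask = 1u64 << (idx % BITS);
            // Only a 0 -> 1 transition changes the count.
            if bits[idx / BITS] & mask == 0 {
                bits[idx / BITS] |= mask;
                self.count_ones += 1; // keep the cached count in sync
                if self.count_ones == self.num_bits {
                    self.bits = None; // collapse to all-ones representation
                }
            }
        }
        // else: already all ones, nothing to do
    }
}

fn main() {
    let mut bm = Bitmap { bits: Some(vec![0u64]), num_bits: 2, count_ones: 0 };
    bm.set_bit(0);
    assert_eq!(bm.count_ones, 1);
    bm.set_bit(1);
    assert_eq!(bm.count_ones, 2);
    assert!(bm.bits.is_none()); // collapsed to all-ones
    println!("ok");
}
```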

@chenzl25
Contributor

// TODO(kwannoel): We can provide an option between snapshot read in parallel vs serial.
let vnode_row_iter = select_all(iterators);

Can we try to provide an option to allow arrangement backfill snapshot read to read in serial?

I tried it before, but performance is worse. We can provide it as a session variable I suppose.

In this case, I think snapshot read in serial may worsen the problem. Because previously, we can do tombstone iter concurrently (just missing a "merge" step), but now we only do tombstone iter in serial across all vnodes.

IIUC, in this tombstone case the no-shuffle backfill snapshot read is well aligned with the tombstones of all vnodes, so it can make progress in parallel, which is faster than reading each vnode's snapshot sequentially.

@kwannoel
Contributor Author

kwannoel commented Jun 15, 2024

// TODO(kwannoel): We can provide an option between snapshot read in parallel vs serial.
let vnode_row_iter = select_all(iterators);

Can we try to provide an option to allow arrangement backfill snapshot read to read in serial?

I tried it before, but performance is worse. We can provide it as a session variable I suppose.

In this case, I think snapshot read in serial may worsen the problem. Because previously, we can do tombstone iter concurrently (just missing a "merge" step), but now we only do tombstone iter in serial across all vnodes.

IIUC, in this tombstone case, the no-shuffle backfill snapshot read is aligned with tombstones of all vnodes so well, hence, it can make progress in parallel which is faster than snapshot read per vnode sequentially.

Yes, exactly. So we should keep snapshot read in parallel for arrangement backfill as well. The only difference is that no-shuffle backfill maintains a current pos across all vnodes, so the tombstone iteration progress isn't lost. We adapt that logic in this PR as well: ensure that per epoch, the current pos across all vnodes bumps up.

@kwannoel
Contributor Author

kwannoel commented Jun 15, 2024

When you mention iter in serial, do you mean running each iterator sequentially rather than concurrently? Or do you mean iterating concurrently but doing a merge sort like we do in no-shuffle backfill, maintaining a current pos per partition rather than per vnode?

@chenzl25
Contributor

chenzl25 commented Jun 15, 2024

When you mention iter in serial, do you mean running each iterator sequentially rather than concurrently? Or do you mean iterating concurrently but doing a merge sort like we do in no-shuffle backfill, maintaining a current pos per partition rather than per vnode?

I mean running each iterator sequentially. However, in that case it would be 256 times slower (if we have 256 vnodes in an executor) than the no-shuffle backfill's concurrent read.
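The trade-off in this thread can be illustrated with a small synchronous analogue (illustrative only; the real code polls async per-vnode iterators via `select_all`): interleaved reading lets every vnode make progress each step, while serial reading drains one vnode completely before touching the next, which is why tombstone progress stalls on the untouched vnodes.

```rust
// Interleave all vnode iterators round-robin, so each vnode makes
// progress on every pass (a synchronous stand-in for select_all).
fn interleaved(mut iters: Vec<std::vec::IntoIter<i32>>) -> Vec<i32> {
    let mut out = Vec::new();
    loop {
        let mut progressed = false;
        for it in iters.iter_mut() {
            if let Some(x) = it.next() {
                out.push(x);
                progressed = true;
            }
        }
        if !progressed {
            break; // every vnode iterator is exhausted
        }
    }
    out
}

// Drain each vnode iterator to completion before starting the next.
fn serial(iters: Vec<std::vec::IntoIter<i32>>) -> Vec<i32> {
    iters.into_iter().flatten().collect()
}

fn main() {
    let vnodes = vec![vec![1, 2].into_iter(), vec![10, 20].into_iter()];
    assert_eq!(interleaved(vnodes), vec![1, 10, 2, 20]);
    let vnodes = vec![vec![1, 2].into_iter(), vec![10, 20].into_iter()];
    assert_eq!(serial(vnodes), vec![1, 2, 10, 20]);
    println!("ok");
}
```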

@kwannoel kwannoel force-pushed the kwannoel/one-per-vnode branch from 7e6e5a4 to 9ca1762 Compare June 15, 2024 16:47
kwannoel added 12 commits June 17, 2024 02:56
- In the prev epoch, the current_pos may be bumped when iterating tombstone ranges.
- In such a scenario, the current pos may be an exclusive bound.
- In the subsequent epoch, the current pos may be deleted, so when we scan from it, we don't actually get anything back.
- Then we may finish the snapshot read in that epoch with the state showing that the current pos was not yielded. But that is fine.
@kwannoel kwannoel force-pushed the kwannoel/one-per-vnode branch from e55509c to f51e190 Compare June 16, 2024 18:56
@kwannoel
Contributor Author

kwannoel commented Jun 17, 2024

It passes the tombstone test case now, but its tombstone iteration is still slower than no-shuffle backfill:
8m20s vs 6m9s.

https://buildkite.com/risingwavelabs/main-cron/builds/2649#0190244b-06f3-4872-a0f5-c2ef01c8a4cd


In main-cron, the runtime for the arrangement backfill tombstone scan test is around 6m as well, so there's a regression: https://buildkite.com/risingwavelabs/main-cron/builds/2644#019021c8-381b-41ba-90df-e7b66cd77834

Trying to optimize it further.

Contributor

This PR has been open for 60 days with no activity.

If it's blocked by code review, feel free to ping a reviewer or ask someone else to review it.

If you think it is still relevant today, and have time to work on it in the near future, you can comment to update the status, or just manually remove the no-pr-activity label.

You can also confidently close this PR to keep our backlog clean. (If no further action taken, the PR will be automatically closed after 7 days. Sorry! 🙏)
Don't worry if you think the PR is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

Contributor

Close this PR as there's no further actions taken after it is marked as stale for 7 days. Sorry! 🙏

You can reopen it when you have time to continue working on it.

@github-actions github-actions bot closed this Aug 24, 2024
Successfully merging this pull request may close these issues.

arrangement backfill can be slow when there are consecutive tombstones in upstream table