fix(backfill): make tombstone iteration progress across all vnodes per epoch #17266
Conversation
// `yielded` column
// Only present for arrangement backfill.
if is_arrangement_backfill {
    catalog_builder.add_column(&Field::with_name(DataType::Boolean, "yielded"));
Do we need to ensure backward compatibility after adding this new column? In other words, is it possible that an arrangement backfill is triggered in an old version and resumed in a new version?
True, I will handle it later.
Handled in 66f44cb.
We only need to handle the deserialization path (see the sketch after this list), because backfill jobs should fall into one of 3 categories:
- Finished -> They won't write to the state table again, so there's no need to handle the encoding path. They still need to be read from the state table, so the decoding path must be handled.
- Not finished -> They will be cleaned up after the upgrade, so there's no need to handle backwards compatibility at all.
- Background ddl -> They won't be cleaned up after the upgrade, and will panic. We can let users recreate these jobs. I think it's unlikely to be encountered, since stream job creation should not take a long time. But if users do encounter it, recreating the stream job is an easy solution.
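For illustration, here is a minimal, hypothetical sketch of what tolerating old-format rows on the decoding path could look like. The names (`BackfillProgress`, `decode_progress_row`) and the simplified row representation are assumptions for this example, not the PR's actual API; the key idea is that rows written before the `yielded` column existed default to `yielded = true`.

```rust
// Hypothetical sketch: decode backfill progress state where the `yielded`
// column may be absent in rows written by an older version. Row values are
// simplified to `Option<bool>` here purely for illustration.

#[derive(Debug)]
struct BackfillProgress {
    is_finished: bool,
    // Defaults to true for rows written before the column existed, since old
    // versions always yielded whatever they scanned.
    yielded: bool,
}

fn decode_progress_row(row: &[Option<bool>], is_arrangement_backfill: bool) -> BackfillProgress {
    let is_finished = row[0].unwrap_or(false);
    let yielded = if is_arrangement_backfill {
        // Old-format rows have fewer columns; treat the missing flag as `true`.
        match row.get(1) {
            Some(Some(v)) => *v,
            _ => true,
        }
    } else {
        true
    };
    BackfillProgress { is_finished, yielded }
}

fn main() {
    // New-format row with an explicit `yielded = false`.
    let new_row = vec![Some(false), Some(false)];
    // Old-format row written before the `yielded` column was added.
    let old_row = vec![Some(true)];
    println!("{:?}", decode_progress_row(&new_row, true));
    println!("{:?}", decode_progress_row(&old_row, true));
}
```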
risingwave/src/stream/src/executor/backfill/arrangement_backfill.rs Lines 730 to 731 in d488f65
Can we try to provide an option to allow arrangement backfill
I tried it before, but the performance was worse. We can provide it as a session variable, I suppose. In this case, I think snapshot read in serial may worsen the problem.
remaining_vnodes.set_bit(vnode_idx);
if remaining_vnodes.is_empty() {
    break;
}
After setting a bit, could it be empty?
It should be set to 0 instead. Let me fix it later.
Fixed in b29adfd. Since `Bitmap` is designed to be an immutable data structure (as per its docs), it is hard to update correctly. I have opted to convert it to a `BitSet` instead.
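For illustration, here is a minimal, std-only sketch of the intended "remaining vnodes" bookkeeping, assuming a mutable bit-set-like structure. The `RemainingVnodes` type and its methods are hypothetical stand-ins, not the `BitSet` used in the PR: a vnode's bit is cleared once its snapshot iterator is exhausted, and the loop stops once no vnodes remain.

```rust
// Hypothetical stand-in for a mutable bit set tracking which vnodes still
// have snapshot data to read in the current epoch.
struct RemainingVnodes {
    bits: Vec<bool>,
    count: usize,
}

impl RemainingVnodes {
    fn new(vnode_count: usize) -> Self {
        Self {
            bits: vec![true; vnode_count],
            count: vnode_count,
        }
    }

    // Clear the bit for a vnode whose iterator is exhausted.
    fn clear_bit(&mut self, vnode_idx: usize) {
        if std::mem::replace(&mut self.bits[vnode_idx], false) {
            self.count -= 1;
        }
    }

    fn is_empty(&self) -> bool {
        self.count == 0
    }
}

fn main() {
    let mut remaining = RemainingVnodes::new(4);
    for vnode_idx in 0..4 {
        // Pretend this vnode's snapshot iterator has just been exhausted.
        remaining.clear_bit(vnode_idx);
        if remaining.is_empty() {
            break;
        }
    }
    assert!(remaining.is_empty());
}
```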
src/common/src/buffer/bitmap.rs (outdated)
pub fn set_bit(&mut self, idx: usize) {
    assert!(idx < self.len());
    if let Some(bits) = &mut self.bits {
        bits[idx / BITS] |= 1 << (idx % BITS);
    } else {
        self.count_ones += 1;
        if self.count_ones == self.num_bits {
            self.bits = None;
        }
    }
}
It seems that `count_ones` should be updated as well.
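For context, here is a hedged sketch of what keeping `count_ones` in sync could look like, assuming `bits: Option<Vec<usize>>` where `None` means "all bits are set". This is illustrative only and is not the PR's actual change; as noted above, the PR moved away from mutating `Bitmap` and switched to a `BitSet` instead.

```rust
// Sketch only: a simplified Bitmap where `None` is assumed to mean
// "all bits are set". Not the actual RisingWave implementation.
const BITS: usize = usize::BITS as usize;

struct Bitmap {
    bits: Option<Vec<usize>>,
    count_ones: usize,
    num_bits: usize,
}

impl Bitmap {
    fn len(&self) -> usize {
        self.num_bits
    }

    fn set_bit(&mut self, idx: usize) {
        assert!(idx < self.len());
        if let Some(bits) = &mut self.bits {
            let mask = 1usize << (idx % BITS);
            // Only bump `count_ones` when the bit was previously unset.
            if bits[idx / BITS] & mask == 0 {
                bits[idx / BITS] |= mask;
                self.count_ones += 1;
            }
            // Once every bit is set, fall back to the "all ones" representation.
            if self.count_ones == self.num_bits {
                self.bits = None;
            }
        }
        // If `bits` is `None`, every bit is already set; nothing to update.
    }
}

fn main() {
    let mut bitmap = Bitmap {
        bits: Some(vec![0; 1]),
        count_ones: 0,
        num_bits: 3,
    };
    for i in 0..3 {
        bitmap.set_bit(i);
    }
    // All three bits are now set, so the sketch collapses to the `None` form.
    assert!(bitmap.bits.is_none());
    assert_eq!(bitmap.count_ones, 3);
}
```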
IIUC, in this tombstone case, the no-shuffle backfill snapshot read is well aligned with the tombstones of all vnodes; hence, it can make progress in parallel, which is faster than reading the snapshot per vnode sequentially.
Yes, exactly. So we should keep the snapshot read in parallel for arrangement backfill as well. The only difference is that no-shuffle backfill maintains a current pos across all vnodes, so the tombstone iteration progress won't be lost. We adapt that logic in this PR as well: ensure that per epoch, the current pos across all vnodes bumps up.
When you mention iterating in serial, do you mean running each iterator sequentially, rather than concurrently? Or do you mean iterating concurrently but doing a merge sort like we do in no-shuffle backfill, and maintaining a current pos per partition rather than per vnode?
I mean running each iterator sequentially; however, in this case, it would be 256 times slower (if we have 256 vnodes in an executor) than the concurrent no-shuffle backfill.
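To make the contrast concrete, here is a small, self-contained sketch of the k-way merge approach: all vnodes are read concurrently and merged in sorted order, so a single current pos can advance across every vnode, as in no-shuffle backfill. The names are hypothetical and plain `Vec<i64>` values stand in for per-vnode snapshot iterators.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Merge the sorted rows of every vnode into one globally sorted stream,
// mirroring the no-shuffle backfill style of concurrent snapshot read.
fn merge_sorted(per_vnode: Vec<Vec<i64>>) -> Vec<i64> {
    // Min-heap of (next value, vnode index, offset within that vnode).
    let mut heap = BinaryHeap::new();
    for (vnode, rows) in per_vnode.iter().enumerate() {
        if let Some(&v) = rows.first() {
            heap.push(Reverse((v, vnode, 0usize)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((v, vnode, off))) = heap.pop() {
        merged.push(v); // A single "current pos" advances across *all* vnodes here.
        if let Some(&next) = per_vnode[vnode].get(off + 1) {
            heap.push(Reverse((next, vnode, off + 1)));
        }
    }
    merged
}

fn main() {
    let per_vnode = vec![vec![1, 4, 7], vec![2, 5, 8], vec![3, 6, 9]];
    assert_eq!(merge_sorted(per_vnode), vec![1, 2, 3, 4, 5, 6, 7, 8, 9]);
}
```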
7e6e5a4 to 9ca1762 (compare)
- In the previous epoch, the current_pos may be bumped when iterating tombstone ranges.
- In such a scenario, this current pos may be an exclusive bound.
- In the subsequent epoch, the current pos may be deleted, so when we scan from it, we don't actually get anything back.
- Then we may finish the snapshot read in that epoch, with the current state showing that current pos is not yielded. But that is fine.
This reverts commit 47f233c.
e55509c to f51e190 (compare)
It has passed the tombstone test case now, but its tombstone iteration is still worse than no-shuffle backfill. https://buildkite.com/risingwavelabs/main-cron/builds/2649#0190244b-06f3-4872-a0f5-c2ef01c8a4cd In main-cron, the runtime for the arrangement backfill tombstone scan test is around . Trying to optimize it further.
This PR has been open for 60 days with no activity. If it's blocked by code review, feel free to ping a reviewer or ask someone else to review it. If you think it is still relevant today, and have time to work on it in the near future, you can comment to update the status, or just manually remove the label. You can also confidently close this PR to keep our backlog clean. (If no further action is taken, the PR will be automatically closed after 7 days. Sorry! 🙏)
Closing this PR as no further action was taken after it was marked as stale for 7 days. Sorry! 🙏 You can reopen it when you have time to continue working on it.
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
Thanks @hzxa21 for sharing his diagnosis and ideas to solve it.
Resolve #17267
This is required because if we only yield the progress of a single vnode, the tombstone iteration progress of the other vnodes is dropped.
This is not an issue in no-shuffle backfill, because the current pos is maintained at the parallelism level for all vnodes within it, rather than at the vnode level as in arrangement backfill.
However, we cannot simply yield all the scanned records downstream; otherwise, each backfill epoch would have an invariant of 256 records per epoch. If the processing latency of those 256 records is high, barrier latency will be high.
So what we do is to add a `yielded` flag. In the case where there's no snapshot read in an epoch, we set the `yielded` flag to false, and continue to bump the current progress. Since we don't yield the records which were read from the snapshot, in the subsequent epoch we will scan these same records, but use an inclusive bound with them, so they will be included in the newly scanned snapshot. For a normal snapshot read, `yielded` will always be true.

Want to add a separate test later.
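To illustrate the mechanism described above, here is a minimal, hypothetical sketch; the names `VnodeProgress`, `next_scan_bound`, and `advance_epoch` are illustrative, not the executor's actual API. When an epoch's scan only made tombstone progress and yielded nothing, the current pos is still bumped but marked as not yielded, so the next scan uses an inclusive bound; a yielded position gets an exclusive bound.

```rust
use std::ops::Bound;

#[derive(Debug, Clone, Copy)]
struct VnodeProgress {
    current_pos: Option<i64>,
    yielded: bool,
}

// Lower bound for the next epoch's snapshot scan.
fn next_scan_bound(progress: &VnodeProgress) -> Bound<i64> {
    match progress.current_pos {
        None => Bound::Unbounded,
        // Rows up to an un-yielded pos were never sent downstream, so re-scan
        // them with an inclusive bound.
        Some(pos) if !progress.yielded => Bound::Included(pos),
        // Yielded rows must not be sent twice: exclusive bound.
        Some(pos) => Bound::Excluded(pos),
    }
}

// Update progress at the end of an epoch, given the last position the scan
// reached and whether any of the scanned rows were actually yielded.
fn advance_epoch(progress: &mut VnodeProgress, last_pos_reached: Option<i64>, yielded_rows: bool) {
    if let Some(pos) = last_pos_reached {
        progress.current_pos = Some(pos);
        progress.yielded = yielded_rows;
    }
}

fn main() {
    let mut p = VnodeProgress { current_pos: None, yielded: true };
    // Epoch 1: iterated over tombstones only, nothing yielded downstream.
    advance_epoch(&mut p, Some(42), false);
    assert_eq!(next_scan_bound(&p), Bound::Included(42));
    // Epoch 2: normal snapshot read that yielded rows.
    advance_epoch(&mut p, Some(100), true);
    assert_eq!(next_scan_bound(&p), Bound::Excluded(100));
    println!("{:?}", p);
}
```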
Checklist
./risedev check (or alias, ./risedev c)

Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.