fix(storage): fix flush small files when the capacity of shared-buffer is full #15832
Conversation
Another problem is that if several small upload tasks with epoch-1 keep running, the EventHandler cannot poll the tasks with epoch-2 or larger.
Little-Wallace force-pushed from 0826e05 to a144d7c
```rust
    .iter()
    .flat_map(|imm| imm.epochs().clone())
    .sorted()
    .dedup()
    .collect_vec();

// reverse to make newer epochs come first
epochs.reverse();
```
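For reference, a minimal self-contained sketch of what the snippet above computes, with `Imm` as a hypothetical stand-in for the real immutable memtable type and plain `Vec` methods in place of the `itertools` adapters:

```rust
// Hypothetical stand-in for the real immutable memtable type.
struct Imm {
    epochs: Vec<u64>,
}

// Collect all epochs covered by the imms, deduplicated, newest first.
fn newest_first_epochs(imms: &[Imm]) -> Vec<u64> {
    let mut epochs: Vec<u64> = imms.iter().flat_map(|imm| imm.epochs.clone()).collect();
    epochs.sort_unstable();
    epochs.dedup();
    // reverse to make newer epochs come first
    epochs.reverse();
    epochs
}

fn main() {
    let imms = [Imm { epochs: vec![1, 2] }, Imm { epochs: vec![2, 3] }];
    assert_eq!(newest_first_epochs(&imms), vec![3, 2, 1]);
}
```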
Will there be any side effects from this change? cc @wenym1
```rust
    .iter()
    .rev()
    .is_sorted_by_key(|imm| imm.batch_id()));
```
I think the imm_ids in self.staging.data are still sorted. Why do we remove this assertion?
Because some of them may be sstable files, and an sstable file does not have a batch id.
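To illustrate: once self.staging.data holds a mix of imms and spilled SSTs, a batch-id ordering assertion over the whole vector no longer applies. A simplified sketch (stand-in types, not the real definitions):

```rust
// Simplified stand-ins: only imms carry a batch id; spilled SSTs do not.
struct ImmutableMemtable {
    batch_id: u64,
}
struct StagingSst {
    epochs: Vec<u64>,
}

enum StagingData {
    Imm(ImmutableMemtable),
    Sst(StagingSst),
}

fn batch_id(data: &StagingData) -> Option<u64> {
    match data {
        StagingData::Imm(imm) => Some(imm.batch_id),
        // An SST may merge many imms, so it has no single batch id.
        StagingData::Sst(_) => None,
    }
}

fn main() {
    let data = [
        StagingData::Imm(ImmutableMemtable { batch_id: 2 }),
        StagingData::Sst(StagingSst { epochs: vec![1, 2] }),
    ];
    // Only the imm entries have batch ids, so a vector-wide
    // `is_sorted_by_key(|d| d.batch_id())` assertion no longer applies.
    let ids: Vec<Option<u64>> = data.iter().map(batch_id).collect();
    assert_eq!(ids, vec![Some(2), None]);
}
```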
```rust
    )
}
StagingData::Sst(staging_sst) => {
    let sst_min_epoch = *staging_sst.epochs.first().expect("epochs not empty");
```
Why did we use the max_epoch of the staging sst previously? Sounds like a bug. cc @Li0k
No. If the min epoch is smaller than max_epoch_inclusive, we consider that the data range may overlap with the query range.
Sorry for the confusion. I mean that using the max_epoch previously was a bug. Using min_epoch is the correct way.
Good catch. Filtering by min_epoch is the better way, even if we don't have such a usage scenario now.
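A sketch of the filtering rule under discussion, assuming a staging SST covers the epoch range [min_epoch, max_epoch]: a read at max_epoch_inclusive may overlap the SST whenever the SST's min epoch is not newer than the query epoch, whereas filtering by max epoch would wrongly skip an SST whose older epochs still fall inside the query range:

```rust
// Hypothetical epoch range covered by a staging SST.
struct StagingSst {
    epochs: Vec<u64>, // sorted ascending, non-empty
}

impl StagingSst {
    fn min_epoch(&self) -> u64 {
        *self.epochs.first().expect("epochs not empty")
    }
}

// Keep the SST for a query reading at `max_epoch_inclusive` if it may overlap.
fn may_overlap(sst: &StagingSst, max_epoch_inclusive: u64) -> bool {
    sst.min_epoch() <= max_epoch_inclusive
}

fn main() {
    // An SST containing data of epochs 3..=5.
    let sst = StagingSst { epochs: vec![3, 4, 5] };
    // A query at epoch 4 overlaps epochs 3 and 4 of the SST, even though the
    // SST's max epoch (5) is newer than the query epoch; a max-epoch filter
    // (5 <= 4) would wrongly skip it.
    assert!(may_overlap(&sst, 4));
    // A query at epoch 2 predates all data in the SST.
    assert!(!may_overlap(&sst, 2));
}
```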
```rust
    StagingData::Sst(_) => true,
});
self.staging.data.push(StagingData::Sst(staging_sst_ref));
self.staging.data.sort_by_key(|data| data.min_epoch());
```
If more than one imm/sst belongs to the same epoch, will sorting by epoch break the correct ordering? I still think sorting by epoch is a bit risky. Given that we have found the consecutive imms to be replaced, isn't it more straightforward to replace these imms with the staging SST, instead of pushing the staging sst to the end and sorting by min epoch?
@wenym1 Since we add an offset to the tail of the epoch, could we assume that every memtable owns a different epoch?
Yes, every memtable of a single local state store instance owns different epochs. But I'm afraid that we cannot use the min_epoch of the sst here.

Let's use epoch<a>-<b> to represent an EpochWithGap with real epoch a and inner offset b, and say we have two instances (A and B) of the local state store.

In epoch1, A writes an imm with epoch1-1, and then a spill happens and generates an SST with min epoch epoch1-1. Then B writes an imm with epoch1-1, A writes an imm with epoch1-2, and another spill happens and generates an SST whose min epoch is also epoch1-1. Both ssts contain the data of A, so they will both appear in A, but they have the same min epoch epoch1-1, and this will cause an ordering problem if we simply sort by min epoch.
I think #15832 (comment) can solve the problem in a more elegant way, and we can avoid the sorting to save time.
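To make the scenario concrete, here is a self-contained sketch; the bit-packing of EpochWithGap used here (real epoch in the high 48 bits, spill offset in the low 16 bits) is purely illustrative, not the actual encoding:

```rust
// Illustrative encoding of an EpochWithGap: real epoch in the high bits,
// gap offset in the low 16 bits.
fn epoch_with_gap(epoch: u64, offset: u64) -> u64 {
    (epoch << 16) | offset
}

fn main() {
    let epoch1 = 1u64;

    // Instance A writes an imm at epoch1-1; a spill produces sst_1.
    let sst_1_min_epoch = epoch_with_gap(epoch1, 1);
    // Instance B writes an imm at epoch1-1, instance A writes at epoch1-2;
    // a second spill produces sst_2, whose min epoch is B's epoch1-1.
    let sst_2_min_epoch = epoch_with_gap(epoch1, 1);

    // Both ssts contain A's data and land in A's staging list, but they
    // share the same min epoch, so sorting by min epoch cannot order them.
    assert_eq!(sst_1_min_epoch, sst_2_min_epoch);
}
```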
Anyway, I refactored this code to replace the memtables with the staging data in place, rather than sorting the whole array. I think refactoring it with a BTreeMap can be left to the next PR.
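A possible shape for that follow-up, sketched under assumptions: staging entries keyed in a BTreeMap by the smallest imm batch id they cover, so a spill becomes a few keyed removals plus one insert instead of a full-vector retain and sort. All types here are simplified stand-ins:

```rust
use std::collections::BTreeMap;

// Hypothetical staging entry: either an imm or a spilled SST.
enum StagingData {
    Imm,
    Sst { replaced_imm_ids: Vec<u64> },
}

// Replace the spilled imms with their SST via keyed removals, avoiding a
// scan over all staging entries.
fn apply_spill(staging: &mut BTreeMap<u64, StagingData>, imm_ids: Vec<u64>) {
    for id in &imm_ids {
        staging.remove(id);
    }
    // Key the SST by the smallest imm id it replaces to keep the ordering.
    let key = *imm_ids.iter().min().expect("non-empty spill");
    staging.insert(key, StagingData::Sst { replaced_imm_ids: imm_ids });
}

fn main() {
    let mut staging: BTreeMap<u64, StagingData> = BTreeMap::new();
    for id in [1, 2, 3, 4] {
        staging.insert(id, StagingData::Imm);
    }
    apply_spill(&mut staging, vec![2, 3]);
    assert_eq!(staging.len(), 3); // imms 1 and 4, plus the new SST at key 2
}
```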
Discussed offline: Given that …
```rust
self.staging
    .data
    .insert(first_pos, StagingData::Sst(staging_sst_ref));
self.staging.data.retain(|data| match data {
```
retain will iterate over all imms and ssts. It has been observed that when there are many actors, this cost is amplified greatly and causes the event handler to get stuck. With #16725 merged, staging_sst_ref now tells us which imm ids should be replaced. We should change to maintaining a BTreeMap over the staging data, as suggested in the previous comment.
Good catch. I will refactor it later.
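For illustration, a minimal sketch of the replace-in-place approach suggested above, using Vec::splice on simplified stand-in types; the real StagingData and position bookkeeping differ:

```rust
// Hypothetical staging entry.
#[derive(Debug, PartialEq)]
enum StagingData {
    Imm(u64), // imm id
    Sst,      // spilled SST replacing a run of imms
}

// Replace the consecutive imms at positions `first_pos..first_pos + n`
// with the SST directly, instead of insert-then-retain over everything.
fn replace_spilled_imms(data: &mut Vec<StagingData>, first_pos: usize, n: usize) {
    // `splice` removes the imms and inserts the SST in one pass; dropping
    // the returned iterator completes the operation.
    data.splice(first_pos..first_pos + n, std::iter::once(StagingData::Sst));
}

fn main() {
    let mut data = vec![
        StagingData::Imm(1),
        StagingData::Imm(2),
        StagingData::Imm(3),
        StagingData::Imm(4),
    ];
    // Imms 2 and 3 (positions 1..3) were spilled into one SST.
    replace_spilled_imms(&mut data, 1, 2);
    assert_eq!(
        data,
        vec![StagingData::Imm(1), StagingData::Sst, StagingData::Imm(4)]
    );
}
```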
Little-Wallace force-pushed from 0c25eb2 to df7b13a
```rust
}
let mut flush_unseal = self.sealed_data.flush(&self.context, true);
```
nit: this triggers a flush of sealed_data, so it is better to call it flushed instead of flush_unseal.
```diff
@@ -1433,8 +1457,14 @@ mod tests {
     }

     #[tokio::test]
-    async fn test_uploader_finish_in_order() {
+    async fn test_uploader_finish_not_in_order() {
         let (buffer_tracker, mut uploader, new_task_notifier) = prepare_uploader_order_test();
```
The implementation of test_uploader_finish_in_order is unchanged, meaning that it is still testing "finish_in_order". Can we preserve the previous name?
```rust
}
let imm = gen_imm_with_limiter(epoch1, Some(memory_limiter.as_ref())).await;
uploader.add_imm(imm);
assert!(uploader.may_flush());
```
In addition to checking whether may_flush is triggered, we should assert that epoch2's data is flushed instead of epoch1's. IIUC, may_flush will be triggered both with and without this PR. The real behavior change is that epoch2 will be flushed before epoch1.
After discussing with @wenym1 offline, we found a critical pitfall in the implementation of this PR, which can affect correctness. Let's use …
This PR has been open for 60 days with no activity. If it's blocked by code review, feel free to ping a reviewer or ask someone else to review it. If you think it is still relevant today and have time to work on it in the near future, you can comment to update the status, or just manually remove the label. You can also confidently close this PR to keep our backlog clean. (If no further action is taken, the PR will be automatically closed after 7 days. Sorry! 🙏)
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
When several barriers try to write data to hummock, hummock may flush a lot of small files, because we flush data in barrier order instead of size order.
I show the wrong case in a unit test, and it fails on the main branch.
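As a toy model of the intent (hypothetical types and numbers, not the actual uploader code): when the shared buffer hits capacity, choosing the epoch with the most buffered bytes avoids writing a tiny file for the oldest barrier:

```rust
use std::collections::HashMap;

// Hypothetical buffered sizes per epoch, in bytes.
fn pick_epoch_to_flush(buffered: &HashMap<u64, usize>) -> Option<u64> {
    // Size order: flush the epoch holding the most data first.
    buffered
        .iter()
        .max_by_key(|(_, size)| **size)
        .map(|(epoch, _)| *epoch)
}

fn main() {
    let buffered: HashMap<u64, usize> =
        [(1, 4 << 10), (2, 64 << 20), (3, 8 << 10)].into_iter().collect();
    // Barrier order would flush epoch 1 (4 KiB) and produce a tiny file;
    // size order flushes epoch 2 (64 MiB) and frees real capacity.
    assert_eq!(pick_epoch_to_flush(&buffered), Some(2));
}
```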
Checklist
- ./risedev check (or alias, ./risedev c)

Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.