feat(storage): optimize data alignment for default compaction group #13075
Conversation
| GitGuardian id | Secret | Commit | Filename |
|---|---|---|---|
| 7648795 | Generic CLI Secret | 0e21215 | integration_tests/iceberg-cdc/docker-compose.yml |
| 7648795 | Generic CLI Secret | 0e21215 | integration_tests/iceberg-cdc/run_test.sh |
🛠 Guidelines to remediate hardcoded secrets
- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely. Learn the best practices here.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.
To avoid such incidents in the future, consider:
- following these best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation
Codecov Report

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #13075      +/-   ##
==========================================
+ Coverage   68.15%   68.22%   +0.06%
==========================================
  Files        1524     1524
  Lines      262331   262492     +161
==========================================
+ Hits       178793   179075     +282
+ Misses      83538    83417     -121
self.table_partition_vnode.get(&user_key.table_id.table_id);

if new_vnode_partition_count.is_some()
    || self.table_partition_vnode.contains_key(&self.last_table_id)
Why check self.table_partition_vnode.contains_key(&self.last_table_id)?
Same question here.
Example:
table_partition_vnode = [(1, 1)]
sst_1 (table_id 1) -------- built
sst_2 (table_id 1) -------- building
When adding a full_key with table_id 2, since table_id 1 is in table_partition_vnode, we should finish sst_2 first and create a new sst_3 for table_id 2, which gives better data alignment.
After add_full_key:
sst_1 (table_id 1) -------- built
sst_2 (table_id 1) -------- built
sst_3 (table_id 2) -------- building
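For readers skimming the thread, here is a minimal, self-contained sketch of this rule; the struct, method, and main function are made up for illustration, and only the boolean condition mirrors the quoted snippet:

```rust
use std::collections::BTreeMap;

/// Illustrative model of the cut decision, not the PR's actual builder code.
/// `table_partition_vnode` maps table_id -> vnode partition count.
struct CutState {
    last_table_id: u32,
    table_partition_vnode: BTreeMap<u32, u32>,
}

impl CutState {
    /// Seal the SST being built when the incoming key switches to a new table
    /// and either the new table or the table we were writing is partitioned.
    fn should_switch_builder(&self, incoming_table_id: u32) -> bool {
        if incoming_table_id == self.last_table_id {
            return false;
        }
        let new_vnode_partition_count = self.table_partition_vnode.get(&incoming_table_id);
        // In the example above: table 1 is in table_partition_vnode, so a key of
        // table 2 finishes sst_2 and starts sst_3 even though table 2 itself is
        // not partitioned.
        new_vnode_partition_count.is_some()
            || self.table_partition_vnode.contains_key(&self.last_table_id)
    }
}

fn main() {
    let state = CutState {
        last_table_id: 1,
        table_partition_vnode: BTreeMap::from([(1, 1)]),
    };
    assert!(state.should_switch_builder(2));
}
```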
We need to run some perf benchmarks to validate the gain of this optimization. As discussed in the meeting, we can either use the mirror cluster or use datagen to mimic the workload.
This PR basically makes the cut by table and vnode partition dynamic for creating tables. Random thought: if we apply this mechanism to all tables, do we still need compaction group split? What extra benefit can compaction group split bring?
@@ -33,9 +33,10 @@ backend = "Mem"
 periodic_space_reclaim_compaction_interval_sec = 3600
 periodic_ttl_reclaim_compaction_interval_sec = 1800
 periodic_tombstone_reclaim_compaction_interval_sec = 600
-periodic_split_compact_group_interval_sec = 180
+periodic_split_compact_group_interval_sec = 10
Is this change expected?
Yes, I think this check should be more sensitive (for backfill). What do you think?
group_to_table_vnode_partition:
    parking_lot::RwLock<HashMap<CompactionGroupId, BTreeMap<TableId, u32>>>,
Hmm... This is always mutated when we hold the compaction lock. Why not put it under Compaction?
src/meta/src/hummock/manager/mod.rs
Outdated
@@ -2532,6 +2586,7 @@ impl HummockManager {
    }
}

table_vnode_partition_mapping.insert(*table_id, partition_vnode_count); |
The vnode partition mapping set in L2553 can be overwritten here. Is this expected?
Okay, I'm going to remove it; currently we set it up when we build the compact_task.
proto/hummock.proto
Outdated
bool split_by_state_table = 21;
// Compaction needs to cut the state table every time 1/weight of vnodes in the table have been processed.
uint32 split_weight_by_vnode = 22;
Are they deprecated now?
Not in this PR; the new table_vnode_partition is not persistent.
Why is it not persistent? Are split_by_state_table and split_weight_by_vnode persistent? Why does persistence affect the deprecation of these fields?
Btw, we should document what fields in CompactTask are persistent and what fields are not.
Benchmark for this PR: #13037 (comment)
src/meta/src/hummock/manager/mod.rs
Outdated
@@ -2557,6 +2612,13 @@ impl HummockManager {
        }
    }
}

tracing::info!( |
Would this log be printed too frequently? I think we do not need to print it every time.
Changed it to debug; we can provide a separate interface to get table_vnode_partition if needed.
When we split one table into an independent group, it means that we calculate the score of its data independently, and we also use a larger base level size for it.
LGTM
src/meta/src/hummock/manager/mod.rs
Outdated
    }
}

if !created_tables.contains(table_id) { |
nits: this can be !is_creating_table
Should be is_creating_table
src/meta/src/hummock/manager/mod.rs
Outdated
// When in a hybrid compaction group, data from multiple state tables may exist in a single sst, and in order to make the data in the sub level more aligned, a proactive cut is made for the data.
// https://github.com/risingwavelabs/risingwave/issues/13037
// 1. In a backfill scenario, the state_table / mv may have high write throughput. Therefore, we relax the limit of `is_low_write_throughput` and partition the table with high write throughput by vnode to improve the parallel efficiency of compaction.
// 2. For table with low throughput, partition by table_id to minimize amplification.
// 3. When the write mode is changed (the above conditions are not met), the default behavior is restored
if !is_low_write_throughput {
    table_vnode_partition_mapping.insert(*table_id, hybrid_vnode_count);
} else if state_table_size > self.env.opts.cut_table_size_limit {
    table_vnode_partition_mapping.insert(*table_id, SPLIT_BY_TABLE);
} else {
    table_vnode_partition_mapping.remove(table_id);
}
}
The comments here indicate that case 1 is only for backfill (i.e. creating table ids), but in the code it applies to created table ids as well because it is not guarded by if is_creating_table. Is this expected?
It's expected. What I'm trying to explain is that tables which meet this condition are more likely to appear in backfill scenarios, since hummock currently prohibits creating tables from being split.
In short, we allow both creating and created tables to match this rule, and when a created table meets the split condition, it is more likely to be split away. I've changed the comment to make it easier to understand.
src/meta/src/hummock/manager/mod.rs
Outdated
if group_config.compaction_config.split_weight_by_vnode > 0 {
    // TODO: unify to table_to_vnode_partition directly
    table_to_vnode_partition.insert(
        member_table_ids[0],
There is an implicit assumption here: there is only a single item in member_table_ids when the if condition is met. We should add an assertion to check or throw an error if the assumption doesn't hold.
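A minimal sketch of the suggested guard; the function signature and simplified types here are illustrative, not the PR's actual code:

```rust
use std::collections::BTreeMap;

/// Illustrative: populate the per-table vnode partition only when the group
/// really has a single member table, failing loudly otherwise.
fn fill_vnode_partition(
    member_table_ids: &[u32],
    split_weight_by_vnode: u32,
    table_to_vnode_partition: &mut BTreeMap<u32, u32>,
) {
    if split_weight_by_vnode > 0 {
        // Guard the implicit assumption instead of silently indexing [0].
        assert_eq!(
            member_table_ids.len(),
            1,
            "split_weight_by_vnode > 0 implies a dedicated group with exactly one member table"
        );
        table_to_vnode_partition.insert(member_table_ids[0], split_weight_by_vnode);
    }
}
```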
Instead of updating table_to_vnode_partition ad-hoc in get_compact_task, I think it is more appropriate to update self.group_to_table_vnode_partition correspondingly in the following cases:
- A new group is created.
- split_weight_by_vnode is modified for an existing group.
- New table ids are added to a group.
Maybe that is what you mean by "// TODO: unify to table_to_vnode_partition directly"?
src/meta/src/hummock/manager/mod.rs
Outdated
let window_size = HISTORY_TABLE_INFO_STATISTIC_TIME / (checkpoint_secs as usize);

let mut group_to_table_vnode_partition = HashMap::default(); |
We will clear group_to_table_vnode_partition and re-calculate it. Just to double check: do we guarantee that group_infos will contain all compaction groups and all table ids?
@@ -230,13 +230,21 @@ async fn compact_shared_buffer(
let mut compaction_futures = vec![];
let use_block_based_filter = BlockedXor16FilterBuilder::is_kv_count_too_large(total_key_count);

let table_vnode_partition = if existing_table_ids.len() == 1 {
    let table_id = existing_table_ids.iter().next().unwrap();
    vec![(*table_id, split_weight_by_vnode as u32)]
Should we consider split_weight_by_vnode == 0 or 1?
@@ -206,7 +198,11 @@ where
 if switch_builder {
     need_seal_current = true;
 } else if builder.reach_capacity() {
-    if self.split_weight_by_vnode == 0 || builder.reach_max_sst_size() {
+    let is_split_table = self
Value can be 0. There may be other places where the corner cases of split_weight_by_vnode being 0 or 1 can occur.
In the new version, contains_key is used to identify whether partitioning needs to be performed on the table, while the old version set split_weight_by_vnode = 0 for the default group (which does not partition the tables in the default group).
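To make the contrast concrete, a small sketch with illustrative names (the real code paths differ):

```rust
use std::collections::BTreeMap;

// Old behavior: one per-group weight; 0 on the default group meant
// "do not partition the tables in this group".
fn old_is_split_table(split_weight_by_vnode: u32) -> bool {
    split_weight_by_vnode > 0
}

// New behavior: the decision is made per table by membership in the mapping;
// a table that is absent from the map is not proactively cut.
fn new_is_split_table(table_vnode_partition: &BTreeMap<u32, u32>, table_id: u32) -> bool {
    table_vnode_partition.contains_key(&table_id)
}
```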
    VirtualNode::COUNT / (self.split_weight_by_vnode as usize) - 1;
} else {
    // default
    self.largest_vnode_in_current_partition = VirtualNode::MAX.to_index();
Did we forget to set largest_vnode_in_current_partition when split_weight_by_vnode is 1 before this PR? Will it cause a problem?
largest_vnode_in_current_partition: VirtualNode::MAX.to_index()

The default value of largest_vnode_in_current_partition is VirtualNode::MAX.to_index(), so it will not cause a problem even if we set split_weight_by_vnode to 0 or 1.
In short, we can turn off this feature by setting vnode_partition_count to 0.
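A simplified sketch of the boundary math this answer relies on, assuming the default of 256 virtual nodes; the constant and function names are illustrative:

```rust
// Stand-in for VirtualNode::COUNT (256 by default).
const VNODE_COUNT: usize = 256;

fn largest_vnode_in_first_partition(vnode_partition_count: u32) -> usize {
    if vnode_partition_count > 0 {
        // e.g. 4 partitions over 256 vnodes -> the first partition covers vnodes 0..=63.
        // With a value of 1 this is 256 / 1 - 1 = 255, the same as the default below,
        // which is why 0 or 1 effectively disables the cut.
        VNODE_COUNT / (vnode_partition_count as usize) - 1
    } else {
        // Default: a single partition spanning all vnodes, i.e. VirtualNode::MAX.to_index().
        VNODE_COUNT - 1
    }
}
```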
Rest LGTM!
// 1. throughput and size are checked periodically and calculated according to the rules
// 2. A new group is created (split)
// 3. split_weight_by_vnode is modified for an existing group. (not supported yet)
group_to_table_vnode_partition: |
Can you also document what it means if:
- a table id is not found in this mapping
- a table id is found in this mapping with 0 as the vnode partition number
- a table id is found in this mapping with 1 as the vnode partition number
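For reference, one shape such documentation could take; the meanings attached to 0 and larger values below are assumptions drawn from this thread, not confirmed semantics:

```rust
use std::collections::BTreeMap;

/// Illustrative interpretation of group_to_table_vnode_partition entries.
fn describe_partition(table_vnode_partition: &BTreeMap<u32, u32>, table_id: u32) -> &'static str {
    match table_vnode_partition.get(&table_id) {
        // Absent: the table is not proactively cut at all.
        None => "no proactive cut",
        // 0: assumed to mean "cut at the table boundary only" (split by table id).
        Some(0) => "split by table id",
        // n >= 1: assumed to mean "additionally cut the table into n vnode partitions".
        Some(_) => "split by vnode within the table",
    }
}
```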
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
#13037
Checklist
./risedev check (or alias, ./risedev c)
Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.