
feat(storage): optimize data alignment for default compaction group #13075

Merged

19 commits merged into main on Dec 1, 2023

Conversation

@Li0k (Contributor) commented Oct 26, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

#13037

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@gitguardian (bot) commented Oct 31, 2023

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

GitGuardian id   Secret               Commit    Filename
7648795          Generic CLI Secret   0e21215   integration_tests/iceberg-cdc/docker-compose.yml
7648795          Generic CLI Secret   0e21215   integration_tests/iceberg-cdc/run_test.sh
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, following best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@Li0k Li0k marked this pull request as ready for review October 31, 2023 09:55
@codecov (bot) commented Oct 31, 2023

Codecov Report

Attention: 167 lines in your changes are missing coverage. Please review.

Comparison: base (da79ff5) 68.15% vs. head (840dca3) 68.22%.
Report is 1 commit behind head on main.

Files                                                   Patch %   Lines
src/meta/src/hummock/manager/mod.rs                     29.22%    155 Missing ⚠️
src/common/src/array/arrow.rs                           69.23%    8 Missing ⚠️
src/meta/node/src/lib.rs                                0.00%     2 Missing ⚠️
src/meta/src/manager/catalog/mod.rs                     0.00%     1 Missing ⚠️
...age/src/hummock/compactor/shared_buffer_compact.rs   88.88%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13075      +/-   ##
==========================================
+ Coverage   68.15%   68.22%   +0.06%     
==========================================
  Files        1524     1524              
  Lines      262331   262492     +161     
==========================================
+ Hits       178793   179075     +282     
+ Misses      83538    83417     -121     
Flag Coverage Δ
rust 68.22% <57.39%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.


self.table_partition_vnode.get(&user_key.table_id.table_id);

if new_vnode_partition_count.is_some()
|| self.table_partition_vnode.contains_key(&self.last_table_id)
Contributor:

Why check self.table_partition_vnode.contains_key(&self.last_table_id)?

Collaborator:

Same question here.

@Li0k (Contributor, Author, Nov 13, 2023):

Example:

table_partition_vnode = [(1, 1)]

sst_1 (table_id 1) -------- built
sst_2 (table_id 1) -------- building

When a full_key for table_id 2 is added: because table_id 1 is in table_partition_vnode, we should finish sst_2 first and create a new sst_3 for table_id 2, which gives better data alignment.

After add_full_key:

sst_1 (table_id 1) -------- built
sst_2 (table_id 1) -------- built

sst_3 (table_id 2) -------- building
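
A minimal sketch of that cut decision (simplified, hypothetical types; the struct and its fields stand in for the real builder state, not the exact PR code):

// Sketch of the cut decision described above.
use std::collections::BTreeMap;

struct CapacitySplitTableBuilder {
    /// table_id -> vnode partition count, for tables that must stay aligned.
    table_partition_vnode: BTreeMap<u32, u32>,
    /// table_id of the last key appended to the SST being built.
    last_table_id: u32,
}

impl CapacitySplitTableBuilder {
    /// Seal the current SST before appending `table_id`'s data whenever the
    /// table switches and either side of the switch is a partitioned table.
    fn need_seal_current(&self, table_id: u32) -> bool {
        table_id != self.last_table_id
            && (self.table_partition_vnode.contains_key(&table_id)
                || self.table_partition_vnode.contains_key(&self.last_table_id))
    }
}

fn main() {
    let builder = CapacitySplitTableBuilder {
        table_partition_vnode: BTreeMap::from([(1, 1)]),
        last_table_id: 1,
    };
    // Appending a key of table 2 while table 1 is partitioned: seal sst_2 first.
    assert!(builder.need_seal_current(2));
}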

@hzxa21 (Collaborator) commented Nov 1, 2023

We need to run a perf benchmark to validate the gain of this optimization. As discussed in the meeting, we can either use the mirror cluster or use datagen to mimic the workload.

@Li0k Li0k force-pushed the li0k/storage_cut_sst_file branch from 3c38e42 to 3ef2635 Compare November 1, 2023 11:45
@hzxa21 (Collaborator) commented Nov 2, 2023

This PR basically makes the cut by table and vnode partition dynamic for creating tables. Random thought: if we apply this mechanism to all tables, do we still need compaction group split? What extra benefit can compaction group split bring?

@@ -33,9 +33,10 @@ backend = "Mem"
 periodic_space_reclaim_compaction_interval_sec = 3600
 periodic_ttl_reclaim_compaction_interval_sec = 1800
 periodic_tombstone_reclaim_compaction_interval_sec = 600
-periodic_split_compact_group_interval_sec = 180
+periodic_split_compact_group_interval_sec = 10
Collaborator:

Is this change expected?

@Li0k (Contributor, Author, Nov 20, 2023):

Yes, I think this check should be more sensitive (for backfill). What do you think?

Comment on lines +149 to +150
group_to_table_vnode_partition:
parking_lot::RwLock<HashMap<CompactionGroupId, BTreeMap<TableId, u32>>>,
Collaborator:

Hmm... This is always mutated when we hold the compaction lock. Why not put it under Compaction?

@@ -2532,6 +2586,7 @@ impl HummockManager {
}
}

table_vnode_partition_mapping.insert(*table_id, partition_vnode_count);
Collaborator:

The vnode partition mapping set in L2553 can be overwritten here. Is this expected?

Contributor (Author):

Okay, I'll remove it. Currently we set it up when we build the compact_task.

Comment on lines 310 to 312
bool split_by_state_table = 21;
// Compaction needs to cut the state table every time 1/weight of vnodes in the table have been processed.
uint32 split_weight_by_vnode = 22;
Collaborator:

Are they deprecated now?

Contributor (Author):

Not in this PR; the new table_vnode_partition is not persisted.

Collaborator:

Why is it not persisted? Are split_by_state_table and split_weight_by_vnode persisted? Why does persistence affect the deprecation of these fields?

Collaborator:

Btw, we should document which fields in CompactTask are persisted and which are not.

@hzxa21 (Collaborator) commented Nov 15, 2023

Benchmark for this PR: #13037 (comment)

@@ -2557,6 +2612,13 @@ impl HummockManager {
}
}
}

tracing::info!(
Contributor:

Would this log be printed too frequently? I think we do not need to print it every time.

Contributor (Author):

Changed it to debug. We can provide a separate interface to get table_vnode_partition if needed.

@Little-Wallace (Contributor):

> This PR basically makes the cut by table and vnode partition dynamic for creating tables. Random thought: if we apply this mechanism to all tables, do we still need compaction group split? What extra benefit can compaction group split bring?

When we split one table into an independent group, it means that we calculate the score of its data independently, and we also use a larger base level size for it.

@Li0k Li0k requested a review from zwang28 November 20, 2023 07:21
@Little-Wallace (Contributor) left a comment:

LGTM


}
}

if !created_tables.contains(table_id) {
Collaborator:

nit: this can be !is_creating_table

Contributor (Author):

Should be is_creating_table.

Comment on lines 2591 to 2604

// When in a hybrid compaction group, data from multiple state tables may exist in a single sst, and in order to make the data in the sub level more aligned, a proactive cut is made for the data.
// https://github.com/risingwavelabs/risingwave/issues/13037
// 1. In a backfill scenario, the state_table / mv may have high write throughput. Therefore, we relax the limit of `is_low_write_throughput` and partition the table with high write throughput by vnode to improve the parallel efficiency of compaction.
// 2. For a table with low throughput, partition by table_id to minimize amplification.
// 3. When the write mode is changed (the above conditions are not met), the default behavior is restored.
if !is_low_write_throughput {
    table_vnode_partition_mapping.insert(*table_id, hybrid_vnode_count);
} else if state_table_size > self.env.opts.cut_table_size_limit {
    table_vnode_partition_mapping.insert(*table_id, SPLIT_BY_TABLE);
} else {
    table_vnode_partition_mapping.remove(table_id);
}
Collaborator:

The comments here indicate that case 1 is only for backfill (i.e. creating table ids), but in the code it applies to created table ids as well, because it is not guarded by is_creating_table. Is this expected?

Contributor (Author):

It's expected. What I'm trying to explain is that tables meeting this condition are more likely to appear in backfill scenarios, since Hummock currently prohibits splitting creating tables.

In short, we allow both creating and created tables to match this rule; when a created table meets the split condition, it is more likely to be split away. I've changed the comment to make it easier to understand.

if group_config.compaction_config.split_weight_by_vnode > 0 {
// TODO: unify to table_to_vnode_partition directly
table_to_vnode_partition.insert(
member_table_ids[0],
Collaborator:

There is an implicit assumption here: there is only a single item in member_table_ids when the if condition is met. We should add an assertion to check, or throw an error if the assumption doesn't hold.
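
A sketch of such a guard (hypothetical helper name; the real code path may prefer a typed error over anyhow):

// Hypothetical guard for the single-member-table assumption.
fn single_member_table_id(member_table_ids: &[u32]) -> anyhow::Result<u32> {
    anyhow::ensure!(
        member_table_ids.len() == 1,
        "split_weight_by_vnode > 0 expects exactly one member table, got {}",
        member_table_ids.len()
    );
    Ok(member_table_ids[0])
}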

Collaborator:

Instead of updating table_to_vnode_partition ad hoc in get_compact_task, I think it is more appropriate to update self.group_to_table_vnode_partition correspondingly in the following cases:

  1. A new group is created.
  2. split_weight_by_vnode is modified for an existing group.
  3. New table ids are added to a group.

Maybe that is what you mean by "// TODO: unify to table_to_vnode_partition directly"?

let window_size = HISTORY_TABLE_INFO_STATISTIC_TIME / (checkpoint_secs as usize);

let mut group_to_table_vnode_partition = HashMap::default();
Collaborator:

We will clear group_to_table_vnode_partition and re-calculate it. Just to double-check: do we guarantee that group_infos will contain all compaction groups and all table ids?

@@ -230,13 +230,21 @@ async fn compact_shared_buffer(
let mut compaction_futures = vec![];
let use_block_based_filter = BlockedXor16FilterBuilder::is_kv_count_too_large(total_key_count);

let table_vnode_partition = if existing_table_ids.len() == 1 {
let table_id = existing_table_ids.iter().next().unwrap();
vec![(*table_id, split_weight_by_vnode as u32)]
@zwang28 (Contributor, Nov 24, 2023):

Should we consider split_weight_by_vnode == 0, or 1?
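
One way to handle those values, as a sketch (assuming a split_weight_by_vnode of 0 or 1 should mean "no vnode partitioning"; simplified signature, not the PR's exact code):

use std::collections::HashSet;

// Emit a per-vnode partition entry only when it actually splits data.
fn table_vnode_partition(
    existing_table_ids: &HashSet<u32>,
    split_weight_by_vnode: u32,
) -> Vec<(u32, u32)> {
    if existing_table_ids.len() == 1 && split_weight_by_vnode > 1 {
        let table_id = existing_table_ids.iter().next().unwrap();
        vec![(*table_id, split_weight_by_vnode)]
    } else {
        // 0 or 1 partitions degenerate to the default, unpartitioned path.
        Vec::new()
    }
}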

@@ -206,7 +198,11 @@ where
if switch_builder {
need_seal_current = true;
} else if builder.reach_capacity() {
if self.split_weight_by_vnode == 0 || builder.reach_max_sst_size() {
let is_split_table = self
@zwang28 (Contributor, Nov 24, 2023):

The value can be 0. There may be other places where the corner cases of split_weight_by_vnode being 0 or 1 can happen.

Contributor (Author):

In the new version, contains_key is used to decide whether the table needs to be partitioned, while the old version set split_weight_by_vnode = 0 for the default group (which does not partition the tables in the default group).
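
The difference, as a minimal sketch (simplified signatures, not the real ones):

use std::collections::BTreeMap;

// Old behavior: a group-wide weight, where 0 meant "do not partition".
fn should_partition_old(split_weight_by_vnode: u32) -> bool {
    split_weight_by_vnode > 0
}

// New behavior: presence of the table in the per-table map decides.
fn should_partition_new(table_partition_vnode: &BTreeMap<u32, u32>, table_id: u32) -> bool {
    table_partition_vnode.contains_key(&table_id)
}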

VirtualNode::COUNT / (self.split_weight_by_vnode as usize) - 1;
} else {
// default
self.largest_vnode_in_current_partition = VirtualNode::MAX.to_index();
Contributor:

Did we forget to set largest_vnode_in_current_partition when split_weight_by_vnode was 1 before this PR? Will it cause problems?

Contributor (Author):

largest_vnode_in_current_partition: VirtualNode::MAX.to_index()

The default value of largest_vnode_in_current_partition is VirtualNode::MAX.to_index(), so it will not cause problems even if we set split_weight_by_vnode to 0 or 1.

In short, we can turn off this feature by setting vnode_partition_count to 0.
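
A toy reconstruction of the boundary math (VNODE_COUNT = 256 is an assumption standing in for VirtualNode::COUNT; not the PR's exact code):

// With 0 or 1 partitions the boundary is the last vnode, i.e. the feature is off.
const VNODE_COUNT: usize = 256;

fn largest_vnode_in_partition(vnode_partition_count: usize, current_vnode: usize) -> usize {
    if vnode_partition_count > 1 {
        let partition_size = VNODE_COUNT / vnode_partition_count;
        // Upper boundary of the partition that `current_vnode` falls into.
        (current_vnode / partition_size + 1) * partition_size - 1
    } else {
        // Equivalent to VirtualNode::MAX.to_index(): one partition covers all vnodes.
        VNODE_COUNT - 1
    }
}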

@hzxa21 (Collaborator) left a comment:

Rest LGTM!

// 1. throughput and size are checked periodically and calculated according to the rules
// 2. A new group is created (split)
// 3. split_weight_by_vnode is modified for an existing group. (not supported yet)
group_to_table_vnode_partition:
Collaborator:

Can you also document what it means if:

  • a table id is not found in this mapping
  • a table id is found in this mapping with 0 as the vnode partition number
  • a table id is found in this mapping with 1 as the vnode partition number
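
For example, a doc comment along these lines (the case semantics are inferred from this thread and would need to be confirmed by the author; the surrounding type aliases are stand-ins):

use std::collections::{BTreeMap, HashMap};

type CompactionGroupId = u64;
type TableId = u32;

/// Inferred semantics of the mapping (sketch, to be confirmed):
/// - table id absent from the inner map: no proactive cut for this table.
/// - present with value 0 (SPLIT_BY_TABLE): cut SSTs at table boundaries only.
/// - present with value n >= 1: cut at table boundaries and into n vnode
///   partitions (n == 1 degenerates to a single whole-table partition).
struct PartitionHints {
    group_to_table_vnode_partition:
        parking_lot::RwLock<HashMap<CompactionGroupId, BTreeMap<TableId, u32>>>,
}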

@Li0k Li0k enabled auto-merge December 1, 2023 08:11
@Li0k Li0k added this pull request to the merge queue Dec 1, 2023
Merged via the queue into main with commit 26bd9ef Dec 1, 2023
28 of 29 checks passed
@Li0k Li0k deleted the li0k/storage_cut_sst_file branch December 1, 2023 08:50
Li0k added a commit that referenced this pull request Dec 4, 2023