
feat(storage): optimize data alignment for default compaction group #13075

Merged

19 commits merged into main on Dec 1, 2023

Conversation

@Li0k (Contributor) commented Oct 26, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

#13037

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@gitguardian (bot) commented Oct 31, 2023

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

GitGuardian id   Secret               Commit    Filename
7648795          Generic CLI Secret   0e21215   integration_tests/iceberg-cdc/docker-compose.yml
7648795          Generic CLI Secret   0e21215   integration_tests/iceberg-cdc/run_test.sh
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely, following best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.


@Li0k Li0k marked this pull request as ready for review October 31, 2023 09:55
@codecov (bot) commented Oct 31, 2023

Codecov Report

Attention: 167 lines in your changes are missing coverage. Please review.

Comparison: base (da79ff5) 68.15% vs. head (840dca3) 68.22%.
Report is 1 commit behind head on main.

Files                                                   Patch %   Lines
src/meta/src/hummock/manager/mod.rs                     29.22%    155 Missing ⚠️
src/common/src/array/arrow.rs                           69.23%    8 Missing ⚠️
src/meta/node/src/lib.rs                                0.00%     2 Missing ⚠️
src/meta/src/manager/catalog/mod.rs                     0.00%     1 Missing ⚠️
...age/src/hummock/compactor/shared_buffer_compact.rs   88.88%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #13075      +/-   ##
==========================================
+ Coverage   68.15%   68.22%   +0.06%     
==========================================
  Files        1524     1524              
  Lines      262331   262492     +161     
==========================================
+ Hits       178793   179075     +282     
+ Misses      83538    83417     -121     
Flag Coverage Δ
rust 68.22% <57.39%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown.


self.table_partition_vnode.get(&user_key.table_id.table_id);

if new_vnode_partition_count.is_some()
|| self.table_partition_vnode.contains_key(&self.last_table_id)
Contributor:

Why check self.table_partition_vnode.contains_key(&self.last_table_id)?

Collaborator:

Same question here.

@Li0k (Contributor, Author, Nov 13, 2023):

Example:

table_partition_vnode = [(1, 1)]

sst_1 (table_id 1) -------- built
sst_2 (table_id 1) -------- building

When a full_key for table_id 2 is added: because table_id 1 is in table_partition_vnode, we should finish sst_2 first and create a new sst_3 for table_id 2, which gives better data alignment.

After add_full_key:

sst_1 (table_id 1) -------- built
sst_2 (table_id 1) -------- built

sst_3 (table_id 2) -------- building
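
A minimal sketch of that cut decision (simplified, hypothetical types; the struct and its fields stand in for the real builder state, not the exact PR code):

// Sketch of the cut decision described above.
use std::collections::BTreeMap;

struct CapacitySplitTableBuilder {
    /// table_id -> vnode partition count, for tables that must stay aligned.
    table_partition_vnode: BTreeMap<u32, u32>,
    /// table_id of the last key appended to the SST being built.
    last_table_id: u32,
}

impl CapacitySplitTableBuilder {
    /// Seal the current SST before appending `table_id`'s data whenever the
    /// table switches and either side of the switch is a partitioned table.
    fn need_seal_current(&self, table_id: u32) -> bool {
        table_id != self.last_table_id
            && (self.table_partition_vnode.contains_key(&table_id)
                || self.table_partition_vnode.contains_key(&self.last_table_id))
    }
}

fn main() {
    let builder = CapacitySplitTableBuilder {
        table_partition_vnode: BTreeMap::from([(1, 1)]),
        last_table_id: 1,
    };
    // Appending a key of table 2 while table 1 is partitioned: seal sst_2 first.
    assert!(builder.need_seal_current(2));
}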

@hzxa21 (Collaborator) commented Nov 1, 2023

We need to run a perf benchmark to validate the gain of this optimization. As discussed in the meeting, we can either use the mirror cluster or use datagen to mimic the workload.

@Li0k Li0k force-pushed the li0k/storage_cut_sst_file branch from 3c38e42 to 3ef2635 Compare November 1, 2023 11:45
@hzxa21 (Collaborator) commented Nov 2, 2023

This PR basically makes the cut by table and vnode partition dynamic for creating tables. Random thought: if we apply this mechanism to all tables, do we still need compaction group split? What extra benefit can compaction group split bring?

@@ -33,9 +33,10 @@ backend = "Mem"
 periodic_space_reclaim_compaction_interval_sec = 3600
 periodic_ttl_reclaim_compaction_interval_sec = 1800
 periodic_tombstone_reclaim_compaction_interval_sec = 600
-periodic_split_compact_group_interval_sec = 180
+periodic_split_compact_group_interval_sec = 10
Collaborator:

Is this change expected?

@Li0k (Contributor, Author, Nov 20, 2023):

Yes, I think this check should be more sensitive (for backfill). What do you think?

Comment on lines +149 to +150
group_to_table_vnode_partition:
parking_lot::RwLock<HashMap<CompactionGroupId, BTreeMap<TableId, u32>>>,
Collaborator:

Hmm... This is always mutated when we hold the compaction lock. Why not put it under Compaction?

@@ -2532,6 +2586,7 @@ impl HummockManager {
}
}

table_vnode_partition_mapping.insert(*table_id, partition_vnode_count);
Collaborator:

The vnode partition mapping set in L2553 can be overwritten here. Is this expected?

Contributor (Author):

Okay, I'll remove it. Currently we set it up when we build the compact_task.

Comment on lines 310 to 312
bool split_by_state_table = 21;
// Compaction needs to cut the state table every time 1/weight of vnodes in the table have been processed.
uint32 split_weight_by_vnode = 22;
Collaborator:

Are they deprecated now?

Contributor (Author):

Not in this PR; the new table_vnode_partition is not persisted.

Collaborator:

Why is it not persisted? Are split_by_state_table and split_weight_by_vnode persisted? Why does persistence affect the deprecation of these fields?

Collaborator:

Btw, we should document which fields in CompactTask are persisted and which are not.

@hzxa21 (Collaborator) commented Nov 15, 2023

Benchmark for this PR: #13037 (comment)

@@ -2557,6 +2612,13 @@ impl HummockManager {
}
}
}

tracing::info!(
Contributor:

Would this log be printed too frequently? I think we do not need to print it every time.

Contributor (Author):

Changed it to debug. We can provide a separate interface to get table_vnode_partition if needed.

@Little-Wallace (Contributor):

> This PR basically makes the cut by table and vnode partition dynamic for creating tables. Random thought: if we apply this mechanism to all tables, do we still need compaction group split? What extra benefit can compaction group split bring?

When we split one table into an independent group, it means that we calculate the score of its data independently, and we also use a larger base level size for it.

@Li0k Li0k requested a review from zwang28 November 20, 2023 07:21
@Little-Wallace (Contributor) left a comment:

LGTM


}
}

if !created_tables.contains(table_id) {
Collaborator:

nit: this can be !is_creating_table

Contributor (Author):

Should be is_creating_table.

Comment on lines 2591 to 2604

// When in a hybrid compaction group, data from multiple state tables may exist in a single sst, and in order to make the data in the sub level more aligned, a proactive cut is made for the data.
// https://github.com/risingwavelabs/risingwave/issues/13037
// 1. In a backfill scenario, the state_table / mv may have high write throughput. Therefore, we relax the limit of `is_low_write_throughput` and partition the table with high write throughput by vnode to improve the parallel efficiency of compaction.
// 2. For a table with low throughput, partition by table_id to minimize amplification.
// 3. When the write mode is changed (the above conditions are not met), the default behavior is restored.
if !is_low_write_throughput {
    table_vnode_partition_mapping.insert(*table_id, hybrid_vnode_count);
} else if state_table_size > self.env.opts.cut_table_size_limit {
    table_vnode_partition_mapping.insert(*table_id, SPLIT_BY_TABLE);
} else {
    table_vnode_partition_mapping.remove(table_id);
}
Collaborator:

The comments here indicate that case 1 is only for backfill (i.e. creating table ids), but in the code it applies to created table ids as well, because it is not guarded by is_creating_table. Is this expected?

Contributor (Author):

It's expected. What I'm trying to explain is that tables meeting this condition are more likely to appear in backfill scenarios, since Hummock currently prohibits splitting creating tables.

In short, we allow both creating and created tables to match this rule; when a created table meets the split condition, it is more likely to be split away. I've changed the comment to make it easier to understand.

if group_config.compaction_config.split_weight_by_vnode > 0 {
// TODO: unify to table_to_vnode_partition directly
table_to_vnode_partition.insert(
member_table_ids[0],
Collaborator:

There is an implicit assumption here: there is only a single item in member_table_ids when the if condition is met. We should add an assertion to check, or throw an error if the assumption doesn't hold.
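
A sketch of such a guard (hypothetical helper name; the real code path may prefer a typed error over anyhow):

// Hypothetical guard for the single-member-table assumption.
fn single_member_table_id(member_table_ids: &[u32]) -> anyhow::Result<u32> {
    anyhow::ensure!(
        member_table_ids.len() == 1,
        "split_weight_by_vnode > 0 expects exactly one member table, got {}",
        member_table_ids.len()
    );
    Ok(member_table_ids[0])
}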

Collaborator:

Instead of updating table_to_vnode_partition ad hoc in get_compact_task, I think it is more appropriate to update self.group_to_table_vnode_partition correspondingly in the following cases:

  1. A new group is created.
  2. split_weight_by_vnode is modified for an existing group.
  3. New table ids are added to a group.

Maybe that is what you mean by "// TODO: unify to table_to_vnode_partition directly"?

let window_size = HISTORY_TABLE_INFO_STATISTIC_TIME / (checkpoint_secs as usize);

let mut group_to_table_vnode_partition = HashMap::default();
Collaborator:

We will clear group_to_table_vnode_partition and re-calculate it. Just to double-check: do we guarantee that group_infos will contain all compaction groups and all table ids?

@@ -230,13 +230,21 @@ async fn compact_shared_buffer(
let mut compaction_futures = vec![];
let use_block_based_filter = BlockedXor16FilterBuilder::is_kv_count_too_large(total_key_count);

let table_vnode_partition = if existing_table_ids.len() == 1 {
let table_id = existing_table_ids.iter().next().unwrap();
vec![(*table_id, split_weight_by_vnode as u32)]
@zwang28 (Contributor, Nov 24, 2023):

Should we consider split_weight_by_vnode == 0, or 1?
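
One way to handle those values, as a sketch (assuming a split_weight_by_vnode of 0 or 1 should mean "no vnode partitioning"; simplified signature, not the PR's exact code):

use std::collections::HashSet;

// Emit a per-vnode partition entry only when it actually splits data.
fn table_vnode_partition(
    existing_table_ids: &HashSet<u32>,
    split_weight_by_vnode: u32,
) -> Vec<(u32, u32)> {
    if existing_table_ids.len() == 1 && split_weight_by_vnode > 1 {
        let table_id = existing_table_ids.iter().next().unwrap();
        vec![(*table_id, split_weight_by_vnode)]
    } else {
        // 0 or 1 partitions degenerate to the default, unpartitioned path.
        Vec::new()
    }
}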

@@ -206,7 +198,11 @@ where
if switch_builder {
need_seal_current = true;
} else if builder.reach_capacity() {
if self.split_weight_by_vnode == 0 || builder.reach_max_sst_size() {
let is_split_table = self
@zwang28 (Contributor, Nov 24, 2023):

The value can be 0. There may be other places where the corner cases of split_weight_by_vnode being 0 or 1 can happen.

Contributor (Author):

In the new version, contains_key is used to decide whether the table needs to be partitioned, while the old version set split_weight_by_vnode = 0 for the default group (which does not partition the tables in the default group).
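
The difference, as a minimal sketch (simplified signatures, not the real ones):

use std::collections::BTreeMap;

// Old behavior: a group-wide weight, where 0 meant "do not partition".
fn should_partition_old(split_weight_by_vnode: u32) -> bool {
    split_weight_by_vnode > 0
}

// New behavior: presence of the table in the per-table map decides.
fn should_partition_new(table_partition_vnode: &BTreeMap<u32, u32>, table_id: u32) -> bool {
    table_partition_vnode.contains_key(&table_id)
}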

VirtualNode::COUNT / (self.split_weight_by_vnode as usize) - 1;
} else {
// default
self.largest_vnode_in_current_partition = VirtualNode::MAX.to_index();
Contributor:

Did we forget to set largest_vnode_in_current_partition when split_weight_by_vnode was 1 before this PR? Will it cause problems?

Contributor (Author):

largest_vnode_in_current_partition: VirtualNode::MAX.to_index()

The default value of largest_vnode_in_current_partition is VirtualNode::MAX.to_index(), so it will not cause problems even if we set split_weight_by_vnode to 0 or 1.

In short, we can turn off this feature by setting vnode_partition_count to 0.
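
A toy reconstruction of the boundary math (VNODE_COUNT = 256 is an assumption standing in for VirtualNode::COUNT; not the PR's exact code):

// With 0 or 1 partitions the boundary is the last vnode, i.e. the feature is off.
const VNODE_COUNT: usize = 256;

fn largest_vnode_in_partition(vnode_partition_count: usize, current_vnode: usize) -> usize {
    if vnode_partition_count > 1 {
        let partition_size = VNODE_COUNT / vnode_partition_count;
        // Upper boundary of the partition that `current_vnode` falls into.
        (current_vnode / partition_size + 1) * partition_size - 1
    } else {
        // Equivalent to VirtualNode::MAX.to_index(): one partition covers all vnodes.
        VNODE_COUNT - 1
    }
}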

@hzxa21 (Collaborator) left a comment:

Rest LGTM!

// 1. throughput and size are checked periodically and calculated according to the rules
// 2. A new group is created (split)
// 3. split_weight_by_vnode is modified for an existing group. (not supported yet)
group_to_table_vnode_partition:
Collaborator:

Can you also document what it means if:

  • a table id is not found in this mapping
  • a table id is found in this mapping with 0 as the vnode partition number
  • a table id is found in this mapping with 1 as the vnode partition number
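
For example, a doc comment along these lines (the case semantics are inferred from this thread and would need to be confirmed by the author; the surrounding type aliases are stand-ins):

use std::collections::{BTreeMap, HashMap};

type CompactionGroupId = u64;
type TableId = u32;

/// Inferred semantics of the mapping (sketch, to be confirmed):
/// - table id absent from the inner map: no proactive cut for this table.
/// - present with value 0 (SPLIT_BY_TABLE): cut SSTs at table boundaries only.
/// - present with value n >= 1: cut at table boundaries and into n vnode
///   partitions (n == 1 degenerates to a single whole-table partition).
struct PartitionHints {
    group_to_table_vnode_partition:
        parking_lot::RwLock<HashMap<CompactionGroupId, BTreeMap<TableId, u32>>>,
}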

@Li0k Li0k enabled auto-merge December 1, 2023 08:11
@Li0k Li0k added this pull request to the merge queue Dec 1, 2023
Merged via the queue into main with commit 26bd9ef Dec 1, 2023
28 of 29 checks passed
@Li0k Li0k deleted the li0k/storage_cut_sst_file branch December 1, 2023 08:50
Li0k added a commit that referenced this pull request Dec 4, 2023