feat(storage): optimize data alignment for default compaction group #13075
```diff
@@ -33,9 +33,10 @@ backend = "Mem"
 periodic_space_reclaim_compaction_interval_sec = 3600
 periodic_ttl_reclaim_compaction_interval_sec = 1800
 periodic_tombstone_reclaim_compaction_interval_sec = 600
-periodic_split_compact_group_interval_sec = 180
+periodic_split_compact_group_interval_sec = 10
 move_table_size_limit = 10737418240
 split_group_size_limit = 68719476736
 cut_table_size_limit = 1073741824
 do_not_config_object_storage_lifecycle = false
 partition_vnode_count = 64
+table_write_throughput_threshold = 16777216
```

Review comment (on the `periodic_split_compact_group_interval_sec` change): Is this change expected?
Reply: Yes, I think this check should be more sensitive (for backfill). What do you think?
```diff
@@ -80,7 +80,8 @@ use crate::hummock::metrics_utils::{
 };
 use crate::hummock::{CompactorManagerRef, TASK_NORMAL};
 use crate::manager::{
-    CatalogManagerRef, ClusterManagerRef, FragmentManagerRef, IdCategory, MetaSrvEnv, META_NODE_ID,
+    CatalogManagerRef, ClusterManagerRef, FragmentManagerRef, IdCategory, MetaSrvEnv, TableId,
+    META_NODE_ID,
 };
 use crate::model::{
     BTreeMapEntryTransaction, BTreeMapTransaction, ClusterId, MetadataModel, ValTransaction,
```
```diff
@@ -144,6 +145,9 @@ pub struct HummockManager {
     // `compaction_state` will record the types of compact tasks that can be triggered in `hummock`
     // and suggest types with a certain priority.
    pub compaction_state: CompactionState,
+
+    group_to_table_vnode_partition:
+        parking_lot::RwLock<HashMap<CompactionGroupId, BTreeMap<TableId, u32>>>,
 }

 pub type HummockManagerRef = Arc<HummockManager>;
```

Review comment (on the new `group_to_table_vnode_partition` field): Can you also document what it means if: …

Review comment (on lines +160 to +161): Hmm... This is always mutated when we hold the compaction lock. Why not put it under …
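The second comment above questions the extra `parking_lot::RwLock`: the map is only mutated while the compaction lock is held, so it could arguably live inside the state that lock already guards. A minimal sketch of that alternative, purely illustrative (the `Compaction` struct name, the field placement, and the type aliases are assumptions, not code from this PR):

```rust
use std::collections::{BTreeMap, HashMap};

// Stand-ins so the sketch compiles on its own; the crate defines its own types.
type CompactionGroupId = u64;
type TableId = u32;

// Hypothetical alternative: keep the mapping inside the compaction state that is
// already protected by the compaction lock, so no separate RwLock is needed.
pub struct Compaction {
    // ... existing fields such as `compact_task_assignment` ...

    /// Per compaction group: the vnode partition count to apply to each member table.
    /// A missing table entry would mean "no vnode-aligned cut for that table".
    pub group_to_table_vnode_partition: HashMap<CompactionGroupId, BTreeMap<TableId, u32>>,
}
```

The trade-off is that readers would then look the mapping up after taking the compaction lock rather than through a cheap read lock, although `get_compact_task` acquires that lock right afterwards anyway (see the hunk below).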
```diff
@@ -390,6 +394,7 @@ impl HummockManager {
             history_table_throughput: parking_lot::RwLock::new(HashMap::default()),
             compactor_streams_change_tx,
             compaction_state: CompactionState::new(),
+            group_to_table_vnode_partition: parking_lot::RwLock::new(HashMap::default()),
         };
         let instance = Arc::new(instance);
         instance.start_worker(rx).await;
```
```diff
@@ -813,6 +818,14 @@ impl HummockManager {
         // lock in compaction_guard, take out all table_options in advance there may be a
         // waste of resources here, need to add a more efficient filter in catalog_manager
         let all_table_id_to_option = self.catalog_manager.get_all_table_options().await;
+        let mut table_to_vnode_partition = match self
+            .group_to_table_vnode_partition
+            .read()
+            .get(&compaction_group_id)
+        {
+            Some(table_to_vnode_partition) => table_to_vnode_partition.clone(),
+            None => BTreeMap::default(),
+        };

         let mut compaction_guard = write_lock!(self, compaction).await;
         let compaction = compaction_guard.deref_mut();
```
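As a side note, the match-with-default above can also be expressed with `Option` combinators; a behaviour-equivalent sketch (assuming the field and lock types introduced earlier):

```rust
// Clone the per-group mapping if present, otherwise start from an empty BTreeMap.
// The temporary read guard is dropped at the end of the statement, i.e. before
// the compaction write lock below is acquired.
let mut table_to_vnode_partition = self
    .group_to_table_vnode_partition
    .read()
    .get(&compaction_group_id)
    .cloned()
    .unwrap_or_default();
```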
```diff
@@ -903,6 +916,7 @@ impl HummockManager {
             .unwrap()
             .member_table_ids
             .clone();
+
         let is_trivial_reclaim = CompactStatus::is_trivial_reclaim(&compact_task);
         let is_trivial_move = CompactStatus::is_trivial_move_task(&compact_task);
```
```diff
@@ -963,6 +977,18 @@ impl HummockManager {
         compact_task.current_epoch_time = Epoch::now().0;
         compact_task.compaction_filter_mask =
             group_config.compaction_config.compaction_filter_mask;
+        table_to_vnode_partition
+            .retain(|table_id, _| compact_task.existing_table_ids.contains(table_id));
+
+        if group_config.compaction_config.split_weight_by_vnode > 0 {
+            // TODO: unify to table_to_vnode_partition directly
+            table_to_vnode_partition.insert(
+                member_table_ids[0],
+                group_config.compaction_config.split_weight_by_vnode,
+            );
+        }
+
+        compact_task.table_vnode_partition = table_to_vnode_partition;

         let mut compact_task_assignment =
             BTreeMapTransaction::new(&mut compaction.compact_task_assignment);
```

Review comment: There is an implicit assumption here: there is only a single item in `member_table_ids`. …

Review comment: Instead of updating `table_to_vnode_partition` ad-hoc in … Maybe that is what you mean by "// TODO: unify to table_to_vnode_partition directly"?
```diff
@@ -2164,7 +2190,6 @@ impl HummockManager {
             ));
         split_group_trigger_interval
             .set_missed_tick_behavior(tokio::time::MissedTickBehavior::Delay);
-        split_group_trigger_interval.reset();

         let split_group_trigger = IntervalStream::new(split_group_trigger_interval)
             .map(|_| HummockTimerEvent::GroupSplit);
```
```diff
@@ -2484,31 +2509,60 @@ impl HummockManager {
         let mv_group_id: CompactionGroupId = StaticCompactionGroupId::MaterializedView.into();
         let partition_vnode_count = self.env.opts.partition_vnode_count;
         let window_size = HISTORY_TABLE_INFO_STATISTIC_TIME / (checkpoint_secs as usize);

+        let mut group_to_table_vnode_partition = HashMap::default();
+
         for group in &group_infos {
+            let table_vnode_partition_mapping = group_to_table_vnode_partition
+                .entry(group.group_id)
+                .or_insert(BTreeMap::default());
+
             if group.table_statistic.len() == 1 {
                 // no need to handle the dedication compaciton group
                 continue;
             }

             for (table_id, table_size) in &group.table_statistic {
-                if !created_tables.contains(table_id) {
-                    continue;
-                }
                 let mut is_high_write_throughput = false;
                 let mut is_low_write_throughput = true;
+                let is_creating_table = !created_tables.contains(table_id);
                 if let Some(history) = table_write_throughput.get(table_id) {
-                    if history.len() >= window_size {
-                        is_high_write_throughput = history.iter().all(|throughput| {
-                            *throughput / checkpoint_secs
-                                > self.env.opts.table_write_throughput_threshold
-                        });
-                        is_low_write_throughput = history.iter().any(|throughput| {
-                            *throughput / checkpoint_secs
-                                < self.env.opts.min_table_split_write_throughput
-                        });
+                    if !is_creating_table {
+                        if history.len() >= window_size {
+                            is_high_write_throughput = history.iter().all(|throughput| {
+                                *throughput / checkpoint_secs
+                                    > self.env.opts.table_write_throughput_threshold
+                            });
+                            is_low_write_throughput = history.iter().any(|throughput| {
+                                *throughput / checkpoint_secs
+                                    < self.env.opts.min_table_split_write_throughput
+                            });
+                        }
+                    } else {
+                        let sum = history.iter().sum::<u64>();
+                        is_low_write_throughput = sum
+                            < self.env.opts.min_table_split_write_throughput
+                                * history.len() as u64
+                                * checkpoint_secs;
                     }
                 }
                 let state_table_size = *table_size;

+                {
+                    if !is_low_write_throughput {
+                        table_vnode_partition_mapping.insert(*table_id, 4_u32);
+                    } else if state_table_size > self.env.opts.cut_table_size_limit {
+                        table_vnode_partition_mapping.insert(*table_id, 1_u32);
+                    } else {
+                        table_vnode_partition_mapping.remove(table_id);
+                    }
+                }
+
+                if !created_tables.contains(table_id) {
+                    // do not split the creating table
+                    continue;
+                }
+
                 if is_low_write_throughput {
                     continue;
                 }
```

Review comment (on the new `group_to_table_vnode_partition` map): We will clear `group_to_table_vnode_partition` and re-calculate it. Just to double check: do we guarantee that …

Review comment (on the second `!created_tables.contains(table_id)` check): nits: this can be …
Reply: Should be `is_creating_table`.
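Applying the "nits" suggestion above would simply reuse the flag computed earlier instead of repeating the set lookup; a trivial sketch of that follow-up (not one of the commits shown here):

```rust
// Hypothetical application of the review suggestion: reuse the flag instead of
// calling created_tables.contains(table_id) a second time.
if is_creating_table {
    // do not split the creating table
    continue;
}
```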
```diff
@@ -2532,6 +2586,7 @@ impl HummockManager {
                     }
                 }

+                table_vnode_partition_mapping.insert(*table_id, partition_vnode_count);
                 let ret = self
                     .move_state_table_to_compaction_group(
                         parent_group_id,
```

Review comment: The vnode partition mapping set in L2553 can be overwritten here. Is this expected?
Reply: Okay, I'm going to remove it; currently we'll set it up when we build compact_task.
```diff
@@ -2557,6 +2612,13 @@ impl HummockManager {
                 }
             }
         }
+
+        tracing::info!(
+            "group_to_table_vnode_partition {:?}",
+            group_to_table_vnode_partition
+        );
+
+        *self.group_to_table_vnode_partition.write() = group_to_table_vnode_partition;
     }

     #[named]
```

Review comment (on the `tracing::info!`): Would this log be too frequent? I think we do not need to print it every time.
Reply: Change it to debug; we can provide a separate interface to get `table_vnode_partition` if needed.
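The follow-up agreed on in the logging thread (demoting the per-tick dump to debug level) would look roughly like this; a sketch only, since the follow-up commit is not shown here:

```rust
// Hypothetical follow-up: log the recomputed mapping at debug level so it does
// not spam the meta log on every GroupSplit tick.
tracing::debug!(
    "group_to_table_vnode_partition {:?}",
    group_to_table_vnode_partition
);
```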
Review comment: Are they deprecated now?
Reply: No; in this PR, the new `table_vnode_partition` is not persistent.
Review comment: Why is it not persistent? Are `split_by_state_table` and `split_weight_by_vnode` persistent? Why does persistence affect the deprecation of these fields?
Review comment: Btw, we should document which fields in `CompactTask` are persistent and which are not.