Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(meta): trivial move bug #15659

Merged

Conversation

Little-Wallace
Copy link
Contributor

@Little-Wallace Little-Wallace commented Mar 13, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

A similar bug like #11651 but it happens in level 0.

pick_whole_level will pick a wrong task so we must ban concurrent trivial move task.
For example, a task try to compact sstable files of range ["aa", "ac") and sstable of range ["cc", "cd") in sub-level 3 to target sub level 2, overlapping with sstable files of range ["aa", "ac") and range ["cc", "cd"), and then this task generates a sstable file with range ["aa", "cd").
But before this task finished, another sstable file with range ["bb", "bc") are compacted to target sub level 3 from sub level 4 by trivial-move, which does not overlap with the input files of task mentioned above. And then this file are compacted to target sub level 2 from sub level 3 by trivial-move again, which does not overlap with the input files in sub level 2 of task mentioned above, too.
When meta-node try to apply the result of compaction task, the sstable file with range ["aa", "cd") to target sub level 2, it will find there has been another file with range ["bb", "bc") overlapping with this result and crush to panic.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Signed-off-by: Little-Wallace <[email protected]>
@Li0k Li0k self-requested a review March 14, 2024 07:51
@@ -272,18 +276,6 @@ impl IntraCompactionPicker {
assert!(overlap
.check_multiple_overlap(&l0.sub_levels[idx].table_infos)
.is_empty());
let mut target_level_idx = idx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until we're sure this has no impact, keep the comments

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It may cause some bug...
And I think this operation is unsed after #15523

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be more specific :)

No. It may cause some bug...

What is the bug? Is it the same bug in this issue? I think the intention of the change here is to disallow trivial move for more than one level.

And I think this operation is unsed after #15523

Why is the operation here relevant to #15523. I though #15523 is just a batching optimization to amortize the cost of ETCD I/O and memory copy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because if we can batch several task in once transaction, we do not need to move sstable file across several sub levels.
I think this operation may be dangerous

@@ -1056,11 +1056,10 @@ impl HummockManager {
} else {
table_to_vnode_partition
.retain(|table_id, _| compact_task.existing_table_ids.contains(table_id));
if group_config.compaction_config.split_weight_by_vnode > 0 {
if compact_task.existing_table_ids.len() == 1 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will break the partition of hybrid cg like cg2, cg3. why should we limit the existing_table_ids size ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, I will fix it

@Li0k
Copy link
Contributor

Li0k commented Mar 14, 2024

PTAL @hzxa21

@Li0k Li0k requested a review from hzxa21 March 14, 2024 08:32
Copy link
Collaborator

@hzxa21 hzxa21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realize that none of this PR, #11651 and #11653 describe what the bug is. Please create an issue or use PR description to describe the bug (better with an example). For such a subtle bug, it will be hard to memorize in the future without some notes.

It is better to write some comments (with reference to the issue/PR) in the relevant codes (e.g. those if-continue arms)

Comment on lines +246 to +248
if level_handlers[0].is_level_pending_compact(level) {
continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this check sufficient? Is it possible that there is a race between a trivial move task and a regular task both outputting to a target sub-level with no SSTs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand, why there would be race?
If some regular task try to compact target level x from y, the trivial move task to sub level either x or y would be prevent.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand, why there would be race? If some regular task try to compact target level x from y, the trivial move task to sub level either x or y would be prevent.

Is it possible to has a regular task trying to compact level x, x+1, x+2, x+3 to y and the trivial move task trying to compact task from x+1 to x+2 while x+2 is empty?

Signed-off-by: Little-Wallace <[email protected]>
Copy link
Collaborator

@hzxa21 hzxa21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hzxa21
Copy link
Collaborator

hzxa21 commented Mar 18, 2024

I just realize that none of this PR, #11651 and #11653 describe what the bug is. Please create an issue or use PR description to describe the bug (better with an example). For such a subtle bug, it will be hard to memorize in the future without some notes.

It is better to write some comments (with reference to the issue/PR) in the relevant codes (e.g. those if-continue arms)

Don't forget this.

@Little-Wallace Little-Wallace added this pull request to the merge queue Mar 19, 2024
Merged via the queue into risingwavelabs:main with commit cb4228a Mar 19, 2024
26 of 27 checks passed
@Little-Wallace Little-Wallace deleted the wallace/fix-trivial-move branch March 19, 2024 03:54
Little-Wallace added a commit to Little-Wallace/risingwave that referenced this pull request Mar 19, 2024
@Little-Wallace Little-Wallace mentioned this pull request Mar 19, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants