Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store per-vnode watermark in HummockVersion #13148

Closed
wenym1 opened this issue Oct 30, 2023 · 5 comments
Closed

Store per-vnode watermark in HummockVersion #13148

wenym1 opened this issue Oct 30, 2023 · 5 comments
Assignees
Milestone

Comments

@wenym1
Copy link
Contributor

wenym1 commented Oct 30, 2023

Is your feature request related to a problem? Please describe.

Currently in temporal filter and kv log store, we expect light-weight state cleaning instead of deleting the kv entries one by one. To support this feature, we implement a general purpose range delete in our storage, which handles the whole range within a table and we can write general delete range tombstones like the following:

epoch1: (vnode1-wtm1, vnode1-wtm2), (vnode2-wtm1, vnode2-wtm2), ...
epoch2: (vnode1-wtm2, vnode1-wtm3), (vnode2-wtm2, vnode2-wtm3), ...
...

To support such general purpose range delete, we have introduced considerable complexity in hummock, including complex read logic, extra compaction policy and instability in runtime, and also the delete ranges tombstones are occupying in-ignorable storage size in the sst meta file.

However, our current use case for state cleaning does not depend on a general purpose delete range. The current state cleaning only requires a per vnode monotonically increasing watermark. Therefore, if we can maintain a per vnode watermark in HummockVersion to support a light-weight state cleaning.

The extra information will be like

epoch1: (vnode1-wtm1), (vnode2-wtm1), ...
epoch2: (vnode1-wtm2), (vnode2-wtm2), ...
...

Describe the solution you'd like

Metadata change

In HummockVersion, we store an extra per vnode watermark(has the semantic as lowest key in vnode).

Support for MVCC read

To support MVCC, such per vnode watermark should be maintained for each epoch between the max committed epoch and the safe epoch.

safe epoch
epoch1: (vnode1-wtm1), (vnode2-wtm1), ...
epoch2: (vnode1-wtm2), (vnode2-wtm2), ...
...
max committed epoch

When serving MVCC read, if the table has watermark defined in hummock version, assume that the read is within a single vnode, then we first search the latest watermark earlier than the given epoch of the vnode.

  • For a get request, if the key is below the watermark, we return None directly, and otherwise follows the original logic.
  • For a iter request, if the left bound of the key range is below the watermark, we change the left bound to the watermark, and then follow the original logic.

Compaction

The per vnode watermark will be passed to compaction tasks.

In the compaction runner, if a key is below the watermark of the corresponding epoch, the key will be skipped and will not be included in the output sst.

vnode compression

In our use case, the vnodes within a parallelism might share the same watermark, and then we don't need to store a watermark for each vnode, and instead we can mark that a group of vnodes shares a same watermark. The size of metadata can be reduced by this way.

storage interface change

In the seal_epoch of StateStore or the seal_current_epoch of LocalStateStore, we can add a new parameter to optionally pass the per vnode watermark to storage.

Describe alternatives you've considered

No response

Additional context

This proposal does not conflict with the current range delete implementation.

It can serve as an extra lighter way to write range delete.

@fuyufjh
Copy link
Member

fuyufjh commented Nov 15, 2023

In the seal_epoch of StateStore or the seal_current_epoch of LocalStateStore, we can add a new parameter to optionally pass the per vnode watermark to storage.

After reading #13429 I feel like the implementation is more complicated than I thought. Now I am suspecting that is it really good to introduce the concept "watermark" to Hummock. 🤔 It's like a specialized version of range delete, so I am thinking whether it's possible to do some little optimization on range delete?

@wenym1
Copy link
Contributor Author

wenym1 commented Nov 16, 2023

After reading #13429 I feel like the implementation is more complicated than I thought. Now I am suspecting that is it really good to introduce the concept "watermark" to Hummock. 🤔 It's like a specialized version of range delete, so I am thinking whether it's possible to do some little optimization on range delete?

The current code in #13429 includes only two simple things: adding a an extra field to HummockVersion and collecting the watermark written by each parallelism. No complicated logic is included yet. Most of LOC in the PR are caused by unit test changes due to the change of parameter list of seal_current_epoch and commit_epoch, which are widely used in unit tests. The main logic is collecting the watermark along with SST metadata from lower level uploader all the way to meta node, which I think is simple and trivial.

@fuyufjh
Copy link
Member

fuyufjh commented Nov 16, 2023

After reading #13429 I feel like the implementation is more complicated than I thought. Now I am suspecting that is it really good to introduce the concept "watermark" to Hummock. 🤔 It's like a specialized version of range delete, so I am thinking whether it's possible to do some little optimization on range delete?

The current code in #13429 includes only two simple things: adding a an extra field to HummockVersion and collecting the watermark written by each parallelism. No complicated logic is included yet. Most of LOC in the PR are caused by unit test changes due to the change of parameter list of seal_current_epoch and commit_epoch, which are widely used in unit tests. The main logic is collecting the watermark along with SST metadata from lower level uploader all the way to meta node, which I think is simple and trivial.

Sorry, my statement is too vague. The part I feel "more complicated than I thought" is the additional arg opts: SealCurrentEpochOptions in

fn seal_current_epoch(&mut self, next_epoch: u64, opts: SealCurrentEpochOptions);

Previously, the watermark is just a computation concept, even though it's implemented in the state_table layer in #6495, which I consider as an intersection area between compute and storage. While in this PR, it invades more deeply into the storage via this interface.

... so I am thinking whether it's possible to do some little optimization on range delete?

I am thinking that, as your major motivation is to utilize the property of watermark: "we first search the latest watermark earlier than the given epoch of the vnode.", Can we apply this optimization to range deletes?

By the way, temporal filter is actually orthogonal with watermark. Temporal filter deletes records because now() is monotonically increasing, not because of the advance of watermark, so it still relies on range deletes.

@wenym1
Copy link
Contributor Author

wenym1 commented Nov 16, 2023

The part I feel "more complicated than I thought" is the additional arg opts: SealCurrentEpochOptions

Having an extra opts is unavoidable if we want to pass more information other than a simple next_epoch: u64 when we receive a barrier. This extra opts args is a placeholder for a list of parameters so that in the future when we want to add a new field for the method, we won't involve too many LOC. In other storage method, we also have similar XxxOptions such as ReadOptions and InitOptions, to provide extra information when calling the methods.

I am thinking that, as your major motivation is to utilize the property of watermark: "we first search the latest watermark earlier than the given epoch of the vnode.", Can we apply this optimization to range deletes?
By the way, temporal filter is actually orthogonal with watermark. Temporal filter deletes records because now() is monotonically increasing, not because of the advance of watermark, so it still relies on range deletes.

Actually the term watermark in this PR is not the same thing as the Message::Watermark flowing through the stream graph. It's something monotonically increasing for state cleaning. If the term is confusing, we can rename it to another name, like KeyLowerBound, though the field related to state cleaning in StateTable is also xxx_watermark, and watermark is more natural to represent something monotonically increasing.

The current range delete implementation is for general purpose range delete. Its range delete tombstones are mixed into each SST, and the original range delete tombstones passed from compute layer is split by SST key range in compaction and it's hard to let the current range delete implementation to apply optimization with the broken tombstones. The main purpose of this proposal is actually reimplementing a much lighter-weight range delete. Many problems we meet in the current implementation can be fixed easily and elegantly.

@wenym1 wenym1 modified the milestones: release-1.5, release-1.6 Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants