Store per-vnode watermark in HummockVersion #13148
Comments
After reading #13429 I feel like the implementation is more complicated than I thought. Now I am suspecting whether it is really a good idea to introduce the concept "watermark" to Hummock. 🤔 It's like a specialized version of range delete, so I am wondering whether it's possible to do some small optimization on range delete instead?
The current code in #13429 includes only two simple things: adding an extra field to …
Sorry, my statement was too vague. The part I feel is "more complicated than I thought" is the additional arg: fn seal_current_epoch(&mut self, next_epoch: u64, opts: SealCurrentEpochOptions); Previously, the watermark was just a computation concept, even though it's implemented in the …
I am thinking that, since your major motivation is to utilize the property of the watermark ("we first search the latest watermark earlier than the given epoch of the vnode"), can we apply this optimization to range deletes? By the way, the temporal filter is actually orthogonal to the watermark. The temporal filter deletes records because …
Having an extra opts is unavoidable if we want to pass more information other than a simple …
The current range delete implementation is for general-purpose range delete. Its range delete tombstones are mixed into each SST, and the original tombstones passed from the compute layer are split by SST key range during compaction, so it's hard for the current implementation to apply this optimization to the broken tombstones. The main purpose of this proposal is actually to reimplement a much lighter-weight range delete. Many problems we meet in the current implementation can be fixed easily and elegantly.
Is your feature request related to a problem? Please describe.
Currently, in the temporal filter and the kv log store, we expect light-weight state cleaning instead of deleting the kv entries one by one. To support this, we implemented a general-purpose range delete in our storage, which can cover a whole range within a table by writing delete range tombstones like the following:
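As a minimal illustration (the type and field names below are assumptions, not the exact Hummock definitions), such a tombstone can be thought of as a half-open user-key range tagged with the epoch of the deletion:

```rust
/// Illustrative sketch of a general-purpose range delete tombstone.
/// Every key in [start_user_key, end_user_key) written at an epoch no
/// later than `epoch` is treated as deleted.
struct DeleteRangeTombstone {
    start_user_key: Vec<u8>, // inclusive left bound
    end_user_key: Vec<u8>,   // exclusive right bound
    epoch: u64,              // the epoch at which the range is deleted
}
```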
To support such general-purpose range delete, we have introduced considerable complexity in hummock, including complex read logic, an extra compaction policy and instability at runtime; the delete range tombstones also occupy a non-negligible amount of space in the SST meta files.
However, our current use case for state cleaning does not depend on a general-purpose delete range. The current state cleaning only requires a per-vnode, monotonically increasing watermark. Therefore, we can maintain a per-vnode watermark in HummockVersion to support light-weight state cleaning. The extra information will look like the following:
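A sketch of what this extra information could look like; the type and field names are assumptions for illustration, not the final definition:

```rust
use std::collections::{BTreeMap, HashMap};

/// Per-table watermarks kept in HummockVersion (illustrative sketch).
struct TableWatermarks {
    /// epoch -> (vnode index -> watermark key).
    /// For a read at some epoch, keys of a vnode that sort below the
    /// vnode's watermark are considered deleted.
    per_epoch: BTreeMap<u64, HashMap<u16, Vec<u8>>>,
}

/// The extra field attached to HummockVersion: table id -> watermarks.
type TableWatermarkIndex = HashMap<u32, TableWatermarks>;
```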
Describe the solution you'd like
Metadata change
In HummockVersion, we store an extra per-vnode watermark (with the semantics of the lowest key in the vnode).
Support for MVCC read
To support MVCC, such a per-vnode watermark should be maintained for each epoch between the max committed epoch and the safe epoch.
When serving an MVCC read, if the table has a watermark defined in the hummock version, and assuming the read falls within a single vnode, we first search for the latest watermark earlier than the given epoch of the vnode (a sketch of this read path follows the list below):
- For a get request, if the key is below the watermark, we return None directly; otherwise we follow the original logic.
- For an iter request, if the left bound of the key range is below the watermark, we change the left bound to the watermark, and then follow the original logic.
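A minimal sketch of this read path, assuming the per-vnode watermarks of a table are kept as an epoch-ordered map (all helper names here are hypothetical):

```rust
use std::collections::BTreeMap;

/// Find the latest watermark whose epoch is not later than the read epoch.
fn watermark_for_epoch(
    watermarks: &BTreeMap<u64, Vec<u8>>, // epoch -> watermark key of one vnode
    read_epoch: u64,
) -> Option<&[u8]> {
    watermarks
        .range(..=read_epoch)
        .next_back()
        .map(|(_, wm)| wm.as_slice())
}

/// get: a key below the watermark is reported as absent without touching
/// the LSM tree; otherwise the original read logic runs.
fn get_short_circuits(key: &[u8], read_epoch: u64, wms: &BTreeMap<u64, Vec<u8>>) -> bool {
    matches!(watermark_for_epoch(wms, read_epoch), Some(wm) if key < wm)
}

/// iter: clamp the left bound of the key range up to the watermark,
/// then continue with the original iteration logic.
fn clamp_left_bound(left: Vec<u8>, read_epoch: u64, wms: &BTreeMap<u64, Vec<u8>>) -> Vec<u8> {
    match watermark_for_epoch(wms, read_epoch) {
        Some(wm) if left.as_slice() < wm => wm.to_vec(),
        _ => left,
    }
}
```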
Compaction
The per-vnode watermarks will be passed to compaction tasks.
In the compaction runner, if a key is below the watermark of the corresponding epoch, the key will be skipped and will not be included in the output SST.
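A sketch of that rule, assuming the compactor receives the same epoch-ordered watermark map as the read path (names are hypothetical):

```rust
use std::collections::BTreeMap;

/// Drop a key from the compaction output when it sorts below the watermark
/// in effect at the key's epoch (illustrative sketch of the rule above).
fn dropped_by_watermark(
    user_key: &[u8],
    key_epoch: u64,
    watermarks: &BTreeMap<u64, Vec<u8>>, // epoch -> watermark key of the key's vnode
) -> bool {
    watermarks
        .range(..=key_epoch)
        .next_back()
        .map(|(_, wm)| user_key < wm.as_slice())
        .unwrap_or(false)
}
```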
vnode compression
In our use case, the vnodes within one parallelism might share the same watermark, so we don't need to store a watermark for each vnode; instead, we can mark that a group of vnodes shares the same watermark. The size of the metadata can be reduced this way.
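One possible shape for this compressed representation (illustrative only; the group membership could equally be a bitmap):

```rust
/// A group of vnodes that currently share one watermark (illustrative sketch).
struct VnodeWatermarkGroup {
    vnodes: Vec<u16>,   // the vnodes in this group; a bitmap would also work
    watermark: Vec<u8>, // the watermark key shared by all vnodes in the group
}

/// Per table and per epoch, the metadata shrinks from one entry per vnode
/// to a short list of groups.
type EpochWatermarks = Vec<VnodeWatermarkGroup>;
```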
storage interface change
In the seal_epoch of StateStore or the seal_current_epoch of LocalStateStore, we can add a new parameter to optionally pass the per-vnode watermark to storage.
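A sketch of how the interface could look, building on the seal_current_epoch signature quoted in the discussion above; the contents of the options struct are an assumption for illustration:

```rust
/// Options passed when sealing an epoch (illustrative sketch).
pub struct SealCurrentEpochOptions {
    /// Optional per-vnode watermarks for the sealed epoch:
    /// (vnode index, watermark key) pairs.
    pub vnode_watermarks: Option<Vec<(u16, Vec<u8>)>>,
}

/// The relevant part of the local state store interface (sketch).
pub trait LocalStateStore {
    /// Seal the current epoch and optionally hand the per-vnode watermarks
    /// to storage so they can be recorded in HummockVersion.
    fn seal_current_epoch(&mut self, next_epoch: u64, opts: SealCurrentEpochOptions);
}
```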
Describe alternatives you've considered
No response
Additional context
This proposal does not conflict with the current range delete implementation.
It can serve as an extra, lighter-weight way to write range deletes.