Improve backfill on table with a large number of tombstones #12680
Comments
To elaborate more on why this is severe:
On each barrier, we will create a new snapshot_read by default. If the iterator could not be created in time, we will restart the process of iterator creation again. So basically backfill will be stuck all the way until compaction cleans up enough tombstone records that iterator creation can work (which may take a long time, or may never happen if the data volume is too high).
2 can only solve the issue under certain workloads where the tombstone keys are accumulated at the beginning of the table, which is a subset of the issues. I prefer 1 too.
+1
I see, that can happen when we have multiple intervals of tombstones. For 1, here's what I'm thinking for implementation:
Within the backfill executor we have to be careful though. I'm not sure if we have to treat a tombstone current_pos differently from a normal one. If we do, perhaps we need some discriminant for whether a current_pos is a tombstone or not.
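A minimal sketch of such a discriminant, assuming hypothetical types rather than the actual backfill state (`BackfillPos` and `resume_key` are made up for illustration):

```rust
// Hypothetical discriminant on current_pos: did the position come from a real
// row that was yielded downstream, or from a tombstone key (progress only)?
#[derive(Debug, Clone, PartialEq)]
enum BackfillPos {
    /// Position corresponds to a row that was yielded downstream.
    Row(Vec<u8>),
    /// Position corresponds to a tombstone key: nothing was yielded for it.
    Tombstone(Vec<u8>),
}

impl BackfillPos {
    /// The key to resume the next snapshot read from, regardless of variant.
    fn resume_key(&self) -> &[u8] {
        match self {
            BackfillPos::Row(k) | BackfillPos::Tombstone(k) => k.as_slice(),
        }
    }
}

fn main() {
    let pos = BackfillPos::Tombstone(b"vnode0/key42".to_vec());
    // Progress can be persisted and resumed from even though no row was emitted.
    assert_eq!(pos.resume_key(), b"vnode0/key42");
}
```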
It seems that "letting the Backfill operator iterate on one vnode at a time" was not mentioned? I suppose this would be the simplest way to solve the problem. It doesn't skip the tombstones (so the interface can be kept as it is now), but there will be only one iterator open.
By the way, skipping tombstones is not specific to BackfillExecutor. Temporal filter / watermark are common patterns in streaming, and every operator may encounter this problem. Thus, I don't like the idea:
Even if backfill iterates on each vnode, it will still encounter the issue, because it cannot maintain the progress it made in the previous epoch.
It can be exposed as a state store interface, and other relational layer interfaces can use it.
Yeah, I can understand your intention. But this is somewhat dirty to me - the caller now needs to understand more about the implementation of storage, which makes it a bad abstraction. I recommend we don't do this unless necessary. How about trying vnode-level iteration first? If that cannot solve the problem completely, we may try this afterwards.
Note down the discussion with @kwannoel offline: Even if we did vnode-level iteration, there can be a bad case: the … To avoid this, one of the solutions is to introduce a … Note that this approach is very specific to …
As long as we can settle on the interface, it is not hard for storage to expose a "raw iter", since the currently exposed user iterator is just a wrapper on top of the storage-internal "raw iter".
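A minimal, self-contained sketch of that relationship, with made-up types rather than the real storage iterators: the user iterator is just the raw iterator with tombstones filtered out, so exposing the raw form mostly means skipping that filter.

```rust
// Hypothetical storage entries: a put or a delete marker (tombstone).
#[derive(Debug, Clone)]
enum RawEntry {
    Put(Vec<u8>),
    Tombstone(Vec<u8>),
}

/// "Raw" iterator: yields every key, including delete markers.
fn raw_iter(entries: Vec<RawEntry>) -> impl Iterator<Item = RawEntry> {
    entries.into_iter()
}

/// "User" iterator: the same stream with tombstones filtered away.
fn user_iter(entries: Vec<RawEntry>) -> impl Iterator<Item = Vec<u8>> {
    raw_iter(entries).filter_map(|e| match e {
        RawEntry::Put(k) => Some(k),
        RawEntry::Tombstone(_) => None,
    })
}

fn main() {
    let entries = vec![
        RawEntry::Tombstone(b"a".to_vec()),
        RawEntry::Tombstone(b"b".to_vec()),
        RawEntry::Put(b"c".to_vec()),
    ];
    // The raw view sees progress on every key; the user view only on "c".
    assert_eq!(raw_iter(entries.clone()).count(), 3);
    assert_eq!(user_iter(entries).collect::<Vec<_>>(), vec![b"c".to_vec()]);
}
```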
Could this be a general optimization for iterating the storage? By maintaining the min key of a table (or even one for each vnode prefix), a full scan can be converted to a range scan in the storage itself to skip those tombstone keys. 🤔 So the application layer does not have to care about this.
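A minimal sketch of that idea under assumed structures (`TableMinKeys` and the per-vnode map are hypothetical, not the storage implementation): if the smallest live key per vnode prefix is known, an unbounded full-scan start can be tightened into a range-scan start.

```rust
use std::collections::HashMap;
use std::ops::Bound;

/// Hypothetical per-table metadata: smallest non-tombstone key per vnode.
struct TableMinKeys {
    min_key_per_vnode: HashMap<u16, Vec<u8>>,
}

impl TableMinKeys {
    /// Turn an unbounded full-scan start into a range-scan start when possible.
    fn start_bound(&self, vnode: u16) -> Bound<Vec<u8>> {
        match self.min_key_per_vnode.get(&vnode) {
            Some(min_key) => Bound::Included(min_key.clone()),
            None => Bound::Unbounded, // no metadata: fall back to a full scan
        }
    }
}

fn main() {
    let meta = TableMinKeys {
        min_key_per_vnode: HashMap::from([(0u16, b"k_100".to_vec())]),
    };
    // Vnode 0 can start directly at its first live key, skipping tombstones.
    assert_eq!(meta.start_bound(0), Bound::Included(b"k_100".to_vec()));
    // Vnode 1 has no metadata, so the scan remains unbounded.
    assert_eq!(meta.start_bound(1), Bound::Unbounded);
}
```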
If we have an upstream MV and there are lots of deletes against it (especially from a temporal filter), these deletes will generate lots of tombstones in SSTs. Our compaction currently only actually deletes those tombstones in the last level. Consider a temporal filter that only keeps the latest one day of data: it is highly possible that 90% of the data are tombstones. This case is very similar to the issue we solved before, i.e. previously we needed to collect at least one chunk to let backfill proceed; now, we need to collect at least one row to let backfill proceed. I think we can construct this case by creating a large table, deleting all rows from it, and finally creating an MV on it to backfill. It seems we can only rely on the compactor to compact these tombstones, or expose an internal key to let backfill proceed.
Like the idea, but this can only solve the case where the tombstones are placed at the start. For example, the worst case is like …
#12702 Reproduced the scenario.
Something to note for approach 2 (exposing tombstone records) is that the "tombstone" key needs to be treated like an actual key when dealing with updates. Consider that the backfill position at the end of an epoch is 4 and corresponds to a tombstone record. If some updates are produced which are less than or equal to this tombstone record, they must still be flushed downstream. Otherwise, when backfill resumes in the next epoch, we will lose these updates: although they are part of the snapshot in the next epoch, our backfill starts from >4 and will not see these updates in the snapshot (which are <= 4).
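A minimal sketch of that invariant, with hypothetical key types: whether `current_pos` came from a real row or a tombstone, upstream updates with key <= current_pos must be forwarded now, because the next epoch's snapshot read resumes strictly after current_pos and will never replay them.

```rust
/// Should an upstream update be forwarded downstream immediately, given the
/// current backfill position (which may point at a tombstone record)?
fn should_forward_upstream(update_key: &[u8], current_pos: Option<&[u8]>) -> bool {
    match current_pos {
        // Nothing backfilled yet: the snapshot read will still cover this key.
        None => false,
        // Already at or past this key, even if the position is a tombstone:
        // forward the update now, or it is lost to the next epoch's snapshot.
        Some(pos) => update_key <= pos,
    }
}

fn main() {
    // current_pos = 4, a tombstone record in the example above.
    let pos: &[u8] = &[4u8];
    assert!(should_forward_upstream(&[3u8], Some(pos)));  // <= 4: forward
    assert!(should_forward_upstream(&[4u8], Some(pos)));  // == 4: forward
    assert!(!should_forward_upstream(&[5u8], Some(pos))); // > 4: snapshot covers it
}
```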
Here's a sample interface for the relational layer: #12708. Not sure what to modify for the state store layer. We can use this to concretely discuss the interface. Left comments on the relevant changes.
Actually... thinking about it further, there are also additional complexities within backfill. Now each chunk will be larger and will contain tombstone rows which should not be yielded downstream. I guess we can mark those rows as invisible in the iterator and compact them before yielding downstream. But this is something we should take note of.
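A minimal, self-contained sketch of the mark-then-compact idea, using a made-up `Row` type (presumably the chunk's visibility bitmap would play the role of `visible` in the real executor):

```rust
/// Hypothetical row in a backfill chunk.
#[derive(Debug, Clone)]
struct Row {
    key: u64,
    is_tombstone: bool,
}

/// Keep tombstone rows in the chunk for progress tracking, but mark them
/// invisible and compact them away before the chunk is yielded downstream.
fn compact_chunk(rows: Vec<Row>) -> Vec<Row> {
    // Step 1: mark visibility (tombstone rows are invisible).
    let visible: Vec<bool> = rows.iter().map(|r| !r.is_tombstone).collect();
    // Step 2: compact, dropping invisible rows before yielding.
    rows.into_iter()
        .zip(visible)
        .filter_map(|(row, vis)| vis.then_some(row))
        .collect()
}

fn main() {
    let chunk = vec![
        Row { key: 1, is_tombstone: false },
        Row { key: 2, is_tombstone: true },
        Row { key: 3, is_tombstone: false },
    ];
    let keys: Vec<u64> = compact_chunk(chunk).iter().map(|r| r.key).collect();
    assert_eq!(keys, vec![1, 3]); // the tombstone row is never emitted
}
```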
If I understand it correctly, all full scans on such tables will be affected, right? If we make specific fixes or improvements to Backfill, users may still find it confusing that batch scanning a temporal-filtered table is slow. This could be another reason to consider fixing it in a more general way... Found some materials that might be helpful:
I think this problem is unique to backfill, because backfill will refresh the snapshot every epoch, and in the tombstone case it will start from the beginning each time. Whereas for batch scanning, I think the scan can span multiple epochs, with no need to start from scratch. Edit: it could also affect any scan which has the same pattern as backfill: refreshing the scan on every epoch.
This just affects the backfill "stuck" issue. Regarding slow scans due to too many tombstones, you are right that optimizing solely within backfill cannot fix that.
Currently storage only exposes a user key iterator to streaming. On creation of the user key iterator, it will seek to the start key of the provided key range and call next to advance to the first non-tombstone key in storage.
Backfill uses the storage table iterator, the underlying of which is a user key iterator, to perform a per-vnode full-table-scan snapshot read on the upstream table. Under this implementation, when the upstream table happens to have a large number of tombstone keys, which is typically the case when we use a temporal filter to do TTL on the upstream table, iterator creation can be blocked until the first non-tombstone key is seen. This issue is significant when backfill gets started, since `vnode | unbounded` will be used as the start_key and the initial seek will not skip the tombstone keys.
Some ideas:
- Use `vnode | min_key` as the start_key when backfill begins. This can solve a subset of the issues when the upstream table performs sequential deletes (e.g. TTL).
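A small sketch of the difference between those two start keys, with a made-up key encoding (the real vnode-prefixed key serialization differs): passing `None` corresponds to `vnode | unbounded`, where the initial seek must walk over every leading tombstone, while supplying the smallest live key gives `vnode | min_key` and skips them in one jump.

```rust
/// Build a start key for a per-vnode scan (hypothetical encoding).
fn start_key(vnode: u16, min_key: Option<&[u8]>) -> Vec<u8> {
    let mut key = vnode.to_be_bytes().to_vec(); // vnode prefix
    if let Some(min_key) = min_key {
        key.extend_from_slice(min_key); // `vnode | min_key`
    }
    // With `None`, the key is the bare vnode prefix: `vnode | unbounded`,
    // so the seek lands on (and must iterate over) every leading tombstone.
    key
}

fn main() {
    assert_eq!(start_key(3, None), vec![0u8, 3]);

    let mut expected = vec![0u8, 3];
    expected.extend_from_slice(b"user#100");
    assert_eq!(start_key(3, Some(b"user#100".as_slice())), expected);
}
```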