Improve backfill on table with a large number of tombstones #12680

Closed
Tracked by #10400

hzxa21 opened this issue Oct 8, 2023 · 23 comments

@hzxa21
Collaborator

hzxa21 commented Oct 8, 2023

Currently, storage only exposes a user key iterator to streaming. On creation, the user key iterator seeks to the start key of the provided key range and calls next to advance to the next non-tombstone key in storage.

Backfill uses the storage table iterator, which is backed by a user key iterator, to perform a per-vnode full-table-scan snapshot read on the upstream table. Under this implementation, when the upstream table happens to have a large number of tombstone keys, which is typically the case when a temporal filter is used to implement TTL on the upstream table, iterator creation can be blocked until the first non-tombstone key is seen. The issue is most significant when backfill first starts, since vnode | unbounded is used as the start_key and the initial seek cannot skip over the tombstone keys.
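
To make the failure mode concrete, here is a minimal, self-contained Rust sketch (toy types, not the actual RisingWave storage API) of a user-key iterator whose construction has to walk past every leading tombstone before it can return:

```rust
/// A toy internal entry: `value == None` marks a tombstone (delete marker).
struct Entry {
    key: Vec<u8>,
    value: Option<Vec<u8>>,
}

/// Toy "internal key" iterator over raw storage entries, tombstones included.
struct InternalIter {
    entries: Vec<Entry>,
    pos: usize,
}

impl InternalIter {
    fn peek(&self) -> Option<&Entry> {
        self.entries.get(self.pos)
    }
    fn advance(&mut self) {
        self.pos += 1;
    }
}

/// Toy "user key" iterator: only ever exposes non-tombstone keys.
struct UserIter {
    inner: InternalIter,
}

impl UserIter {
    /// Creation seeks to the start key and then skips every tombstone. With
    /// millions of consecutive tombstones, this loop is where iterator creation
    /// spends its time before it is even usable by the caller.
    fn new(mut inner: InternalIter) -> Self {
        while matches!(inner.peek(), Some(e) if e.value.is_none()) {
            inner.advance();
        }
        Self { inner }
    }

    fn peek(&self) -> Option<(&[u8], &[u8])> {
        self.inner
            .peek()
            .and_then(|e| e.value.as_deref().map(|v| (e.key.as_slice(), v)))
    }
}

fn main() {
    // 1,000,000 tombstones followed by one live row: `UserIter::new` must walk
    // past all of them before it can return.
    let mut entries: Vec<Entry> = (0u32..1_000_000)
        .map(|i| Entry { key: i.to_be_bytes().to_vec(), value: None })
        .collect();
    entries.push(Entry { key: b"live".to_vec(), value: Some(b"row".to_vec()) });

    let iter = UserIter::new(InternalIter { entries, pos: 0 });
    println!("first visible key: {:?}", iter.peek().map(|(k, _)| k));
}
```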

Some ideas:

  • Storage exposes an "internal key" iterator to backfill so that backfill can use a tombstone key as the backfilled position. This ensures that backfill can make progress even though no data is returned.
  • Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).
  • Tune compaction to be more aggressive when the SST delete ratio is high. This can improve the situation but cannot solve it completely, since in order for compaction to clean up a tombstone key completely, a full compaction is needed. When the data volume is high, full compaction may not finish quickly even if it is triggered aggressively.
github-actions bot added this to the release-1.3 milestone Oct 8, 2023
@hzxa21
Collaborator Author

hzxa21 commented Oct 8, 2023

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

  1. Storage exposes an "internal key" iterator to backfill so that backfill can use a tombstone key as the backfilled position. This ensures that backfill can make progress even though no data is returned.

    This sounds possible. It will allow us to resume from the last backfill position after each barrier. Previously, with tombstones, I guess it would not.

  2. Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).

    Why a "subset" of issues? What issues can this not resolve? Actually, I slightly prefer this to 1. If there's a very high data workload, it could be possible that backfill cannot keep up. This seems to solve everything that 1 can, but better.

  3. Tune compaction to be more aggressive when the SST delete ratio is high. This can improve the situation but cannot solve it completely, since in order for compaction to clean up a tombstone key completely, a full compaction is needed. When the data volume is high, full compaction may not finish quickly even if it is triggered aggressively.

    3 seems complementary to 1 and 2.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

To elaborate more on why this is severe:

the iterator creation can be blocked until the first non-tombstone key is seen.

On each barrier, we create a new snapshot_read by default. If the iterator could not be created, we restart the process of iterator creation from scratch.

So backfill will basically be stuck until compaction cleans up enough tombstone records that iterator creation can succeed (which may take a long time, or may never happen if the data volume is too high).
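
A rough, non-async sketch of this stuck loop, assuming a hypothetical try_create_iter that stands in for snapshot_read creation and a fixed time budget that stands in for the barrier interval:

```rust
use std::time::{Duration, Instant};

/// Pretend iterator creation: skip `tombstones` keys, but bail out once the
/// barrier arrives (modeled here as a fixed time budget).
fn try_create_iter(tombstones: u64, budget: Duration) -> Result<(), u64> {
    let start = Instant::now();
    let mut skipped = 0;
    while skipped < tombstones {
        skipped += 1;
        if start.elapsed() > budget {
            // Barrier arrived: creation is abandoned and the skipped-tombstone
            // progress is lost, because the user-key iterator cannot report a
            // tombstone position back to the caller.
            return Err(skipped);
        }
    }
    Ok(())
}

fn main() {
    let tombstones = u64::MAX; // effectively "more than we can skip per barrier"
    let barrier_interval = Duration::from_secs(1);
    for epoch in 0..3 {
        match try_create_iter(tombstones, barrier_interval) {
            Ok(()) => println!("epoch {epoch}: iterator ready, backfill proceeds"),
            Err(skipped) => println!(
                "epoch {epoch}: barrier hit after skipping {skipped} tombstones; \
                 the next epoch restarts from the same start key"
            ),
        }
    }
}
```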

@hzxa21
Collaborator Author

hzxa21 commented Oct 9, 2023

  1. Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).
    Why a "subset" of issues? What issues can this not resolve? Actually, I slightly prefer this to 1. If there's a very high data workload, it could be possible that backfill cannot keep up. This seems to solve everything that 1 can, but better.

2 can only solve the issue under certain workloads where the tombstone keys accumulate at the beginning of the table, which is a subset of the issues. I prefer 1 too.

3 seems complementary to 1 and 2.

+1

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

2 can only solve the issue under certain workloads where the tombstone keys accumulate at the beginning of the table, which is a subset of the issues. I prefer 1 too.

I see, that can happen when we have multiple intervals of tombstones.
1 is the most general; we can have 2 and 3 as optimizations.

For 1, here's what I'm thinking for implementation:

  1. Within backfill, we pass the iter_tombstone option into the storage table's iterator.
  2. The storage table's iterator continues to do vnode-level iteration and merging, but it passes the iter_tombstone option to each vnode-level iterator as well.
  3. In the state store layer, we also pass iter_tombstone to its iterator; when set, store_create_iter should succeed immediately.
  4. The state store iterator should then also yield tombstone rows.

Within the backfill executor we have to be careful, though. I'm not sure if we need to treat a current_pos that has actually already been deleted specially. For now I can't think of any issues. cc @chenzl25.

If we do, perhaps we need some discriminant for whether a current_pos is a tombstone or not.
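
A rough sketch of how steps 1–4 could thread the flag through the layers. All names here (StorageTableReadOptions, StateStoreReadOptions, KeyedRow) are hypothetical placeholders, not the existing RisingWave types:

```rust
/// Options passed by the backfill executor to the storage table iterator.
#[derive(Clone, Copy, Default)]
struct StorageTableReadOptions {
    /// When true, tombstoned keys are yielded (with no value) instead of being
    /// skipped, so iterator creation can return immediately.
    iter_tombstone: bool,
}

/// Options at the state store layer; the storage table forwards the flag here
/// for every per-vnode iterator it opens before merging them.
#[derive(Clone, Copy, Default)]
struct StateStoreReadOptions {
    iter_tombstone: bool,
}

/// An item yielded by the tombstone-aware iterator: `value == None` is a tombstone.
struct KeyedRow {
    key: Vec<u8>,
    value: Option<Vec<u8>>,
}

/// Step 2: the storage table still iterates per vnode and merges, forwarding the flag.
fn storage_table_iter(vnodes: &[u16], opts: StorageTableReadOptions) -> Vec<KeyedRow> {
    vnodes
        .iter()
        .flat_map(|vnode| {
            state_store_iter(
                *vnode,
                StateStoreReadOptions { iter_tombstone: opts.iter_tombstone },
            )
        })
        .collect()
}

/// Steps 3 and 4: the state store either filters tombstones (today's behavior)
/// or yields them so the caller can track its position on them.
fn state_store_iter(_vnode: u16, opts: StateStoreReadOptions) -> Vec<KeyedRow> {
    let all = vec![
        KeyedRow { key: b"a".to_vec(), value: None },
        KeyedRow { key: b"b".to_vec(), value: Some(b"row".to_vec()) },
    ];
    if opts.iter_tombstone {
        all
    } else {
        all.into_iter().filter(|r| r.value.is_some()).collect()
    }
}

fn main() {
    // Step 1: backfill sets the flag when creating the storage table iterator.
    let rows = storage_table_iter(&[0, 1], StorageTableReadOptions { iter_tombstone: true });
    println!("yielded {} rows (tombstones included)", rows.len());
    // Even if every row were a tombstone, there is still a key to record as progress.
    println!("last key for progress: {:?}", rows.last().map(|r| &r.key));
}
```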

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

It seems that "letting the Backfill operator iterate on one vnode at a time" was not mentioned? I suppose this would be the simplest way to solve the problem. It doesn't skip the tombstones (so the interface can be kept as it is now), but there will be only one iterator open.

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

By the way, skipping tombstones is not specific to BackfillExecutor. Temporal filters / watermarks are common patterns in streaming, and any operator may encounter this problem. Thus, I don't like the idea:

Storage exposes an "internal key" iterator to backfill so that backfill can use a tombstone key as the backfilled position. This ensures that backfill can make progress even though no data is returned.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

It seems that "letting the Backfill operator iterate on one vnode at a time" was not mentioned? I suppose this would be the simplest way to solve the problem. It doesn't skip the tombstones (so the interface can be kept as it is now), but there will be only one iterator open.

Even if backfill iterates on one vnode at a time, it will still encounter the issue, because it cannot maintain the progress it made in the previous epoch.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

By the way, skipping tombstones is not specific to BackfillExecutor. Temporal filters / watermarks are common patterns in streaming, and any operator may encounter this problem. Thus, I don't like the idea:

Storage exposes an "internal key" iterator to backfill so that backfill can use a tombstone key as the backfilled position. This ensures that backfill can make progress even though no data is returned.

It can be exposed as a state store interface, and other relational layer interfaces can use it.

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

It can be exposed as a state store interface, and other relational layer interfaces can use it.

Even if backfill iterates on one vnode at a time, it will still encounter the issue, because it cannot maintain the progress it made in the previous epoch.

Yeah, I can understand your intention. But this feels somewhat dirty to me: the caller now needs to understand more about the implementation of storage, which makes it a bad abstraction.

I recommend we don't do this unless necessary. How about trying vnode-level iteration first? If that cannot solve the problem completely, we may try this afterwards.

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

Noting down the offline discussion with @kwannoel:

Even if we did vnode-level iteration, there can be a bad case: the create_iter() call lasts for more than 1 second, so it is interrupted; on the next barrier this recurs, and BackfillExecutor makes no progress at all.

To avoid this, one solution is to introduce something like a create_raw_iter() which can return tombstones as well, so that 1) the create_raw_iter() call can return immediately, and 2) the BackfillExecutor can know the progress of the iteration and continue from there on the next barrier.

Note that this approach is very specific to BackfillExecutor. Not sure how much complexity it will introduce on the storage side.

@hzxa21
Collaborator Author

hzxa21 commented Oct 9, 2023

Noting down the offline discussion with @kwannoel:

Even if we did vnode-level iteration, there can be a bad case: the create_iter() call lasts for more than 1 second, so it is interrupted; on the next barrier this recurs, and BackfillExecutor makes no progress at all.

To avoid this, one solution is to introduce something like a create_raw_iter() which can return tombstones as well, so that 1) the create_raw_iter() call can return immediately, and 2) the BackfillExecutor can know the progress of the iteration and continue from there on the next barrier.

Note that this approach is very specific to BackfillExecutor. Not sure how much complexity it will introduce on the storage side.

As long as we can settle on the interface, it is not hard for storage to expose a "raw iter", since the currently exposed user iterator is just a wrapper on top of the storage-internal "raw iter".
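
For discussion, a sketch of what settling on such an interface could look like. StateStoreRead, create_iter and create_raw_iter are hypothetical names here, not the real state store trait:

```rust
/// `value == None` marks a tombstone.
type RawEntry = (Vec<u8>, Option<Vec<u8>>);

#[allow(dead_code)]
trait StateStoreRead {
    /// Existing behavior: creation skips tombstones and yields only live rows.
    fn create_iter(&self, start_key: &[u8]) -> Box<dyn Iterator<Item = (Vec<u8>, Vec<u8>)> + '_>;

    /// Proposed addition: returns immediately and yields tombstones too, so the
    /// caller (e.g. BackfillExecutor) can persist its position even when no live
    /// data comes back before the next barrier.
    fn create_raw_iter(&self, start_key: &[u8]) -> Box<dyn Iterator<Item = RawEntry> + '_>;
}

/// The existing user iterator can stay a thin wrapper over the raw one, mirroring
/// the observation that it is just a tombstone-skipping wrapper today.
fn user_iter_from_raw<'a>(
    raw: Box<dyn Iterator<Item = RawEntry> + 'a>,
) -> Box<dyn Iterator<Item = (Vec<u8>, Vec<u8>)> + 'a> {
    Box::new(raw.filter_map(|(key, value)| value.map(|v| (key, v))))
}

fn main() {
    let raw: Box<dyn Iterator<Item = RawEntry>> = Box::new(
        vec![
            (b"a".to_vec(), None),                // tombstone
            (b"b".to_vec(), Some(b"x".to_vec())), // live row
        ]
        .into_iter(),
    );
    let visible: Vec<_> = user_iter_from_raw(raw).collect();
    assert_eq!(visible.len(), 1); // only the live row survives the wrapper
    println!("{visible:?}");
}
```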

@BugenZhao
Member

  • Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).

Could this be a general optimization for iterating the storage? By maintaining the min key of a table (or even per vnode prefix), a full scan can be converted to a range scan in the storage itself to skip those tombstone keys. 🤔 So the application layer does not have to care about this.
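
A sketch of that idea with hypothetical names (TableStats, min_live_key): the statistic rewrites an unbounded start bound into a range scan inside storage. Note it only helps when the tombstones form a prefix, e.g. sequential deletes from TTL, not when they sit after the first live key:

```rust
use std::collections::HashMap;
use std::ops::Bound;

/// Per-vnode statistic: the smallest key that is known to still be live.
struct TableStats {
    min_live_key: HashMap<u16, Vec<u8>>,
}

/// Rewrite an unbounded start bound into a bounded one using the statistic, so a
/// full scan becomes a range scan that skips the leading tombstones.
fn rewrite_start_bound(stats: &TableStats, vnode: u16, start: Bound<Vec<u8>>) -> Bound<Vec<u8>> {
    match start {
        Bound::Unbounded => match stats.min_live_key.get(&vnode) {
            Some(min_key) => Bound::Included(min_key.clone()),
            None => Bound::Unbounded,
        },
        other => other, // an explicit bound from the caller is kept as-is
    }
}

fn main() {
    let stats = TableStats {
        min_live_key: HashMap::from([(0u16, b"key_900000".to_vec())]),
    };
    // Backfill's initial scan on vnode 0 uses `vnode | unbounded`; the statistic
    // turns it into a range scan starting at the first live key.
    let bound = rewrite_start_bound(&stats, 0, Bound::Unbounded);
    if let Bound::Included(k) = &bound {
        println!("start scan from {}", String::from_utf8_lossy(k));
    }
}
```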

@chenzl25
Contributor

chenzl25 commented Oct 9, 2023

If we have an upstream MV and there are lots of deletes against it (especially from a temporal filter), these deletes will generate lots of tombstones in SSTs. Our compaction currently only actually deletes those tombstones in the last level. Think about a temporal filter that only keeps the latest one day of data: it is highly possible that 90% of the data is tombstones. This case is very similar to an issue we solved before, i.e. previously we needed to collect at least one chunk to let backfill proceed; now we need to collect at least one row to let backfill proceed. I think we can construct this case by creating a large table, deleting all rows from it, and finally creating an MV on it to backfill. It seems we can only rely on the compactor to compact these tombstones, or expose an internal key to let backfill proceed.

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).

and

Could this be a general optimization for iterating the storage? By maintaining the min key of a table (or even per vnode prefix), a full scan can be converted to a range scan in the storage itself to skip those tombstone keys. 🤔 So the application layer does not have to care about this.

I like the idea, but this can only solve the case where the tombstones sit at the start. For example, in a worst case like a T T T T .... T T T b c d ... (T for tombstone), it won't work.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

#12702 Reproduced the scenario.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

Something to note for approach 2 (exposing tombstone records) is that the "tombstone" key needs to be treated like an actual key when dealing with updates.

Consider that the backfill position at the end of an epoch is 4, and corresponds to a tombstone record.

If some updates are produced which are less than or equal to this tombstone record, they must still be flushed downstream.

Otherwise, when backfill resumes in the next epoch, we will lose these updates. Although they are part of the snapshot in the next epoch, our backfill starts from >4 and will not see these updates in the snapshot (since they are <= 4).
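
A small sketch of this rule with hypothetical types (not the real BackfillExecutor code): whether current_pos refers to a tombstone does not matter for the forwarding decision; only its ordering relative to the upstream update's key does:

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Pk(u64);

/// The persisted backfill progress. Whether the key was a tombstone in the
/// snapshot is irrelevant here; only the ordering matters.
struct BackfillProgress {
    current_pos: Option<Pk>,
}

impl BackfillProgress {
    /// Upstream updates with pk <= current_pos are already behind the backfill
    /// frontier and will never be seen by a later snapshot read (which starts
    /// strictly after current_pos), so they must be yielded downstream now.
    /// Updates beyond the frontier are dropped; the snapshot delivers them later.
    fn should_forward(&self, update_pk: Pk) -> bool {
        match self.current_pos {
            Some(pos) => update_pk <= pos,
            None => false, // nothing backfilled yet; the snapshot covers everything
        }
    }
}

fn main() {
    // current_pos = 4 and, in this scenario, key 4 is a tombstone in the snapshot.
    let progress = BackfillProgress { current_pos: Some(Pk(4)) };

    assert!(progress.should_forward(Pk(3)));  // behind the frontier: forward it
    assert!(progress.should_forward(Pk(4)));  // equal to the (tombstone) frontier: forward it
    assert!(!progress.should_forward(Pk(7))); // ahead: next epoch's snapshot covers it
    println!("a tombstone position is treated like any other backfilled position");
}
```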

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

Noting down the offline discussion with @kwannoel:
Even if we did vnode-level iteration, there can be a bad case: the create_iter() call lasts for more than 1 second, so it is interrupted; on the next barrier this recurs, and BackfillExecutor makes no progress at all.
To avoid this, one solution is to introduce something like a create_raw_iter() which can return tombstones as well, so that 1) the create_raw_iter() call can return immediately, and 2) the BackfillExecutor can know the progress of the iteration and continue from there on the next barrier.
Note that this approach is very specific to BackfillExecutor. Not sure how much complexity it will introduce on the storage side.

As long as we can settle on the interface, it is not hard for storage to expose a "raw iter", since the currently exposed user iterator is just a wrapper on top of the storage-internal "raw iter".

Here's a sample interface for the relational layer: #12708.

Not sure what to modify for the state store layer. We can use this to discuss the interface concretely.

Left comments on the relevant changes.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

Actually... thinking about it further, there are also additional complexities within backfill. Now each chunk will be larger and will contain tombstone rows, which should not be yielded downstream.

I guess we can mark those rows as invisible in the iterator, and compact them before yielding downstream.

But this is something we should take note of.
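
A sketch of that approach with toy types (not RisingWave's StreamChunk): tombstone rows ride along in the chunk so the position can advance past them, but they are marked invisible and compacted away before the chunk is yielded downstream:

```rust
struct Row {
    key: u64,
    value: Option<String>, // `None` marks a tombstone row
}

struct Chunk {
    rows: Vec<Row>,
    visibility: Vec<bool>,
}

impl Chunk {
    /// Build a chunk from a tombstone-aware scan: every row is kept so the last
    /// key can become the new current_pos, but only live rows are visible.
    fn from_scan(rows: Vec<Row>) -> Self {
        let visibility = rows.iter().map(|r| r.value.is_some()).collect();
        Self { rows, visibility }
    }

    /// The key to record as backfill progress, tombstone or not.
    fn last_key(&self) -> Option<u64> {
        self.rows.last().map(|r| r.key)
    }

    /// Drop invisible rows before yielding downstream, so downstream never sees
    /// tombstones and the chunk shrinks back to its normal size.
    fn compact(self) -> Self {
        let rows: Vec<Row> = self
            .rows
            .into_iter()
            .zip(self.visibility)
            .filter_map(|(row, visible)| visible.then_some(row))
            .collect();
        let visibility = vec![true; rows.len()];
        Self { rows, visibility }
    }
}

fn main() {
    let chunk = Chunk::from_scan(vec![
        Row { key: 1, value: None },
        Row { key: 2, value: Some("a".into()) },
        Row { key: 3, value: None },
    ]);
    let new_pos = chunk.last_key(); // Some(3): progress advances past the trailing tombstone
    let downstream = chunk.compact(); // only key 2 survives
    println!("current_pos = {new_pos:?}, rows yielded downstream = {}", downstream.rows.len());
}
```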

@BugenZhao
Member

BugenZhao commented Oct 10, 2023

If I understand it correctly, all full scans on such tables will be affected, right? If we make specific fixes or improvements to Backfill, users may still find it confusing that batch scanning a temporal-filtered table is slow. This could be another reason to consider fixing it in a more general way...

Found some materials that might be helpful:

* http://smalldatum.blogspot.com/2020/01/deletes-are-fast-and-slow-in-lsm.html

@kwannoel
Contributor

kwannoel commented Oct 11, 2023

If I understand it correctly, all full scans on such tables will be affected, right? If we make specific fixes or improvements to Backfill, users may still find it confusing that batch scanning a temporal-filtered table is slow. This could be another reason to consider fixing it in a more general way...

Found some materials that might be helpful:

* http://smalldatum.blogspot.com/2020/01/deletes-are-fast-and-slow-in-lsm.html

I think this problem is unique to backfill, because backfill refreshes its snapshot every epoch, and in the tombstone case it will start from the beginning each time. Whereas for batch scanning, I think the scan can span multiple epochs, so there is no need to start from scratch.

Edit: It could also affect any scan with the same pattern as backfill: refreshing the scan on each epoch.

@kwannoel
Contributor

If I understand it correctly, all full scans on such tables will be affected, right? If we make specific fixes or improvements to Backfill, users may still find it confusing that batch scanning a temporal-filtered table is slow. This could be another reason to consider fixing it in a more general way...
Found some materials that might be helpful:

* http://smalldatum.blogspot.com/2020/01/deletes-are-fast-and-slow-in-lsm.html

I think this problem is unique to backfill, because backfill refreshes its snapshot every epoch, and in the tombstone case it will start from the beginning each time. Whereas for batch scanning, I think the scan can span multiple epochs, so there is no need to start from scratch.

Edit: It could also affect any scan with the same pattern as backfill: refreshing the scan on each epoch.

This just affects the backfill "stuck" issue. Regarding slow scans due to too many tombstones, you are right that optimizing solely within backfill cannot fix that.

@kwannoel
Contributor

Closed by #12780 for the backfill executor.

#12776 and #12670 for compaction.
