Improve backfill on table with a large number of tombstones #12680

Closed
Tracked by #10400

hzxa21 opened this issue Oct 8, 2023 · 23 comments

@hzxa21
Collaborator

hzxa21 commented Oct 8, 2023

Currently, storage only exposes a user key iterator to streaming. On creation, the user key iterator seeks to the start key of the provided key range and calls next to advance to the next non-tombstone key in storage.

Backfill uses the storage table iterator, which is backed by a user key iterator, to perform a per-vnode full-table-scan snapshot read on the upstream table. Under this implementation, when the upstream table happens to have a large number of tombstone keys, which is typically the case when a temporal filter is used to implement TTL on the upstream table, iterator creation can be blocked until the first non-tombstone key is seen. The issue is most significant when backfill first starts, since vnode | unbounded is used as the start_key and the initial seek cannot skip over the tombstone keys.
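
To make the failure mode concrete, here is a minimal, self-contained Rust sketch (toy types, not the actual RisingWave storage API) of a user-key iterator whose construction has to walk past every leading tombstone before it can return:

```rust
/// A toy internal entry: `value == None` marks a tombstone (delete marker).
struct Entry {
    key: Vec<u8>,
    value: Option<Vec<u8>>,
}

/// Toy "internal key" iterator over raw storage entries, tombstones included.
struct InternalIter {
    entries: Vec<Entry>,
    pos: usize,
}

impl InternalIter {
    fn peek(&self) -> Option<&Entry> {
        self.entries.get(self.pos)
    }
    fn advance(&mut self) {
        self.pos += 1;
    }
}

/// Toy "user key" iterator: only ever exposes non-tombstone keys.
struct UserIter {
    inner: InternalIter,
}

impl UserIter {
    /// Creation seeks to the start key and then skips every tombstone. With
    /// millions of consecutive tombstones, this loop is where iterator creation
    /// spends its time before it is even usable by the caller.
    fn new(mut inner: InternalIter) -> Self {
        while matches!(inner.peek(), Some(e) if e.value.is_none()) {
            inner.advance();
        }
        Self { inner }
    }

    fn peek(&self) -> Option<(&[u8], &[u8])> {
        self.inner
            .peek()
            .and_then(|e| e.value.as_deref().map(|v| (e.key.as_slice(), v)))
    }
}

fn main() {
    // 1,000,000 tombstones followed by one live row: `UserIter::new` must walk
    // past all of them before it can return.
    let mut entries: Vec<Entry> = (0u32..1_000_000)
        .map(|i| Entry { key: i.to_be_bytes().to_vec(), value: None })
        .collect();
    entries.push(Entry { key: b"live".to_vec(), value: Some(b"row".to_vec()) });

    let iter = UserIter::new(InternalIter { entries, pos: 0 });
    println!("first visible key: {:?}", iter.peek().map(|(k, _)| k));
}
```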

Some ideas:

  • Storage exposes an "internal key" iterator to backfill so that backfill can use a tombstone key as the backfilled position. This ensures that backfill can make progress even though no data is returned.
  • Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).
  • Tune compaction to be more aggressive when the SST delete ratio is high. This can improve the situation but cannot solve it completely, since in order for compaction to clean up a tombstone key completely, a full compaction is needed. When the data volume is high, full compaction may not finish quickly even if it is triggered aggressively.
github-actions bot added this to the release-1.3 milestone Oct 8, 2023
@hzxa21
Collaborator Author

hzxa21 commented Oct 8, 2023

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

  1. Storage exposes an "internal key" iterator to backfill so that backfill can use a tombstone key as the backfilled position. This ensures that backfill can make progress even though no data is returned.

    This sounds possible. It will allow us to resume from the last backfill position after each barrier. Previously, with tombstones, I guess it would not.

  2. Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).

    Why a "subset" of issues? What issues can this not resolve? Actually, I slightly prefer this to 1. If there's a very high data workload, it could be possible that backfill cannot keep up. This seems to solve everything that 1 can, but better.

  3. Tune compaction to be more aggressive when the SST delete ratio is high. This can improve the situation but cannot solve it completely, since in order for compaction to clean up a tombstone key completely, a full compaction is needed. When the data volume is high, full compaction may not finish quickly even if it is triggered aggressively.

    3 seems complementary to 1 and 2.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

To elaborate more on why this is severe:

the iterator creation can be blocked until the first non-tombstone key is seen.

On each barrier, we create a new snapshot_read by default. If the iterator could not be created, we restart the process of iterator creation from scratch.

So backfill will basically be stuck until compaction cleans up enough tombstone records that iterator creation can succeed (which may take a long time, or may never happen if the data volume is too high).
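
A rough, non-async sketch of this stuck loop, assuming a hypothetical try_create_iter that stands in for snapshot_read creation and a fixed time budget that stands in for the barrier interval:

```rust
use std::time::{Duration, Instant};

/// Pretend iterator creation: skip `tombstones` keys, but bail out once the
/// barrier arrives (modeled here as a fixed time budget).
fn try_create_iter(tombstones: u64, budget: Duration) -> Result<(), u64> {
    let start = Instant::now();
    let mut skipped = 0;
    while skipped < tombstones {
        skipped += 1;
        if start.elapsed() > budget {
            // Barrier arrived: creation is abandoned and the skipped-tombstone
            // progress is lost, because the user-key iterator cannot report a
            // tombstone position back to the caller.
            return Err(skipped);
        }
    }
    Ok(())
}

fn main() {
    let tombstones = u64::MAX; // effectively "more than we can skip per barrier"
    let barrier_interval = Duration::from_secs(1);
    for epoch in 0..3 {
        match try_create_iter(tombstones, barrier_interval) {
            Ok(()) => println!("epoch {epoch}: iterator ready, backfill proceeds"),
            Err(skipped) => println!(
                "epoch {epoch}: barrier hit after skipping {skipped} tombstones; \
                 the next epoch restarts from the same start key"
            ),
        }
    }
}
```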

@hzxa21
Collaborator Author

hzxa21 commented Oct 9, 2023

  1. Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).
    Why a "subset" of issues? What issues can this not resolve? Actually, I slightly prefer this to 1. If there's a very high data workload, it could be possible that backfill cannot keep up. This seems to solve everything that 1 can, but better.

2 can only solve the issue under certain workloads where the tombstone keys accumulate at the beginning of the table, which is a subset of the issues. I prefer 1 too.

3 seems complementary to 1 and 2.

+1

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

2 can only solve the issue under certain workloads where the tombstone keys accumulate at the beginning of the table, which is a subset of the issues. I prefer 1 too.

I see, that can happen when we have multiple intervals of tombstones.
1 is the most general; we can have 2 and 3 as optimizations.

For 1, here's what I'm thinking for implementation:

  1. Within backfill, we pass the iter_tombstone option into the storage table's iterator.
  2. The storage table's iterator continues to do vnode-level iteration and merging, but it passes the iter_tombstone option to each vnode-level iterator as well.
  3. In the state store layer, we also pass iter_tombstone to its iterator; when set, store_create_iter should succeed immediately.
  4. The state store iterator should then also yield tombstone rows.

Within the backfill executor we have to be careful, though. I'm not sure if we need to treat a current_pos that has actually already been deleted specially. For now I can't think of any issues. cc @chenzl25.

If we do, perhaps we need some discriminant for whether a current_pos is a tombstone or not.
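
A rough sketch of how steps 1–4 could thread the flag through the layers. All names here (StorageTableReadOptions, StateStoreReadOptions, KeyedRow) are hypothetical placeholders, not the existing RisingWave types:

```rust
/// Options passed by the backfill executor to the storage table iterator.
#[derive(Clone, Copy, Default)]
struct StorageTableReadOptions {
    /// When true, tombstoned keys are yielded (with no value) instead of being
    /// skipped, so iterator creation can return immediately.
    iter_tombstone: bool,
}

/// Options at the state store layer; the storage table forwards the flag here
/// for every per-vnode iterator it opens before merging them.
#[derive(Clone, Copy, Default)]
struct StateStoreReadOptions {
    iter_tombstone: bool,
}

/// An item yielded by the tombstone-aware iterator: `value == None` is a tombstone.
struct KeyedRow {
    key: Vec<u8>,
    value: Option<Vec<u8>>,
}

/// Step 2: the storage table still iterates per vnode and merges, forwarding the flag.
fn storage_table_iter(vnodes: &[u16], opts: StorageTableReadOptions) -> Vec<KeyedRow> {
    vnodes
        .iter()
        .flat_map(|vnode| {
            state_store_iter(
                *vnode,
                StateStoreReadOptions { iter_tombstone: opts.iter_tombstone },
            )
        })
        .collect()
}

/// Steps 3 and 4: the state store either filters tombstones (today's behavior)
/// or yields them so the caller can track its position on them.
fn state_store_iter(_vnode: u16, opts: StateStoreReadOptions) -> Vec<KeyedRow> {
    let all = vec![
        KeyedRow { key: b"a".to_vec(), value: None },
        KeyedRow { key: b"b".to_vec(), value: Some(b"row".to_vec()) },
    ];
    if opts.iter_tombstone {
        all
    } else {
        all.into_iter().filter(|r| r.value.is_some()).collect()
    }
}

fn main() {
    // Step 1: backfill sets the flag when creating the storage table iterator.
    let rows = storage_table_iter(&[0, 1], StorageTableReadOptions { iter_tombstone: true });
    println!("yielded {} rows (tombstones included)", rows.len());
    // Even if every row were a tombstone, there is still a key to record as progress.
    println!("last key for progress: {:?}", rows.last().map(|r| &r.key));
}
```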

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

It seems that "letting the Backfill operator iterate on one vnode at a time" was not mentioned? I suppose this would be the simplest way to solve the problem. It doesn't skip the tombstones (so the interface can be kept as it is now), but there will be only one iterator open.

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

By the way, skipping tombstones is not specific to BackfillExecutor. Temporal filters / watermarks are common patterns in streaming, and any operator may encounter this problem. Thus, I don't like the idea:

Storage exposes an "internal key" iterator to backfill so that backfill can use a tombstone key as the backfilled position. This ensures that backfill can make progress even though no data is returned.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

It seems that "letting the Backfill operator iterate on one vnode at a time" was not mentioned? I suppose this would be the simplest way to solve the problem. It doesn't skip the tombstones (so the interface can be kept as it is now), but there will be only one iterator open.

Even if backfill iterates on one vnode at a time, it will still encounter the issue, because it cannot maintain the progress it made in the previous epoch.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

By the way, skipping tombstones is not specific to BackfillExecutor. Temporal filters / watermarks are common patterns in streaming, and any operator may encounter this problem. Thus, I don't like the idea:

Storage exposes an "internal key" iterator to backfill so that backfill can use a tombstone key as the backfilled position. This ensures that backfill can make progress even though no data is returned.

It can be exposed as a state store interface, and other relational layer interfaces can use it.

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

It can be exposed as a state store interface, and other relational layer interfaces can use it.

Even if backfill iterates on one vnode at a time, it will still encounter the issue, because it cannot maintain the progress it made in the previous epoch.

Yeah, I can understand your intention. But this feels somewhat dirty to me: the caller now needs to understand more about the implementation of storage, which makes it a bad abstraction.

I recommend we don't do this unless necessary. How about trying vnode-level iteration first? If that cannot solve the problem completely, we may try this afterwards.

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

Noting down the offline discussion with @kwannoel:

Even if we did vnode-level iteration, there can be a bad case: the create_iter() call lasts for more than 1 second, so it is interrupted; on the next barrier this recurs, and BackfillExecutor makes no progress at all.

To avoid this, one solution is to introduce something like a create_raw_iter() which can return tombstones as well, so that 1) the create_raw_iter() call can return immediately, and 2) the BackfillExecutor can know the progress of the iteration and continue from there on the next barrier.

Note that this approach is very specific to BackfillExecutor. Not sure how much complexity it will introduce on the storage side.

@hzxa21
Collaborator Author

hzxa21 commented Oct 9, 2023

Noting down the offline discussion with @kwannoel:

Even if we did vnode-level iteration, there can be a bad case: the create_iter() call lasts for more than 1 second, so it is interrupted; on the next barrier this recurs, and BackfillExecutor makes no progress at all.

To avoid this, one solution is to introduce something like a create_raw_iter() which can return tombstones as well, so that 1) the create_raw_iter() call can return immediately, and 2) the BackfillExecutor can know the progress of the iteration and continue from there on the next barrier.

Note that this approach is very specific to BackfillExecutor. Not sure how much complexity it will introduce on the storage side.

As long as we can settle on the interface, it is not hard for storage to expose a "raw iter", since the currently exposed user iterator is just a wrapper on top of the storage-internal "raw iter".
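
For discussion, a sketch of what settling on such an interface could look like. StateStoreRead, create_iter and create_raw_iter are hypothetical names here, not the real state store trait:

```rust
/// `value == None` marks a tombstone.
type RawEntry = (Vec<u8>, Option<Vec<u8>>);

#[allow(dead_code)]
trait StateStoreRead {
    /// Existing behavior: creation skips tombstones and yields only live rows.
    fn create_iter(&self, start_key: &[u8]) -> Box<dyn Iterator<Item = (Vec<u8>, Vec<u8>)> + '_>;

    /// Proposed addition: returns immediately and yields tombstones too, so the
    /// caller (e.g. BackfillExecutor) can persist its position even when no live
    /// data comes back before the next barrier.
    fn create_raw_iter(&self, start_key: &[u8]) -> Box<dyn Iterator<Item = RawEntry> + '_>;
}

/// The existing user iterator can stay a thin wrapper over the raw one, mirroring
/// the observation that it is just a tombstone-skipping wrapper today.
fn user_iter_from_raw<'a>(
    raw: Box<dyn Iterator<Item = RawEntry> + 'a>,
) -> Box<dyn Iterator<Item = (Vec<u8>, Vec<u8>)> + 'a> {
    Box::new(raw.filter_map(|(key, value)| value.map(|v| (key, v))))
}

fn main() {
    let raw: Box<dyn Iterator<Item = RawEntry>> = Box::new(
        vec![
            (b"a".to_vec(), None),                // tombstone
            (b"b".to_vec(), Some(b"x".to_vec())), // live row
        ]
        .into_iter(),
    );
    let visible: Vec<_> = user_iter_from_raw(raw).collect();
    assert_eq!(visible.len(), 1); // only the live row survives the wrapper
    println!("{visible:?}");
}
```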

@BugenZhao
Member

  • Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).

Could this be a general optimization for iterating the storage? By maintaining the min key of a table (or even per vnode prefix), a full scan can be converted to a range scan in the storage itself to skip those tombstone keys. 🤔 So the application layer does not have to care about this.
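
A sketch of that idea with hypothetical names (TableStats, min_live_key): the statistic rewrites an unbounded start bound into a range scan inside storage. Note it only helps when the tombstones form a prefix, e.g. sequential deletes from TTL, not when they sit after the first live key:

```rust
use std::collections::HashMap;
use std::ops::Bound;

/// Per-vnode statistic: the smallest key that is known to still be live.
struct TableStats {
    min_live_key: HashMap<u16, Vec<u8>>,
}

/// Rewrite an unbounded start bound into a bounded one using the statistic, so a
/// full scan becomes a range scan that skips the leading tombstones.
fn rewrite_start_bound(stats: &TableStats, vnode: u16, start: Bound<Vec<u8>>) -> Bound<Vec<u8>> {
    match start {
        Bound::Unbounded => match stats.min_live_key.get(&vnode) {
            Some(min_key) => Bound::Included(min_key.clone()),
            None => Bound::Unbounded,
        },
        other => other, // an explicit bound from the caller is kept as-is
    }
}

fn main() {
    let stats = TableStats {
        min_live_key: HashMap::from([(0u16, b"key_900000".to_vec())]),
    };
    // Backfill's initial scan on vnode 0 uses `vnode | unbounded`; the statistic
    // turns it into a range scan starting at the first live key.
    let bound = rewrite_start_bound(&stats, 0, Bound::Unbounded);
    if let Bound::Included(k) = &bound {
        println!("start scan from {}", String::from_utf8_lossy(k));
    }
}
```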

@chenzl25
Contributor

chenzl25 commented Oct 9, 2023

If we have an upstream MV and there are lots of deletes against it (especially from a temporal filter), these deletes will generate lots of tombstones in SSTs. Our compaction currently only actually deletes those tombstones in the last level. Think about a temporal filter that only keeps the latest one day of data: it is highly possible that 90% of the data is tombstones. This case is very similar to an issue we solved before, i.e. previously we needed to collect at least one chunk to let backfill proceed; now we need to collect at least one row to let backfill proceed. I think we can construct this case by creating a large table, deleting all rows from it, and finally creating an MV on it to backfill. It seems we can only rely on the compactor to compact these tombstones, or expose an internal key to let backfill proceed.

@fuyufjh
Member

fuyufjh commented Oct 9, 2023

Maintain table statistics to guide backfill. A simple statistic is the per-table min/max key. Backfill can use vnode | min_key as the start_key when it begins. This can solve a subset of the issues, namely when the upstream table performs sequential deletes (e.g. TTL).

and

Could this be a general optimization for iterating the storage? By maintaining the min key of a table (or even per vnode prefix), a full scan can be converted to a range scan in the storage itself to skip those tombstone keys. 🤔 So the application layer does not have to care about this.

I like the idea, but this can only solve the case where the tombstones sit at the start. For example, in a worst case like a T T T T .... T T T b c d ... (T for tombstone), it won't work.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

#12702 Reproduced the scenario.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

Something to note for approach 2 (exposing tombstone records) is that the "tombstone" key needs to be treated like an actual key when dealing with updates.

Consider that the backfill position at the end of an epoch is 4, and corresponds to a tombstone record.

If some updates are produced which are less than or equal to this tombstone record, they must still be flushed downstream.

Otherwise, when backfill resumes in the next epoch, we will lose these updates. Although they are part of the snapshot in the next epoch, our backfill starts from >4 and will not see these updates in the snapshot (since they are <= 4).
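
A small sketch of this rule with hypothetical types (not the real BackfillExecutor code): whether current_pos refers to a tombstone does not matter for the forwarding decision; only its ordering relative to the upstream update's key does:

```rust
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Pk(u64);

/// The persisted backfill progress. Whether the key was a tombstone in the
/// snapshot is irrelevant here; only the ordering matters.
struct BackfillProgress {
    current_pos: Option<Pk>,
}

impl BackfillProgress {
    /// Upstream updates with pk <= current_pos are already behind the backfill
    /// frontier and will never be seen by a later snapshot read (which starts
    /// strictly after current_pos), so they must be yielded downstream now.
    /// Updates beyond the frontier are dropped; the snapshot delivers them later.
    fn should_forward(&self, update_pk: Pk) -> bool {
        match self.current_pos {
            Some(pos) => update_pk <= pos,
            None => false, // nothing backfilled yet; the snapshot covers everything
        }
    }
}

fn main() {
    // current_pos = 4 and, in this scenario, key 4 is a tombstone in the snapshot.
    let progress = BackfillProgress { current_pos: Some(Pk(4)) };

    assert!(progress.should_forward(Pk(3)));  // behind the frontier: forward it
    assert!(progress.should_forward(Pk(4)));  // equal to the (tombstone) frontier: forward it
    assert!(!progress.should_forward(Pk(7))); // ahead: next epoch's snapshot covers it
    println!("a tombstone position is treated like any other backfilled position");
}
```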

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

Noting down the offline discussion with @kwannoel:
Even if we did vnode-level iteration, there can be a bad case: the create_iter() call lasts for more than 1 second, so it is interrupted; on the next barrier this recurs, and BackfillExecutor makes no progress at all.
To avoid this, one solution is to introduce something like a create_raw_iter() which can return tombstones as well, so that 1) the create_raw_iter() call can return immediately, and 2) the BackfillExecutor can know the progress of the iteration and continue from there on the next barrier.
Note that this approach is very specific to BackfillExecutor. Not sure how much complexity it will introduce on the storage side.

As long as we can settle on the interface, it is not hard for storage to expose a "raw iter", since the currently exposed user iterator is just a wrapper on top of the storage-internal "raw iter".

Here's a sample interface for the relational layer: #12708.

Not sure what to modify for the state store layer. We can use this to discuss the interface concretely.

Left comments on the relevant changes.

@kwannoel
Contributor

kwannoel commented Oct 9, 2023

Actually... thinking about it further, there are also additional complexities within backfill. Now each chunk will be larger and will contain tombstone rows, which should not be yielded downstream.

I guess we can mark those rows as invisible in the iterator, and compact them before yielding downstream.

But this is something we should take note of.
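
A sketch of that approach with toy types (not RisingWave's StreamChunk): tombstone rows ride along in the chunk so the position can advance past them, but they are marked invisible and compacted away before the chunk is yielded downstream:

```rust
struct Row {
    key: u64,
    value: Option<String>, // `None` marks a tombstone row
}

struct Chunk {
    rows: Vec<Row>,
    visibility: Vec<bool>,
}

impl Chunk {
    /// Build a chunk from a tombstone-aware scan: every row is kept so the last
    /// key can become the new current_pos, but only live rows are visible.
    fn from_scan(rows: Vec<Row>) -> Self {
        let visibility = rows.iter().map(|r| r.value.is_some()).collect();
        Self { rows, visibility }
    }

    /// The key to record as backfill progress, tombstone or not.
    fn last_key(&self) -> Option<u64> {
        self.rows.last().map(|r| r.key)
    }

    /// Drop invisible rows before yielding downstream, so downstream never sees
    /// tombstones and the chunk shrinks back to its normal size.
    fn compact(self) -> Self {
        let rows: Vec<Row> = self
            .rows
            .into_iter()
            .zip(self.visibility)
            .filter_map(|(row, visible)| visible.then_some(row))
            .collect();
        let visibility = vec![true; rows.len()];
        Self { rows, visibility }
    }
}

fn main() {
    let chunk = Chunk::from_scan(vec![
        Row { key: 1, value: None },
        Row { key: 2, value: Some("a".into()) },
        Row { key: 3, value: None },
    ]);
    let new_pos = chunk.last_key(); // Some(3): progress advances past the trailing tombstone
    let downstream = chunk.compact(); // only key 2 survives
    println!("current_pos = {new_pos:?}, rows yielded downstream = {}", downstream.rows.len());
}
```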

@BugenZhao
Member

BugenZhao commented Oct 10, 2023

If I understand it correctly, all full scans on such tables will be affected, right? If we make specific fixes or improvements to Backfill, users may still find it confusing that batch scanning a temporal-filtered table is slow. This could be another reason to consider fixing it in a more general way...

Found some materials that might be helpful:

* http://smalldatum.blogspot.com/2020/01/deletes-are-fast-and-slow-in-lsm.html

@kwannoel
Contributor

kwannoel commented Oct 11, 2023

If I understand it correctly, all full scans on such tables will be affected, right? If we make specific fixes or improvements to Backfill, users may still find it confusing that batch scanning a temporal-filtered table is slow. This could be another reason to consider fixing it in a more general way...

Found some materials that might be helpful:

* http://smalldatum.blogspot.com/2020/01/deletes-are-fast-and-slow-in-lsm.html

I think this problem is unique to backfill, because backfill refreshes its snapshot every epoch, and in the tombstone case it will start from the beginning each time. Whereas for batch scanning, I think the scan can span multiple epochs, so there is no need to start from scratch.

Edit: It could also affect any scan with the same pattern as backfill: refreshing the scan on each epoch.

@kwannoel
Contributor

If I understand it correctly, all full scans on such tables will be affected, right? If we make specific fixes or improvements to Backfill, users may still find it confusing that batch scanning a temporal-filtered table is slow. This could be another reason to consider fixing it in a more general way...
Found some materials that might be helpful:

* http://smalldatum.blogspot.com/2020/01/deletes-are-fast-and-slow-in-lsm.html

I think this problem is unique to backfill, because backfill refreshes its snapshot every epoch, and in the tombstone case it will start from the beginning each time. Whereas for batch scanning, I think the scan can span multiple epochs, so there is no need to start from scratch.

Edit: It could also affect any scan with the same pattern as backfill: refreshing the scan on each epoch.

This just affects the backfill "stuck" issue. Regarding slow scans due to too many tombstones, you are right that optimizing solely within backfill cannot fix that.

@kwannoel
Contributor

Closed by #12780 for the backfill executor.

#12776 and #12670 for compaction.
