For streaming systems (or batch systems that run at high frequency) that write data into Delta tables, accumulating lots of small files is a common problem. In many cases auto compaction is not practical, for example:
Auto compaction is not available in Delta Lake before 3.1.0
Auto compaction might not be well supported outside Spark
One way to solve this is to have a separate process that regularly optimizes these Delta tables. However, it's not a good idea to optimize the entire table on every run without any constraint. A few example reasons:
While in theory optimize is a no-op for partitions that weren't updated, it still incurs some overhead per partition to determine that it's a no-op. This can add up significantly when the table has lots of partitions.
If the optimize operation includes z-order, subsequent z-order operations won't be no-ops even if the partitions weren't updated
Describe the solution you'd like
A helper function to find out which partitions have been updated between some time period, for example
The exclude_optimize_operations flag is necessary because optimize operations are themselves update operations. If they are not excluded, they can cause a feedback loop: every optimize run shows up in the next output and triggers another optimize of the same partitions.
All the information needed for this feature should be available in the transaction log.
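As a rough illustration of how the transaction log could be used for this, here is a minimal sketch in plain Python. The function name `updated_partitions` and its parameters are hypothetical, not an existing mack or Delta Lake API. It assumes the table lives on a local filesystem, that each commit file in `_delta_log` carries a `commitInfo` action with `timestamp` (epoch milliseconds) and `operation` fields, and that checkpoints can be ignored because every commit in the time window is still available as a JSON file.

```python
import json
from pathlib import Path


def updated_partitions(table_path, start_ms, end_ms, exclude_optimize_operations=True):
    """Return the partitions touched by commits with start_ms <= timestamp < end_ms.

    Hypothetical helper: each partition is returned as a tuple of
    (column, value) pairs sorted by column name.
    """
    partitions = set()
    log_dir = Path(table_path) / "_delta_log"
    for commit_file in sorted(log_dir.glob("*.json")):
        # Plain commit files are named by zero-padded version number.
        if not commit_file.stem.isdigit():
            continue
        actions = [
            json.loads(line)
            for line in commit_file.read_text().splitlines()
            if line.strip()
        ]
        info = next((a["commitInfo"] for a in actions if "commitInfo" in a), None)
        # Assumption: commits without a usable commitInfo are skipped.
        if info is None or "timestamp" not in info:
            continue
        if not (start_ms <= info["timestamp"] < end_ms):
            continue
        # Skip optimize commits so they do not re-trigger optimization.
        if exclude_optimize_operations and info.get("operation") == "OPTIMIZE":
            continue
        for action in actions:
            for kind in ("add", "remove"):
                values = action.get(kind, {}).get("partitionValues")
                if values is not None:
                    partitions.add(tuple(sorted(values.items())))
    return partitions
```

The returned set could then drive a per-partition optimize call, so that only partitions written since the last run are compacted or z-ordered.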
Describe alternatives you've considered
Optimizing the entire table and accepting the overhead
Not sure what a good alternative is once z-order is used, however
Additional context
N/A
Willingness to contribute
Would you be willing to contribute an implementation of this feature?
Yes. I can contribute this feature independently.
Yes. I would be willing to contribute this feature with guidance from the mack community.
No. I cannot contribute this feature at this time.
Is your feature request related to a problem? Please describe.
Similar to MrPowers/mack#130 , but for non-Spark projects