Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I found that decompression costs a lot of CPU when using horaedb in production, so I decided to refactor our parquet memory cache to hold decompressed pages rather than only raw bytes. However, I found I could only implement this with the low-level APIs, without being able to reuse much of the existing code.
Actually, greptimedb has a similar need for such a cache, and had to copy a lot of code from parquet to implement it, see #2688.
I think we could make the row group reading process an interface, so that it can be customized while the rest of the code is reused.
Describe the solution you'd like
I have tried to implement a PoC for this, see #5142.
As I see it, reading one row group can be summarized in the following steps (a sketch of how they map onto today's low-level API follows the list):

1. Calculate the page ranges in the row group.
2. Fetch the pages according to those (compressed) ranges.
3. Decompress the pages.
4. Decode the pages.
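For context, here is a minimal sketch of how these steps surface in the current low-level API, assuming the whole file is already in memory as `Bytes`; the `read_column_pages` helper and the omission of error handling and page indexes are mine:

```rust
use std::sync::Arc;

use bytes::Bytes;
use parquet::column::page::PageReader;
use parquet::errors::Result;
use parquet::file::metadata::RowGroupMetaData;
use parquet::file::serialized_reader::SerializedPageReader;

/// Walk the decompressed pages of one column chunk of a row group.
fn read_column_pages(file: Bytes, rg: &RowGroupMetaData, col_idx: usize) -> Result<()> {
    // Steps 1-2: the reader locates the chunk's byte range in `file` from
    // the column chunk metadata and reads the page headers from it.
    let column = rg.column(col_idx);
    let mut pages =
        SerializedPageReader::new(Arc::new(file), column, rg.num_rows() as usize, None)?;

    // Step 3: every `get_next_page` call decompresses one page; step 4
    // (decoding pages into arrays) happens later in the arrow reader.
    while let Some(page) = pages.get_next_page()? {
        let _ = page; // a decompressed-page cache would be filled here
    }
    Ok(())
}
```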
Maybe we can define a trait for this step.
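For illustration, here is a minimal sketch of what such a trait could look like. The names `AsyncRowGroupReader` and `get_row_group` are from my description below; the exact signature, the `BoxFuture` style (mirroring the existing `AsyncFileReader` trait), and the assumption that `RowGroups` is publicly exposed are mine, not necessarily what the PoC in #5142 does:

```rust
use std::ops::Range;

use futures::future::BoxFuture;
use parquet::arrow::arrow_reader::RowGroups;
use parquet::errors::Result;
use parquet::file::metadata::RowGroupMetaData;

/// Hypothetical trait that makes fetching one row group into memory
/// pluggable, so callers can intercept the fetched or decompressed data.
pub trait AsyncRowGroupReader: Send {
    /// Fetch the compressed byte `ranges` of the row group described by
    /// `metadata`, and return it as an in-memory `RowGroups` impl whose
    /// `column_chunks` later yields decompressed page iterators.
    fn get_row_group<'a>(
        &'a mut self,
        metadata: &'a RowGroupMetaData,
        ranges: Vec<Range<usize>>,
    ) -> BoxFuture<'a, Result<Box<dyn RowGroups>>>;
}
```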
It would interact with the other parts like this:

1. Calculate the page ranges and pass them to `get_row_group`, which fetches the compressed ranges and returns the in-memory row group (which implements the `RowGroups` trait).
2. Call `column_chunks` on the in-memory row group to generate the decompressed page iterators, the same as in the original process.

Users can then provide customized `AsyncRowGroupReader` and `RowGroups` implementations to reach their own goals (such as the decompressed page cache mentioned above); a sketch of such a cache is below.
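To make the goal concrete, here is a hedged sketch of a user-side decompressed-page cache built on `RowGroups`. `CachedRowGroup`, `CachedPageIterator`, and `CachedPageReader` are hypothetical names, the `PageMetadata` returned by `peek_next_page` is simplified, and this assumes `RowGroups`, `PageIterator`, and `PageReader` are publicly implementable:

```rust
use std::collections::VecDeque;
use std::sync::Arc;

use parquet::arrow::arrow_reader::RowGroups;
use parquet::column::page::{Page, PageIterator, PageMetadata, PageReader};
use parquet::errors::Result;

/// A row group whose pages were decompressed earlier and kept in memory.
struct CachedRowGroup {
    /// Decompressed pages per leaf column, shared with a process-wide cache.
    columns: Vec<Arc<Vec<Page>>>,
    num_rows: usize,
}

impl RowGroups for CachedRowGroup {
    fn num_rows(&self) -> usize {
        self.num_rows
    }

    fn column_chunks(&self, i: usize) -> Result<Box<dyn PageIterator>> {
        // Replay the cached pages instead of fetching and decompressing
        // again; `Page` holds `Bytes` buffers, so the clones are cheap.
        let pages = self.columns[i].iter().cloned().collect();
        Ok(Box::new(CachedPageIterator {
            reader: Some(Box::new(CachedPageReader { pages })),
        }))
    }
}

/// Yields the single `PageReader` of the one cached column chunk.
struct CachedPageIterator {
    reader: Option<Box<dyn PageReader>>,
}

impl Iterator for CachedPageIterator {
    type Item = Result<Box<dyn PageReader>>;

    fn next(&mut self) -> Option<Self::Item> {
        self.reader.take().map(Ok)
    }
}

impl PageIterator for CachedPageIterator {}

/// Serves already-decompressed pages, so no decompression happens here.
struct CachedPageReader {
    pages: VecDeque<Page>,
}

impl Iterator for CachedPageReader {
    type Item = Result<Page>;

    fn next(&mut self) -> Option<Self::Item> {
        self.pages.pop_front().map(Ok)
    }
}

impl PageReader for CachedPageReader {
    fn get_next_page(&mut self) -> Result<Option<Page>> {
        Ok(self.pages.pop_front())
    }

    fn peek_next_page(&mut self) -> Result<Option<PageMetadata>> {
        // Simplified: a real impl should report accurate row/level counts.
        Ok(self.pages.front().map(|p| PageMetadata {
            num_rows: None,
            num_levels: None,
            is_dict: matches!(p, Page::DictionaryPage { .. }),
        }))
    }

    fn skip_next_page(&mut self) -> Result<()> {
        self.pages.pop_front();
        Ok(())
    }
}
```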
Describe alternatives you've considered
Additional context