Support customizing row group reading process in async reader #5141

Rachelint · 2023-11-29T03:06:23Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I found that decompression consumes a lot of CPU when running horaedb in production, so I decided to refactor our parquet memory cache to cache the decompressed pages rather than only the raw bytes. However, I found that this can only be done with the low-level APIs, which means a large amount of existing code cannot be reused.

I also found that greptimedb has a similar need for such a cache, and had to copy a lot of code from parquet to implement it (#2688).

I think we could turn the row group reading process into an interface that supports customization, so that the rest of the code can be reused.

Describe the solution you'd like

I have tried to implement a PoC for this; see #5142.

As I see it, the reading process for one row group can be summarized as follows:

1. Calculate the page ranges in the row group.
2. Fetch the pages according to those ranges (the compressed ranges).
3. Decompress the pages.
4. Decode the pages.
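
To make the motivation concrete, here is a minimal, std-only sketch of steps 2 and 3 with a cache placed after decompression. Everything in it is hypothetical and for illustration only: `PageKey`, `DecompressedPageCache`, `read_pages`, and the toy run-length codec `rle_decode` are not part of the parquet crate, and a real implementation would use the actual page types and codecs.

```rust
use std::collections::HashMap;
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical cache key: which file, and where the compressed page starts.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct PageKey {
    pub file_id: u64,
    pub offset: usize,
}

/// Hypothetical cache that stores pages *after* decompression.
#[derive(Default)]
pub struct DecompressedPageCache {
    pages: HashMap<PageKey, Arc<Vec<u8>>>,
    /// How many times decompression actually ran (for illustration only).
    pub decompressions: usize,
}

impl DecompressedPageCache {
    /// Returns the cached page, or decompresses `compressed` and caches the result.
    pub fn get_or_decompress<F>(&mut self, key: PageKey, compressed: &[u8], decompress: F) -> Arc<Vec<u8>>
    where
        F: FnOnce(&[u8]) -> Vec<u8>,
    {
        if let Some(page) = self.pages.get(&key) {
            return Arc::clone(page);
        }
        self.decompressions += 1;
        let page = Arc::new(decompress(compressed));
        self.pages.insert(key, Arc::clone(&page));
        page
    }
}

/// Toy stand-in for a real codec: decodes (count, byte) pairs.
pub fn rle_decode(input: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    for pair in input.chunks_exact(2) {
        out.extend(std::iter::repeat(pair[1]).take(pair[0] as usize));
    }
    out
}

/// Steps 2 and 3: fetch each compressed range, then decompress it through the cache.
/// The "file" is an in-memory slice here, so fetching is just slicing.
pub fn read_pages(
    file: &[u8],
    file_id: u64,
    ranges: &[Range<usize>],
    cache: &mut DecompressedPageCache,
) -> Vec<Arc<Vec<u8>>> {
    ranges
        .iter()
        .map(|range| {
            let key = PageKey { file_id, offset: range.start };
            cache.get_or_decompress(key, &file[range.clone()], rle_decode)
        })
        .collect()
}
```

With a cache like this, reading the same ranges a second time returns the already-decompressed pages and skips the codec entirely, which is the saving a raw-bytes cache cannot provide.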

We could define a trait like the one below. It would interact with the other parts as follows:

Calculate the page ranges and pass them to get_row_group, which fetches the compressed ranges and returns the in-memory row group (which implements the RowGroups trait).
Call column_chunks on the in-memory row group to generate the decompressed page iterator, the same as in the original process.

Users can then provide customized AsyncRowGroupReader and RowGroups implementations to achieve their goals (such as the decompressed page cache mentioned above).

pub trait AsyncRowGroupReader {
    type R: RowGroups;

    async fn get_row_group<T: AsyncFileReader + Send>(
        &mut self,
        input: &mut T,
        row_group_idx: usize,
        row_group_offsets: RowGroupRanges,
    ) -> Result<Self::R>;
}
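
As a rough illustration of how such an extension point could be used, here is a simplified, synchronous, std-only analogue of the proposed trait. It is a sketch under several assumptions, not the PoC itself: the traits `RowGroups`, `FileReader`, and `RowGroupReader` below are cut-down stand-ins (the real proposal is async and `RowGroups` yields page iterators), `RowGroupRanges` is assumed to be one optional byte range per column, and `SliceFile`, `CachingReader`, and `read_column` are hypothetical names.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Assumed shape: one optional byte range per column of the row group.
pub type RowGroupRanges = Vec<Option<Range<usize>>>;

/// Cut-down stand-in for the row group abstraction.
pub trait RowGroups {
    fn column_chunk(&self, i: usize) -> Option<&[u8]>;
}

/// Cut-down, synchronous stand-in for the file reader.
pub trait FileReader {
    fn get_bytes(&mut self, range: Range<usize>) -> Vec<u8>;
}

/// Synchronous analogue of the proposed `AsyncRowGroupReader`.
pub trait RowGroupReader {
    type R: RowGroups;
    fn get_row_group<T: FileReader>(
        &mut self,
        input: &mut T,
        row_group_idx: usize,
        ranges: RowGroupRanges,
    ) -> Self::R;
}

pub struct InMemoryRowGroup {
    columns: Vec<Option<Vec<u8>>>,
}

impl RowGroups for InMemoryRowGroup {
    fn column_chunk(&self, i: usize) -> Option<&[u8]> {
        self.columns.get(i)?.as_deref()
    }
}

/// In-memory "file" that counts how many fetches were issued.
pub struct SliceFile {
    pub data: Vec<u8>,
    pub reads: usize,
}

impl FileReader for SliceFile {
    fn get_bytes(&mut self, range: Range<usize>) -> Vec<u8> {
        self.reads += 1;
        self.data[range].to_vec()
    }
}

/// User-supplied reader: caches column chunks per (row group, column).
#[derive(Default)]
pub struct CachingReader {
    cache: HashMap<(usize, usize), Vec<u8>>,
}

impl RowGroupReader for CachingReader {
    type R = InMemoryRowGroup;

    fn get_row_group<T: FileReader>(
        &mut self,
        input: &mut T,
        row_group_idx: usize,
        ranges: RowGroupRanges,
    ) -> InMemoryRowGroup {
        let columns = ranges
            .into_iter()
            .enumerate()
            .map(|(col, range)| {
                let range = range?; // column not selected: nothing to fetch
                let bytes = self
                    .cache
                    .entry((row_group_idx, col))
                    .or_insert_with(|| input.get_bytes(range));
                Some(bytes.clone())
            })
            .collect();
        InMemoryRowGroup { columns }
    }
}

/// Generic code that stays the same regardless of which reader is plugged in.
pub fn read_column<G: RowGroupReader, T: FileReader>(
    reader: &mut G,
    input: &mut T,
    row_group_idx: usize,
    ranges: RowGroupRanges,
    col: usize,
) -> Option<Vec<u8>> {
    let row_group = reader.get_row_group(input, row_group_idx, ranges);
    row_group.column_chunk(col).map(|chunk| chunk.to_vec())
}
```

The point of the sketch is that `read_column` is written once against the trait, while the caching policy lives entirely in the user-provided `CachingReader`.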

Describe alternatives you've considered

Additional context


Rachelint · 2023-11-29T03:21:50Z

@alamb @tustvold would you mind giving some advice?

tustvold · 2023-11-29T10:49:09Z

Thank you for this, I am away for the next few days but I'll make some time next week to have a play and see how we can accommodate this use case

Rachelint · 2023-11-29T12:09:02Z

> Thank you for this, I am away for the next few days but I'll make some time next week to have a play and see how we can accommodate this use case

Thanks a lot. I will wait for your advice, and I hope I can help with it.

Rachelint added the enhancement Any new improvement worthy of a entry in the changelog label Nov 29, 2023

tustvold mentioned this issue Mar 17, 2024

Low-Level Arrow Parquet Reader #5522

Open
