Support customizing row group reading process in async reader #5141

Open
Rachelint opened this issue Nov 29, 2023 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@Rachelint
Contributor

Rachelint commented Nov 29, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I found that decompression consumes a lot of CPU when running horaedb in production, so I decided to refactor our parquet memory cache to cache the decompressed pages rather than only the raw bytes. However, I found that this can only be done with the low-level APIs, which means a large amount of existing code cannot be reused.

I also found that greptimedb has a similar need for such a cache, and copied a lot of code from parquet to implement it (#2688).

I think we could turn the row group reading process into an interface that supports customization, so that the rest of the code can be reused.

Describe the solution you'd like

I have tried to implement a PoC for this; see #5142.

As I see it, reading one row group can be summarized as the following steps:

  • calculate the page ranges in the row group.
  • fetch the pages according to those (compressed) ranges.
  • decompress the pages.
  • decode the pages.
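
The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the parquet crate's implementation: every function name here (`page_ranges`, `fetch_pages`, `decompress`, `decode`, `read_row_group`) is hypothetical, the "file" is an in-memory byte slice, and a toy run-length codec stands in for a real codec such as Snappy or ZSTD.

```rust
use std::ops::Range;

/// Step 1: derive the byte range of every page from the page lengths.
/// (Hypothetical helper; real readers take these from the file metadata.)
fn page_ranges(page_lens: &[usize]) -> Vec<Range<usize>> {
    let mut start = 0;
    page_lens
        .iter()
        .map(|&len| {
            let range = start..start + len;
            start += len;
            range
        })
        .collect()
}

/// Step 2: fetch the compressed bytes for each range from the in-memory "file".
fn fetch_pages(file: &[u8], ranges: &[Range<usize>]) -> Vec<Vec<u8>> {
    ranges.iter().map(|r| file[r.clone()].to_vec()).collect()
}

/// Step 3: decompress one page. Toy codec (an assumption for illustration):
/// each (count, value) byte pair expands to `count` copies of `value`.
fn decompress(page: &[u8]) -> Vec<u8> {
    page.chunks_exact(2)
        .flat_map(|pair| std::iter::repeat(pair[1]).take(pair[0] as usize))
        .collect()
}

/// Step 4: decode the decompressed bytes into values (here, one byte per value).
fn decode(bytes: &[u8]) -> Vec<u32> {
    bytes.iter().map(|&b| u32::from(b)).collect()
}

/// Runs the four steps for one row group and returns the values of each page.
fn read_row_group(file: &[u8], page_lens: &[usize]) -> Vec<Vec<u32>> {
    let ranges = page_ranges(page_lens);
    fetch_pages(file, &ranges)
        .iter()
        .map(|page| decode(&decompress(page)))
        .collect()
}
```

In this sketch, step 3 is the only CPU-heavy stage, which is why a cache placed between steps 3 and 4 would avoid repeating that work.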

Maybe we can define a trait like the one below. It interacts with the other parts as follows:

  • calculate the page ranges and pass them to get_row_group, which fetches the compressed ranges and returns the in-memory row group (which implements the RowGroups trait).
  • call column_chunks on the in-memory row group to generate the decompressed page iterator, the same as in the original process.

Users can then provide customized AsyncRowGroupReader and RowGroups implementations to meet their own needs (such as the decompressed page cache mentioned above).

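/// Proposed extension point: customizes how a single row group is fetched.
pub trait AsyncRowGroupReader {
    /// In-memory row group handed back to the reader; it must implement `RowGroups`.
    type R: RowGroups;

    /// Fetches the compressed ranges in `row_group_offsets` for row group
    /// `row_group_idx` from `input` and builds the in-memory row group.
    async fn get_row_group<T: AsyncFileReader + Send>(
        &mut self,
        input: &mut T,
        row_group_idx: usize,
        row_group_offsets: RowGroupRanges,
    ) -> Result<Self::R>;
}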
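
To make the decompressed page cache use case concrete, here is a minimal sketch of how a custom reader might plug in. It is a deliberate simplification and not the proposed API: the trait `RowGroupReader` below is a hypothetical synchronous stand-in for `AsyncRowGroupReader`, it returns plain decompressed byte vectors instead of a `RowGroups` implementation, and `CachingReader`, `decompress`, and the `decompressions` counter are all made up for illustration. The codec is a toy run-length scheme.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Hypothetical synchronous simplification of the proposed extension point:
/// given the compressed page ranges of a row group, return its decompressed pages.
trait RowGroupReader {
    fn get_row_group(
        &mut self,
        file: &[u8],
        row_group_idx: usize,
        ranges: &[Range<usize>],
    ) -> Vec<Vec<u8>>;
}

/// Toy codec standing in for a real one (an assumption for illustration):
/// each (count, value) byte pair expands to `count` copies of `value`.
fn decompress(page: &[u8]) -> Vec<u8> {
    page.chunks_exact(2)
        .flat_map(|pair| std::iter::repeat(pair[1]).take(pair[0] as usize))
        .collect()
}

/// Reader that caches decompressed pages, keyed by (row group index, page start offset).
#[derive(Default)]
struct CachingReader {
    cache: HashMap<(usize, usize), Vec<u8>>,
    /// Number of pages actually decompressed, so cache hits are observable.
    decompressions: usize,
}

impl RowGroupReader for CachingReader {
    fn get_row_group(
        &mut self,
        file: &[u8],
        row_group_idx: usize,
        ranges: &[Range<usize>],
    ) -> Vec<Vec<u8>> {
        let mut pages = Vec::with_capacity(ranges.len());
        for range in ranges {
            let key = (row_group_idx, range.start);
            if !self.cache.contains_key(&key) {
                // Cache miss: read the compressed bytes and pay the decompression cost once.
                self.decompressions += 1;
                self.cache.insert(key, decompress(&file[range.clone()]));
            }
            // Cache hit (or freshly inserted entry): hand out the decompressed page.
            pages.push(self.cache[&key].clone());
        }
        pages
    }
}
```

Reading the same row group twice through `CachingReader` decompresses each page only once; the second call is served entirely from the cache. A real implementation would also need eviction and would key the cache by file as well as offset.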

Describe alternatives you've considered

Additional context

@Rachelint Rachelint added the enhancement Any new improvement worthy of a entry in the changelog label Nov 29, 2023
@Rachelint
Contributor Author

Rachelint commented Nov 29, 2023

@alamb @tustvold would you mind giving some advice?

@tustvold
Contributor

Thank you for this, I am away for the next few days but I'll make some time next week to have a play and see how we can accommodate this use case

@Rachelint
Contributor Author

Rachelint commented Nov 29, 2023

> Thank you for this, I am away for the next few days but I'll make some time next week to have a play and see how we can accommodate this use case

Thanks a lot. I'll wait for your advice, and I hope I can help implement it.
