
ParquetRecordBatchStream API to fetch the next row group while decoding #6559

Open
masonh22 opened this issue Oct 14, 2024 · 3 comments · May be fixed by #6676 or #6907
Labels
enhancement (Any new improvement worthy of an entry in the changelog) · help wanted · parquet (Changes to the parquet crate)

Comments

@masonh22

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I've noticed low CPU utilization when reading from low-bandwidth filesystems
with a ParquetRecordBatchStream. This appears to be because the stream fetches
row group data on demand rather than ahead of time. In my specific scenario,
I'm reading a Parquet file from S3 with four 128MB row groups. It takes ~2
seconds to fetch each row group's data and ~500ms to decode it, so reading
and decoding the entire file takes around 10 seconds.

Describe the solution you'd like
I'd like to add the option for ParquetRecordBatchStream to fetch the data for
the next row group while decoding data for the current row group.
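Something like the following, where `with_prefetch_row_groups` is a hypothetical builder method used only to illustrate the option (the rest is the existing async builder API):

```rust
use parquet::arrow::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

let file = File::open("data.parquet").await?;
let stream = ParquetRecordBatchStreamBuilder::new(file)
    .await?
    // hypothetical: fetch row group N+1's bytes while decoding row group N
    .with_prefetch_row_groups(1)
    .build()?;
```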

Describe alternatives you've considered

Additional context

@masonh22 added the enhancement label Oct 14, 2024
@alamb
Contributor

alamb commented Oct 18, 2024

This is a very interesting idea -- I think it is a good one as this is quite a common case.

@alamb added the parquet and help wanted labels Oct 18, 2024
@alamb
Contributor

alamb commented Oct 18, 2024

Right now users can implement their own pre-fetching logic (e.g. to batch object store requests) but they don't know what ranges of data will be used.

Maybe the API could be something that calculates, from the selected RowGroupMetaData, what file byte ranges will be needed.

To start this project, I would recommend that we create a motivating example (with some sort of mocked-out S3 thing) showing how we want to read the data.
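A rough sketch of what that calculation could look like (`required_ranges` is a hypothetical helper name; `columns()` and `byte_range()` are existing metadata accessors in the parquet crate):

```rust
use parquet::file::metadata::RowGroupMetaData;
use std::ops::Range;

/// Hypothetical helper: compute the file byte ranges covered by the
/// column chunks of the selected row groups, suitable for handing to
/// a batched object store request.
fn required_ranges(selected: &[&RowGroupMetaData]) -> Vec<Range<u64>> {
    selected
        .iter()
        .flat_map(|rg| {
            rg.columns().iter().map(|col| {
                let (start, len) = col.byte_range();
                start..start + len
            })
        })
        .collect()
}
```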

@masonh22
Author

I was thinking that the pre-fetch logic could be pushed into the ParquetRecordBatchStream itself. I made a proof-of-concept of this here: masonh22@4f682b1

I didn't add any performance tests, but I confirmed that this sped things up in my own project by effectively hiding the cost of decoding, since fetching the data takes longer than decoding it.
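The general shape of the overlap the proof-of-concept achieves inside the stream, as a standalone toy (the fetch/decode stubs and timings below are stand-ins matching the scenario above, not code from the linked commit):

```rust
use bytes::Bytes;
use std::time::Duration;

// Stand-in for an object-store GET of one row group's bytes (~2s above).
async fn fetch_row_group(_i: usize) -> Bytes {
    tokio::time::sleep(Duration::from_secs(2)).await;
    Bytes::new()
}

// Stand-in for Arrow decoding of one row group (~500ms above).
fn decode_row_group(_bytes: Bytes) {
    std::thread::sleep(Duration::from_millis(500));
}

#[tokio::main]
async fn main() {
    let n_row_groups = 4;
    let mut next = tokio::spawn(fetch_row_group(0));
    for i in 0..n_row_groups {
        let bytes = next.await.unwrap();
        if i + 1 < n_row_groups {
            // Start fetching the next row group before decoding the
            // current one, so network time hides the decode time.
            next = tokio::spawn(fetch_row_group(i + 1));
        }
        decode_row_group(bytes);
    }
    // Sequential: 4 * (2s + 0.5s) = 10s.
    // Pipelined:  4 * 2s + 0.5s ≈ 8.5s, bounded by fetch time.
}
```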
