
ParquetRecordBatchStream API to fetch the next row group while decoding #6559

Open
masonh22 opened this issue Oct 14, 2024 · 3 comments · May be fixed by #6676 or #6907
Labels
enhancement (Any new improvement worthy of an entry in the changelog) · help wanted · parquet (Changes to the parquet crate)

Comments

@masonh22

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I've noticed low CPU utilization when reading from low-bandwidth filesystems
with a ParquetRecordBatchStream. This appears to be because the stream fetches
row group data on demand rather than ahead of time. In my specific scenario,
I'm reading a Parquet file from S3 with four 128MB row groups. It takes ~2
seconds to fetch each row group's data and ~500ms to decode it, so reading
and decoding the entire file takes around 10 seconds.

Describe the solution you'd like
I'd like to add the option for ParquetRecordBatchStream to fetch the data for
the next row group while decoding data for the current row group.
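Something like the following, where `with_prefetch_row_groups` is a hypothetical builder method used only to illustrate the option (the rest is the existing async builder API):

```rust
use parquet::arrow::ParquetRecordBatchStreamBuilder;
use tokio::fs::File;

let file = File::open("data.parquet").await?;
let stream = ParquetRecordBatchStreamBuilder::new(file)
    .await?
    // hypothetical: fetch row group N+1's bytes while decoding row group N
    .with_prefetch_row_groups(1)
    .build()?;
```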

Describe alternatives you've considered

Additional context

@masonh22 added the enhancement label Oct 14, 2024
@alamb
Contributor

alamb commented Oct 18, 2024

This is a very interesting idea -- I think it is a good one as this is quite a common case.

@alamb added the parquet and help wanted labels Oct 18, 2024
@alamb
Contributor

alamb commented Oct 18, 2024

Right now users can implement their own pre-fetching logic (e.g. to batch object store requests) but they don't know what ranges of data will be used.

Maybe the API could be something that calculates, from the selected RowGroupMetaData, what file byte ranges will be needed.

To start this project, I would recommend that we create a motivating example (with some sort of mocked-out S3 thing) showing how we want to read the data.
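A rough sketch of what that calculation could look like (`required_ranges` is a hypothetical helper name; `columns()` and `byte_range()` are existing metadata accessors in the parquet crate):

```rust
use parquet::file::metadata::RowGroupMetaData;
use std::ops::Range;

/// Hypothetical helper: compute the file byte ranges covered by the
/// column chunks of the selected row groups, suitable for handing to
/// a batched object store request.
fn required_ranges(selected: &[&RowGroupMetaData]) -> Vec<Range<u64>> {
    selected
        .iter()
        .flat_map(|rg| {
            rg.columns().iter().map(|col| {
                let (start, len) = col.byte_range();
                start..start + len
            })
        })
        .collect()
}
```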

@masonh22
Author

I was thinking that the pre-fetch logic could be pushed into the ParquetRecordBatchStream itself. I made a proof-of-concept of this here: masonh22@4f682b1

I didn't add any performance tests, but I confirmed that this sped things up in my own project by effectively hiding the cost of decoding, since fetching the data takes longer than decoding it.
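The general shape of the overlap the proof-of-concept achieves inside the stream, as a standalone toy (the fetch/decode stubs and timings below are stand-ins matching the scenario above, not code from the linked commit):

```rust
use bytes::Bytes;
use std::time::Duration;

// Stand-in for an object-store GET of one row group's bytes (~2s above).
async fn fetch_row_group(_i: usize) -> Bytes {
    tokio::time::sleep(Duration::from_secs(2)).await;
    Bytes::new()
}

// Stand-in for Arrow decoding of one row group (~500ms above).
fn decode_row_group(_bytes: Bytes) {
    std::thread::sleep(Duration::from_millis(500));
}

#[tokio::main]
async fn main() {
    let n_row_groups = 4;
    let mut next = tokio::spawn(fetch_row_group(0));
    for i in 0..n_row_groups {
        let bytes = next.await.unwrap();
        if i + 1 < n_row_groups {
            // Start fetching the next row group before decoding the
            // current one, so network time hides the decode time.
            next = tokio::spawn(fetch_row_group(i + 1));
        }
        decode_row_group(bytes);
    }
    // Sequential: 4 * (2s + 0.5s) = 10s.
    // Pipelined:  4 * 2s + 0.5s ≈ 8.5s, bounded by fetch time.
}
```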
