Add ParquetMetaDataReader #6431

Merged 12 commits into apache:master on Sep 24, 2024

Conversation

etseidl (Contributor) commented Sep 20, 2024:

Which issue does this PR close?

Part of #6002.

Rationale for this change

Consolidate Parquet metadata parsing into a single API. See discussion in #6392 for additional context.

What changes are included in this PR?

Adds the ParquetMetaDataReader struct.

Are there any user-facing changes?

No
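
For orientation, here is a minimal sketch of how the new struct is used, assembled from the doc examples quoted later in this review. The file name is illustrative, and the method the quoted docs call `parse` ends up renamed `parse_and_finish` during the review.

```rust
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaDataReader;
use std::fs::File;

fn read_metadata() -> Result<()> {
    // Any `ChunkReader` works here; a local file is the simplest case.
    let file = File::open("example.parquet")?;

    // Request the optional page indexes in addition to the footer metadata.
    let metadata = ParquetMetaDataReader::new()
        .with_page_indexes(true)
        .parse_and_finish(&file)?;

    println!("row groups: {}", metadata.num_row_groups());
    Ok(())
}
```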

The github-actions bot added the parquet (Changes to the parquet crate) label on Sep 20, 2024.
Comment on lines +240 to +243
// TODO(ets): what is the correct behavior for missing page indexes? MetadataLoader would
// leave them as `None`, while the parser in `index_reader::read_columns_indexes` returns a
// vector of empty vectors.
// I think it's best to leave them as `None`.
etseidl (author):

This is the remaining outstanding issue from #6392. My preference is to leave the page indexes as None if they are not present, but this differs from current behavior.

A reviewer (Contributor):

I agree let's leave them as None if they aren't present.

alamb (Contributor) left a comment:

Thank you @etseidl -- I think this looks amazing 😍 . It is really nicely documented and structured.

I got about 1/3 of the way through and will return to it tomorrow.

cc @adriangb and @XiangpengHao, who I think are also interested in these APIs

It would be cool (we can do it as a follow-on PR) to update the "table of contents" style documentation here:

[screenshot of the referenced documentation omitted]

/// Given a [`ChunkReader`], parse and return the [`ParquetMetaData`] in a single pass.
///
/// If `reader` is [`Bytes`] based, then the buffer must contain sufficient bytes to complete
A reviewer (Contributor):

It would be really neat if @adriangb could comment / provide an example of using this API even when we don't have the entire file (aka faking out the offsets). Definitely as a follow-on.

adriangb (Contributor) commented Sep 24, 2024:

A summary:

  1. You have the Parquet bytes in memory. From those you load the ParquetMetaData, including the page indexes.
  2. You use the new writer to write the ParquetMetaData to bytes and store those bytes in a K/V store.
  3. When you want to load the file again, you start by getting the metadata bytes from the K/V store.
  4. You decode those bytes using this new API, passing in the original file size to adjust the offsets.
  5. You now have a ParquetMetaData in memory that you can pass back, so upstream code can decide whether it even needs to read the rest of the file, or which pages it needs to read.

This should be very similar to what's going on in #6081, but I'm a bit confused about what's happening there.
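
A rough sketch of steps 3-5 above, under the assumption that the stored metadata bytes can be handed to the sized parse API as if they were the tail of the original file. The function and parameter names (`load_cached_metadata`, `original_file_size`) are illustrative, and the exact offset and error semantics should be checked against the reader's documentation.

```rust
use bytes::Bytes;
use parquet::errors::Result;
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

// `metadata_bytes` are the bytes fetched from the K/V store;
// `original_file_size` is the length of the Parquet file they came from.
fn load_cached_metadata(metadata_bytes: Bytes, original_file_size: usize) -> Result<ParquetMetaData> {
    let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
    // The original file size lets the reader adjust byte offsets recorded in
    // the metadata (e.g. for the page indexes) even though `metadata_bytes`
    // is only a suffix of the file.
    reader.try_parse_sized(&metadata_bytes, original_file_size)?;
    reader.finish()
}
```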

(Two resolved review threads on parquet/src/file/metadata/reader.rs, now outdated.)
/// .with_page_indexes(true)
/// .parse(&file).unwrap();
/// ```
pub fn parse<R: ChunkReader>(mut self, reader: &R) -> Result<ParquetMetaData> {
A reviewer (Contributor):

I found the fact that parse consumes self slightly confusing, given that try_parse does not.

Maybe we could call it build() or parse_and_finish to reflect that it consumes self and returns the finished ParquetMetaData 🤔

etseidl (author):

For now I'll opt for parse_and_finish (and load_and_finish below). Happy to change if there's a better suggestion.
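
For readers following along, a short sketch of the naming outcome: the consuming, one-shot call versus the two-step `try_parse` / `finish` pair (file handling here is illustrative).

```rust
use parquet::errors::Result;
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};
use std::fs::File;

// One-shot: consumes the reader and returns the finished metadata, hence the
// `_and_finish` suffix.
fn one_shot(file: &File) -> Result<ParquetMetaData> {
    ParquetMetaDataReader::new().parse_and_finish(file)
}

// Two-step: borrows the reader mutably, so it can be retried or inspected
// before `finish` hands back the metadata.
fn two_step(file: &File) -> Result<ParquetMetaData> {
    let mut reader = ParquetMetaDataReader::new();
    reader.try_parse(file)?;
    reader.finish()
}
```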

(A resolved review thread on parquet/src/file/metadata/reader.rs.)
/// is a [`Bytes`] struct that does not contain the entire file. This information is necessary
/// when the page indexes are desired. `reader` must have access to the Parquet footer.
///
/// Using this function also allows for retrying with a larger buffer.
A reviewer (Contributor):

I think it would help here to document what errors are returned (specifically, how "needs more buffer" is communicated).

I see it is partly covered in the example.

etseidl (author):

Added.
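
A hedged sketch of the retry flow this doc comment alludes to: fetch a small suffix of the file, attempt a sized parse, and fetch a larger suffix on failure. The `fetch_suffix` helper and the 64 KiB starting size are hypothetical, and a real caller would match on the specific error the reader documents for "not enough bytes" rather than retrying on any error as done here.

```rust
use bytes::Bytes;
use parquet::errors::Result;
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

fn read_with_retry(
    fetch_suffix: impl Fn(usize) -> Bytes, // hypothetical: returns the last `n` bytes of the file
    file_size: usize,
) -> Result<ParquetMetaData> {
    let mut suffix_len = 64 * 1024;
    loop {
        let bytes = fetch_suffix(suffix_len.min(file_size));
        let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
        match reader.try_parse_sized(&bytes, file_size) {
            Ok(()) => return reader.finish(),
            // If even the whole file was not enough, give up.
            Err(e) if suffix_len >= file_size => return Err(e),
            // Otherwise assume more bytes were needed and fetch a larger suffix.
            Err(_) => suffix_len = (suffix_len * 2).min(file_size),
        }
    }
}
```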

etseidl (author) commented Sep 23, 2024:

FWIW, the CI errors appear unrelated to this PR.

alamb (Contributor) left a comment:

TLDR is I think this looks great. Thanks again @etseidl.

I also think that the POC done in #6392 gives me confidence that this API is sufficient to replace the existing uses in the crate with this new structure.

I am also quite excited about the potential use cases this API opens up (like being able to create Rust structures for the metadata of only certain row groups).

I updated my "store metadata outside the file" example with this API here: #6081 and it was 👨‍🍳 👌 very nice.

I wonder if we should list out somewhere on a ticket the APIs that should be consolidated / deprecated as follow-on PRs (e.g. decode_footer, decode_metadata, etc.)?

etseidl (author) commented Sep 24, 2024:

Thanks @alamb 🙏

> I wonder if we should list out somewhere on a ticket the APIs that should be consolidated / deprecated as follow-on PRs (e.g. decode_footer, decode_metadata, etc.)?

Yes, that's a good idea. The two obvious places are footer.rs and MetadataLoader, but the arrow readers also have pockets of footer decoding in them. I'll go through #6392 and come up with a list that can go into an issue.

Added #6447

etseidl (author) commented Sep 24, 2024:

@alamb I am still concerned about #6431 (comment). I think it's fine for this new API to follow what MetadataLoader does (i.e. leave the page indexes in ParquetMetaData set to None if they are not present in the file), but elsewhere missing page indexes are left as empty Vecs. I believe changing this behavior would be breaking, so it will have to be deferred until 54.0.0. Do you agree?

alamb (Contributor) commented Sep 24, 2024:

Given that the POC has been open for a while, as has this PR, I don't think there is any reason to hold off merging, so here we go!

Thanks again @etseidl

alamb merged commit e67f17e into apache:master on Sep 24, 2024 (17 of 18 checks passed).
alamb (Contributor) commented Sep 24, 2024:

> but elsewhere missing page indexes are left as empty Vecs. I believe changing this behavior would be breaking, so it will have to be deferred until 54.0.0. Do you agree?

Maybe as part of #6447 we can add some sort of workaround to switch to vec![] to preserve the existing behavior until the next breaking release, when we can change it.

etseidl (author) commented Sep 24, 2024:

> > but elsewhere missing page indexes are left as empty Vecs. I believe changing this behavior would be breaking, so it will have to be deferred until 54.0.0. Do you agree?
>
> Maybe as part of #6447 we can add some sort of workaround to switch to vec![] to preserve the existing behavior until the next breaking release, when we can change it.

Sounds good. I already have the second half planned for the breaking changes. I can add the workaround to the list of non-breaking changes in #6447.
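
To make the distinction discussed above concrete, here is a small hedged sketch of what a caller sees: with this reader, absent page indexes surface as `None` from the ParquetMetaData accessors, whereas the older code paths produced empty vectors.

```rust
use parquet::file::metadata::ParquetMetaData;

fn describe_page_indexes(metadata: &ParquetMetaData) {
    match metadata.column_index() {
        // New reader's behavior: no column index in the file (or not requested).
        None => println!("no column index available"),
        // Older code paths could instead yield Some of an empty Vec here.
        Some(index) => println!("column index entries for {} row groups", index.len()),
    }
}
```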
