Add ParquetMetaDataReader #6431

Merged 12 commits into apache:master on Sep 24, 2024

Conversation

etseidl (Contributor) commented Sep 20, 2024:

Which issue does this PR close?

Part of #6002.

Rationale for this change

Consolidate Parquet metadata parsing into a single API. See discussion in #6392 for additional context.

What changes are included in this PR?

Adds the ParquetMetaDataReader struct.

Are there any user-facing changes?

No
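
For orientation, here is a minimal sketch of how the new struct is used, assembled from the doc examples quoted later in this review. The file name is illustrative, and the method the quoted docs call `parse` ends up renamed `parse_and_finish` during the review.

```rust
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaDataReader;
use std::fs::File;

fn read_metadata() -> Result<()> {
    // Any `ChunkReader` works here; a local file is the simplest case.
    let file = File::open("example.parquet")?;

    // Request the optional page indexes in addition to the footer metadata.
    let metadata = ParquetMetaDataReader::new()
        .with_page_indexes(true)
        .parse_and_finish(&file)?;

    println!("row groups: {}", metadata.num_row_groups());
    Ok(())
}
```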

The github-actions bot added the parquet (Changes to the parquet crate) label on Sep 20, 2024.
Comment on lines +240 to +243
// TODO(ets): what is the correct behavior for missing page indexes? MetadataLoader would
// leave them as `None`, while the parser in `index_reader::read_columns_indexes` returns a
// vector of empty vectors.
// I think it's best to leave them as `None`.
etseidl (author):

This is the remaining outstanding issue from #6392. My preference is to leave the page indexes as None if they are not present, but this differs from current behavior.

A reviewer (Contributor):

I agree let's leave them as None if they aren't present.

alamb (Contributor) left a comment:

Thank you @etseidl -- I think this looks amazing 😍 . It is really nicely documented and structured.

I got about 1/3 of the way through and will return to it tomorrow.

cc @adriangb and @XiangpengHao, who I think are also interested in these APIs

It would be cool (we can do it as a follow-on PR) to update the "table of contents" style documentation here:

[screenshot of the referenced documentation omitted]

/// Given a [`ChunkReader`], parse and return the [`ParquetMetaData`] in a single pass.
///
/// If `reader` is [`Bytes`] based, then the buffer must contain sufficient bytes to complete
A reviewer (Contributor):

It would be really neat if @adriangb could comment / provide an example of using this API even when we don't have the entire file (aka faking out the offsets). Definitely as a follow-on.

adriangb (Contributor) commented Sep 24, 2024:

A summary:

  1. You have the Parquet bytes in memory. From those you load the ParquetMetaData, including the page indexes.
  2. You use the new writer to write the ParquetMetaData to bytes and store those bytes in a K/V store.
  3. When you want to load the file again, you start by getting the metadata bytes from the K/V store.
  4. You decode those bytes using this new API, passing in the original file size to adjust the offsets.
  5. You now have a ParquetMetaData in memory that you can pass back, so upstream code can decide whether it even needs to read the rest of the file, or which pages it needs to read.

This should be very similar to what's going on in #6081, but I'm a bit confused about what's happening there.
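
A rough sketch of steps 3-5 above, under the assumption that the stored metadata bytes can be handed to the sized parse API as if they were the tail of the original file. The function and parameter names (`load_cached_metadata`, `original_file_size`) are illustrative, and the exact offset and error semantics should be checked against the reader's documentation.

```rust
use bytes::Bytes;
use parquet::errors::Result;
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

// `metadata_bytes` are the bytes fetched from the K/V store;
// `original_file_size` is the length of the Parquet file they came from.
fn load_cached_metadata(metadata_bytes: Bytes, original_file_size: usize) -> Result<ParquetMetaData> {
    let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
    // The original file size lets the reader adjust byte offsets recorded in
    // the metadata (e.g. for the page indexes) even though `metadata_bytes`
    // is only a suffix of the file.
    reader.try_parse_sized(&metadata_bytes, original_file_size)?;
    reader.finish()
}
```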

(Two resolved review threads on parquet/src/file/metadata/reader.rs, now outdated.)
/// .with_page_indexes(true)
/// .parse(&file).unwrap();
/// ```
pub fn parse<R: ChunkReader>(mut self, reader: &R) -> Result<ParquetMetaData> {
A reviewer (Contributor):

I found the fact that parse consumes self slightly confusing, given that try_parse does not.

Maybe we could call it build() or parse_and_finish to reflect that it consumes self and returns the finished ParquetMetaData 🤔

etseidl (author):

For now I'll opt for parse_and_finish (and load_and_finish below). Happy to change if there's a better suggestion.
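
For readers following along, a short sketch of the naming outcome: the consuming, one-shot call versus the two-step `try_parse` / `finish` pair (file handling here is illustrative).

```rust
use parquet::errors::Result;
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};
use std::fs::File;

// One-shot: consumes the reader and returns the finished metadata, hence the
// `_and_finish` suffix.
fn one_shot(file: &File) -> Result<ParquetMetaData> {
    ParquetMetaDataReader::new().parse_and_finish(file)
}

// Two-step: borrows the reader mutably, so it can be retried or inspected
// before `finish` hands back the metadata.
fn two_step(file: &File) -> Result<ParquetMetaData> {
    let mut reader = ParquetMetaDataReader::new();
    reader.try_parse(file)?;
    reader.finish()
}
```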

(A resolved review thread on parquet/src/file/metadata/reader.rs.)
/// is a [`Bytes`] struct that does not contain the entire file. This information is necessary
/// when the page indexes are desired. `reader` must have access to the Parquet footer.
///
/// Using this function also allows for retrying with a larger buffer.
A reviewer (Contributor):

I think it would help here to document what errors are returned (specifically, how "needs more buffer" is communicated).

I see it is partly covered in the example.

etseidl (author):

Added.
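
A hedged sketch of the retry flow this doc comment alludes to: fetch a small suffix of the file, attempt a sized parse, and fetch a larger suffix on failure. The `fetch_suffix` helper and the 64 KiB starting size are hypothetical, and a real caller would match on the specific error the reader documents for "not enough bytes" rather than retrying on any error as done here.

```rust
use bytes::Bytes;
use parquet::errors::Result;
use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataReader};

fn read_with_retry(
    fetch_suffix: impl Fn(usize) -> Bytes, // hypothetical: returns the last `n` bytes of the file
    file_size: usize,
) -> Result<ParquetMetaData> {
    let mut suffix_len = 64 * 1024;
    loop {
        let bytes = fetch_suffix(suffix_len.min(file_size));
        let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
        match reader.try_parse_sized(&bytes, file_size) {
            Ok(()) => return reader.finish(),
            // If even the whole file was not enough, give up.
            Err(e) if suffix_len >= file_size => return Err(e),
            // Otherwise assume more bytes were needed and fetch a larger suffix.
            Err(_) => suffix_len = (suffix_len * 2).min(file_size),
        }
    }
}
```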

etseidl (author) commented Sep 23, 2024:

FWIW, the CI errors appear unrelated to this PR.

alamb (Contributor) left a comment:

TLDR is I think this looks great. Thanks again @etseidl.

I also think that the POC done in #6392 gives me confidence that this API is sufficient to replace the existing uses in the crate with this new structure.

I am also quite excited about the potential use cases this API opens up (like being able to create Rust structures for the metadata of only certain row groups).

I updated my "store metadata outside the file" example with this API here: #6081 and it was 👨‍🍳 👌 very nice.

I wonder if we should list out somewhere on a ticket the APIs that should be consolidated / deprecated as follow-on PRs (e.g. decode_footer, decode_metadata, etc.)?

etseidl (author) commented Sep 24, 2024:

Thanks @alamb 🙏

> I wonder if we should list out somewhere on a ticket the APIs that should be consolidated / deprecated as follow-on PRs (e.g. decode_footer, decode_metadata, etc.)?

Yes, that's a good idea. The two obvious places are footer.rs and MetadataLoader, but the arrow readers also have pockets of footer decoding in them. I'll go through #6392 and come up with a list that can go into an issue.

Added #6447

etseidl (author) commented Sep 24, 2024:

@alamb I am still concerned about #6431 (comment). I think it's fine for this new API to follow what MetadataLoader does (i.e. leave the page indexes in ParquetMetaData set to None if they are not present in the file), but elsewhere missing page indexes are left as empty Vecs. I believe changing this behavior would be breaking, so it will have to be deferred until 54.0.0. Do you agree?

alamb (Contributor) commented Sep 24, 2024:

Given that the POC has been open for a while, as has this PR, I don't think there is any reason to hold off merging, so here we go!

Thanks again @etseidl

alamb merged commit e67f17e into apache:master on Sep 24, 2024 (17 of 18 checks passed).
alamb (Contributor) commented Sep 24, 2024:

> but elsewhere missing page indexes are left as empty Vecs. I believe changing this behavior would be breaking, so it will have to be deferred until 54.0.0. Do you agree?

Maybe as part of #6447 we can add some sort of workaround to switch to vec![] to preserve the existing behavior until the next breaking release, when we can change it.

etseidl (author) commented Sep 24, 2024:

> > but elsewhere missing page indexes are left as empty Vecs. I believe changing this behavior would be breaking, so it will have to be deferred until 54.0.0. Do you agree?
>
> Maybe as part of #6447 we can add some sort of workaround to switch to vec![] to preserve the existing behavior until the next breaking release, when we can change it.

Sounds good. I already have the second half planned for the breaking changes. I can add the workaround to the list of non-breaking changes in #6447.
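
To make the distinction discussed above concrete, here is a small hedged sketch of what a caller sees: with this reader, absent page indexes surface as `None` from the ParquetMetaData accessors, whereas the older code paths produced empty vectors.

```rust
use parquet::file::metadata::ParquetMetaData;

fn describe_page_indexes(metadata: &ParquetMetaData) {
    match metadata.column_index() {
        // New reader's behavior: no column index in the file (or not requested).
        None => println!("no column index available"),
        // Older code paths could instead yield Some of an empty Vec here.
        Some(index) => println!("column index entries for {} row groups", index.len()),
    }
}
```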
