-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ParquetMetaDataReader
#6431
Conversation
// TODO(ets): what is the correct behavior for missing page indexes? MetadataLoader would | ||
// leave them as `None`, while the parser in `index_reader::read_columns_indexes` returns a | ||
// vector of empty vectors. | ||
// I think it's best to leave them as `None`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the remaining outstanding issue from #6392. My preference is to leave the page indexes as None
if they are not present, but this differs from current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree let's leave them as None if they aren't present.
… pq_metadata_reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @etseidl -- I think this looks amazing 😍 . It is really nicely documented and structured.
I got about 1/3 of the way through and will return to it tomorrow.
cc @adriangb and @XiangpengHao who I think are also interested in these apis
It would be cool (we can do it as a follow on PR), to update the "table of content" type of documentation here:
|
||
/// Given a [`ChunkReader`], parse and return the [`ParquetMetaData`] in a single pass. | ||
/// | ||
/// If `reader` is [`Bytes`] based, then the buffer must contain sufficient bytes to complete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be really neat if @adriangb could comment / provide an example of using this API even when we don't have the entire file (aka faking out the offsets). Definitely as a follow on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A summary:
- You have the parquet bytes in memory. From that you load the
ParquetMetaData
including page indexes. - Use the new writer to write
ParquetMetadata
to bytes and store the bytes in a K/V store. - When I want to load this file again I start by getting the metadata bytes from the K/V store.
- I decode those bytes using this new API, passing in the original file size to adjust the offsets.
- I now have a
ParquetMetaData
in memory that I can pass back so that upstream things can decide if they even need to read the rest of the file or what pages in the file they need to read.
This should be very similar to what's going on in #6081 but I'm a bit confused about what's going on there
parquet/src/file/metadata/reader.rs
Outdated
/// .with_page_indexes(true) | ||
/// .parse(&file).unwrap(); | ||
/// ``` | ||
pub fn parse<R: ChunkReader>(mut self, reader: &R) -> Result<ParquetMetaData> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the fact that parse consumed self was slightly confusing given try_parse did not.
Maybe we could call it build()
or parse_and_finish
to reflect it returns Self 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I'll opt for parse_and_finish
(and load_and_finish
below). Happy to change if there's a better suggestion.
/// is a [`Bytes`] struct that does not contain the entire file. This information is necessary | ||
/// when the page indexes are desired. `reader` must have access to the Parquet footer. | ||
/// | ||
/// Using this function also allows for retrying with a larger buffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would help here to document what errors are returned (specifically how "needs more buffer" is communicated)
I see it is partly covered in the example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
Co-authored-by: Andrew Lamb <[email protected]>
FWIW CI errors appear unrelated to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR is I think this looks great. Thanks again @etseidl
I also think that the POC done in #6392 gives me confidence that this API is sufficient to replace the existing uses in the crate with this new structure.
I am also quite excited about the potential usecases this API opens up (like potentially being able to only create Rust structures for the metadata for certain row groups)
I updated my "store metadata outside the file" example with this API here: #6081 and it was 👨🍳 👌 very nice.
I wonder if we should list out somewhere on a ticket the APIs that should be consolidated / deprecated as follow on PRs (e.g. decode_footer
, decode_metadata
, etc)?
Thanks @alamb 🙏
Yes, that's a good idea. The two obvious places are Added #6447 |
@alamb I am still concerned about #6431 (comment). I think it's fine for this new API to follow what |
Given that the POC has been open for a while as has this PR, I don't think there is any reason to hold off merging, so here we go! Thanks again @etseidl |
Maybe as part of #6447 we can add some sort of workaround to switch to |
Sounds good. I already have the second half planned for breaking changes. I can add the workaround to the list of nonbreaking changes in #6447. |
Which issue does this PR close?
Part of #6002.
Rationale for this change
Consolidate Parquet metadata parsing into a single API. See discussion in #6392 for additional context.
What changes are included in this PR?
Adds the
ParquetMetaDataReader
struct.Are there any user-facing changes?
No