Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding BloomFilter reading support to ParquetMetadataReader #6514

Open
alamb opened this issue Oct 6, 2024 · 3 comments
Open

Consider adding BloomFilter reading support to ParquetMetadataReader #6514

alamb opened this issue Oct 6, 2024 · 3 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

Comments

@alamb
Copy link
Contributor

alamb commented Oct 6, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Parquet now has the wonderful ParquetMetaDataReader structure from @adriangb and @etseidl

This handles reading the footer metadata as well as the page indexes.

@progval noted in #6505 (comment) that BloomFilters are similiar to the PageIndex, but are not currently read/written by the ParquetMetaDataReader

Describe the solution you'd like
I would like to be able to configure the ParquetMetaDataReader (and writer) to read BloomFilters as well

Describe alternatives you've considered
This might look something like

// read parquet metadata including page indexes
let file = open_parquet_file("some_path.parquet");
let mut reader = ParquetMetaDataReader::new()
    .with_bloom_filters(true);
reader.try_parse(&file).unwrap();
let metadata = reader.finish().unwrap();
// Somehow get access to the bloom filters (not sure what that API would look like)

Additional context

@alamb alamb added parquet Changes to the parquet crate enhancement Any new improvement worthy of a entry in the changelog labels Oct 6, 2024
@etseidl
Copy link
Contributor

etseidl commented Oct 7, 2024

I can see two paths forward here. In the near term we could add convenience functions to ParquetMetaDataReader/Writer to allow fetching/writing the bloom filters given an existing ParquetMetaData.

Longer term, I think we'd want to move the bloom filters into the ParquetMetaData struct to enable @alamb's example above. This could be part of the larger refactoring of the metadata (#6129, #6097).

@alamb
Copy link
Contributor Author

alamb commented Oct 8, 2024

I can see two paths forward here. In the near term we could add convenience functions to ParquetMetaDataReader/Writer to allow fetching/writing the bloom filters given an existing ParquetMetaData.

Longer term, I think we'd want to move the bloom filters into the ParquetMetaData struct to enable @alamb's example above. This could be part of the larger refactoring of the metadata (#6129, #6097).

I agree with this breakdown -- I think the first (move the code to read/write bloom filters into ParquetMetaDataReader / ParquetMetaDataWriter) would likely be a step towards the second (and would consolidate the code in a single location), so if someone is interested in trying it that would be neat.

As for a major metadata revamp, that is a bit more disruptive in my mind as it would cause significant downstream churn. 🤔

@progval
Copy link
Contributor

progval commented Oct 9, 2024

@progval noted in #6505 (comment) that BloomFilters are similiar to the PageIndex, but are not currently read/written by the ParquetMetaDataReader

Not that similar though, because Bloom Filters are considerably larger and won't fit in RAM for a large dataset, so they shouldn't be cached (at least not by default)

@alamb alamb changed the title Consider adding BloomFilter reading support to Consider adding BloomFilter reading support to ParquetMetadataReader Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate
Projects
None yet
Development

No branches or pull requests

3 participants