Support reading bloom filters from Parquet files and filter row groups using them #17289

Open
wants to merge 96 commits into
base: branch-25.02

Member

@mhaseeb123 mhaseeb123 commented Nov 9, 2024

Description

This PR adds support for reading bloom filters from Parquet files and using them to filter row groups based on equality predicates of the form col == literal, when such predicates are provided.
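To illustrate the idea behind the description above, here is a minimal host-side sketch of pruning row groups with a per-row-group Bloom filter for a `col == literal` predicate. Everything in it is hypothetical: `simple_bloom_filter` and `surviving_row_groups` are illustrative names, the hash is a generic 64-bit mixer, and the bit layout is a plain Bloom filter rather than the layout stored in Parquet files. It is not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified Bloom filter (illustration only; not the Parquet on-disk layout).
class simple_bloom_filter {
 public:
  explicit simple_bloom_filter(std::size_t num_bits) : _bits(num_bits, false)
  {
    assert(num_bits > 0);
  }

  // Set one bit per hash function for this key.
  void insert(std::uint64_t key)
  {
    for (std::uint64_t seed = 0; seed < num_hashes; ++seed) {
      _bits[hash(key, seed) % _bits.size()] = true;
    }
  }

  // false => the key is definitely absent; true => the key may be present.
  bool may_contain(std::uint64_t key) const
  {
    for (std::uint64_t seed = 0; seed < num_hashes; ++seed) {
      if (!_bits[hash(key, seed) % _bits.size()]) { return false; }
    }
    return true;
  }

 private:
  static constexpr std::uint64_t num_hashes = 3;

  // Generic 64-bit mixer used as a stand-in hash (an assumption of this sketch).
  static std::uint64_t hash(std::uint64_t key, std::uint64_t seed)
  {
    std::uint64_t x = key + 0x9E3779B97F4A7C15ULL * (seed + 1);
    x               = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x               = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
  }

  std::vector<bool> _bits;
};

// Returns the indices of row groups whose filter cannot rule out `literal`.
std::vector<std::size_t> surviving_row_groups(std::vector<simple_bloom_filter> const& filters,
                                              std::uint64_t literal)
{
  std::vector<std::size_t> result;
  for (std::size_t i = 0; i < filters.size(); ++i) {
    if (filters[i].may_contain(literal)) { result.push_back(i); }
  }
  return result;
}
```

A Bloom filter can only prove absence, so a surviving row group still has to be read and filtered exactly; the gain comes from skipping row groups whose filter rules the literal out.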

Related to #17164

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Nov 9, 2024
@mhaseeb123 mhaseeb123 added 2 - In Progress Currently a work in progress cuIO cuIO issue cuco cuCollections related issue feature request New feature or request non-breaking Non-breaking change labels Nov 9, 2024
@vuule vuule requested review from bdice and vuule December 11, 2024 17:55
Contributor

@vuule vuule left a comment

partial review, flushing the small comments I've got so far

cpp/src/io/parquet/reader_impl_helpers.hpp (outdated, resolved)
cpp/src/io/parquet/bloom_filter_reader.cu (outdated, resolved)
cpp/src/io/parquet/bloom_filter_reader.cu (outdated, resolved)
cpp/src/io/parquet/bloom_filter_reader.cu (outdated, resolved)
cpp/src/io/parquet/bloom_filter_reader.cu (outdated, resolved)
Member

@PointKernel PointKernel left a comment

LGTM

ast_operator::NOT, _bloom_filter_expr.push(ast::operation{ast_operator::NOT, value})});
}
// For all other expressions, push an always true expression
else {
Member Author

@karthikeyann @vuule I added this logic to handle any expressions in the filter that are not of the form col == lit. Essentially, it transforms them all into always-true expressions.
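As a rough sketch of the behavior described in this comment, the toy evaluator below answers an equality node with a membership query and treats every other non-logical node as always true, so predicates the bloom filter cannot answer never prune a row group. The `expr` type, the `make_*` helpers, and `row_group_may_match` are hypothetical stand-ins written for this illustration; they are not the libcudf AST classes or the converter in this PR.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical toy expression tree (illustration only; not cudf::ast).
enum class expr_kind { EQUAL, LOGICAL_AND, LOGICAL_OR, OTHER };

struct expr {
  expr_kind kind;
  int column;    // meaningful for EQUAL only
  long literal;  // meaningful for EQUAL only
  std::shared_ptr<expr const> lhs;
  std::shared_ptr<expr const> rhs;
};

std::shared_ptr<expr const> make_equal(int column, long literal)
{
  return std::make_shared<expr>(expr{expr_kind::EQUAL, column, literal, nullptr, nullptr});
}

// Stands for any predicate that is not col == lit (for example col < lit).
std::shared_ptr<expr const> make_other()
{
  return std::make_shared<expr>(expr{expr_kind::OTHER, -1, 0, nullptr, nullptr});
}

std::shared_ptr<expr const> make_binary(expr_kind kind,
                                        std::shared_ptr<expr const> lhs,
                                        std::shared_ptr<expr const> rhs)
{
  return std::make_shared<expr>(expr{kind, -1, 0, std::move(lhs), std::move(rhs)});
}

// Membership query for one row group: "may this column contain this literal?"
using membership_fn = std::function<bool(int column, long literal)>;

// Returns false only when the membership query proves the row group cannot match.
bool row_group_may_match(expr const& e, membership_fn const& may_contain)
{
  switch (e.kind) {
    case expr_kind::EQUAL: return may_contain(e.column, e.literal);
    case expr_kind::LOGICAL_AND:
      assert(e.lhs && e.rhs);
      return row_group_may_match(*e.lhs, may_contain) && row_group_may_match(*e.rhs, may_contain);
    case expr_kind::LOGICAL_OR:
      assert(e.lhs && e.rhs);
      return row_group_may_match(*e.lhs, may_contain) || row_group_may_match(*e.rhs, may_contain);
    case expr_kind::OTHER: return true;  // unsupported predicate: never prune on it
  }
  return true;
}
```

With this fallback, `(col == 8) AND other` can still prune a row group when 8 is ruled out, while `(col == 8) OR other` always keeps it.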

* @brief Collects lists of equality predicate literals in the AST expression, one list per input
* table column. This is used in row group filtering based on bloom filters.
*/
class equality_literals_collector : public ast::detail::expression_transformer {
Member Author

@mhaseeb123 mhaseeb123 Dec 13, 2024

No need for an ast::tree in this expression converter as we only visit and collect literals for col == lit expressions.
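For readers unfamiliar with the pattern, here is a small sketch of what "collect literals for col == lit expressions, one list per column" can look like when no transformed tree is built. The `node` type and the `make_eq`, `make_and`, and `collect_equality_literals` names are hypothetical and exist only for this illustration; the actual `equality_literals_collector` works on libcudf AST expressions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy expression node (illustration only; not cudf::ast).
enum class node_kind { EQUAL, LOGICAL_AND, OTHER };

struct node {
  node_kind kind;
  std::size_t column;  // meaningful for EQUAL only
  long literal;        // meaningful for EQUAL only
  std::shared_ptr<node const> lhs;
  std::shared_ptr<node const> rhs;
};

std::shared_ptr<node const> make_eq(std::size_t column, long literal)
{
  return std::make_shared<node>(node{node_kind::EQUAL, column, literal, nullptr, nullptr});
}

std::shared_ptr<node const> make_and(std::shared_ptr<node const> lhs,
                                     std::shared_ptr<node const> rhs)
{
  return std::make_shared<node>(node{node_kind::LOGICAL_AND, 0, 0, std::move(lhs), std::move(rhs)});
}

// Visit the tree and append each equality literal to the list of its column.
// Nothing is rebuilt: the visitor only reads nodes and records literals.
void collect(node const& n, std::vector<std::vector<long>>& literals)
{
  if (n.kind == node_kind::EQUAL) {
    assert(n.column < literals.size());
    literals[n.column].push_back(n.literal);
    return;
  }
  if (n.lhs) { collect(*n.lhs, literals); }
  if (n.rhs) { collect(*n.rhs, literals); }
}

// Returns one (possibly empty) list of literals per input table column.
std::vector<std::vector<long>> collect_equality_literals(node const& root, std::size_t num_columns)
{
  std::vector<std::vector<long>> literals(num_columns);
  collect(root, literals);
  return literals;
}
```

Each per-column list can then be checked against that column's bloom filter in every row group; columns with an empty list need no bloom filter lookup at all.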

*/
std::reference_wrapper<ast::expression const> visit(ast::literal const& expr) override
{
return expr;
Member Author

No need to push any of these to the ast::tree from the child class bloom_filter_expression_converter either, as these columns and literals don't participate in the transformed expression tree.

Contributor

@vuule vuule left a comment

few small comments, looks good overall :)

cpp/src/io/parquet/bloom_filter_reader.cu (resolved)
cpp/src/io/parquet/bloom_filter_reader.cu (outdated, resolved)
cpp/src/io/parquet/bloom_filter_reader.cu (outdated, resolved)
@mhaseeb123 mhaseeb123 added 4 - Needs Review Waiting for reviewer to review or respond and removed 3 - Ready for Review Ready for review by team labels Dec 14, 2024
Labels
4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue cuco cuCollections related issue cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
5 participants