Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DELTA_BINARY_PACKED encoder for Parquet writer #14100

Merged
merged 62 commits into from
Oct 20, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Sep 13, 2023

Description

Part of #13501. Adds ability to fall back on DELTA_BINARY_PACKED encoding when V2 page headers are selected and dictionary encoding is not possible.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 13, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 13, 2023
@etseidl etseidl marked this pull request as ready for review September 14, 2023 00:00
@etseidl etseidl requested a review from a team as a code owner September 14, 2023 00:00
@etseidl etseidl requested review from vyasr and divyegala September 14, 2023 00:00
@PointKernel PointKernel added cuIO cuIO issue 4 - Needs cuIO Reviewer non-breaking Non-breaking change feature request New feature or request labels Sep 18, 2023
@PointKernel
Copy link
Member

/ok to test

constexpr int values_per_mini_block = block_size / num_mini_blocks;
constexpr int buffer_size = 2 * block_size;

// extra sanity checks to enforce compliance with the parquet specification
Copy link
Contributor

@ttnghia ttnghia Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: It is cleaner to have any sentence (including comments) begins with a capitalized letter and ends with a period. Sorry if I'm nitpicking too much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries...hitting caps is an extra keystroke and I'm lazy 😅

Comment on lines 133 to 138
uint8_t* _dst; // sink to dump encoded values to
size_type _current_idx; // index of first value in buffer
uint32_t _num_values; // total number of values to encode
size_type _values_in_buffer; // current number of values stored in _buffer
T* _buffer; // buffer to store values to be encoded
uint8_t _mb_bits[delta::num_mini_blocks]; // bitwidth for each mini-block
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you group pointers and values please?

Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I just left a few comments about coding style.

@vuule vuule requested a review from ttnghia October 18, 2023 21:20
@ttnghia
Copy link
Contributor

ttnghia commented Oct 18, 2023

The tests here seem to do round trip write/read. Can we have file written somewhere else that can be tested here? I'm not sure if we can do that in C++ code or have to do it in python?

@etseidl
Copy link
Contributor Author

etseidl commented Oct 18, 2023

The tests here seem to do round trip write/read. Can we have file written somewhere else that can be tested here? I'm not sure if we can do that in C++ code or have to do it in python?

So the inverse of test_delta_binary in test_parquet.py, which writes with pandas and reads with cudf? So write with cudf and read with pyarrow or pandas? The issue is we don't yet expose the options necessary to write V2 with python. Can we use the arrow compatibility stuff? So on the C++ side, write with cudf, then read with arrow-cpp, convert arrow to cudf and compare?

@vuule
Copy link
Contributor

vuule commented Oct 18, 2023

The tests here seem to do round trip write/read. Can we have file written somewhere else that can be tested here? I'm not sure if we can do that in C++ code or have to do it in python?

So the inverse of test_delta_binary in test_parquet.py, which writes with pandas and reads with cudf? So write with cudf and read with pyarrow or pandas? The issue is we don't yet expose the options necessary to write V2 with python. Can we use the arrow compatibility stuff? So on the C++ side, write with cudf, then read with arrow-cpp, convert arrow to cudf and compare?

FWIW, we do have reader tests for this encoding that does not depend on the cudf writer. I guess it does not prove that reader only reads correctly encoded data, so writer could write something only cudf can read. Might be best to add the V2 API to Python...

@ttnghia
Copy link
Contributor

ttnghia commented Oct 19, 2023

Might be best to add the V2 API to Python...

Agree. The test can be added here or either in a separate PR.

@ttnghia
Copy link
Contributor

ttnghia commented Oct 20, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Oct 20, 2023

/merge

@rapids-bot rapids-bot bot merged commit f7ad66f into rapidsai:branch-23.12 Oct 20, 2023
54 checks passed
@etseidl etseidl deleted the encode_delta_binary branch October 20, 2023 16:45
rapids-bot bot pushed a commit that referenced this pull request Nov 8, 2023
During the review of #14100 there was a suggestion to add a test of writing using cudf and then reading the resulting file back with pyarrow. This PR adds the necessary python bindings to perform this test.

NOTE: there is currently an issue with encoding 32-bit values where the deltas exceed 32-bits. parquet-mr and arrow truncate the deltas for the INT32 physical type and allow values to overflow, whereas cudf currently uses 64-bit deltas, which avoids the overflow, but can result in requiring 33-bits when encoding. The current cudf behavior is allowed by the specification (and in fact is readable by parquet-mr), but using the extra bit is not in the Parquet spirit of least output file size. This will be addressed in follow-on work.

Authors:
  - Ed Seidl (https://github.com/etseidl)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - MithunR (https://github.com/mythrocks)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #14316
@vyasr vyasr added 4 - Needs Review Waiting for reviewer to review or respond and removed 4 - Needs cuIO Reviewer labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Needs Review Waiting for reviewer to review or respond cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants