Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add decoder for DELTA_BYTE_ARRAY to Parquet reader #14101

Merged
merged 88 commits into from
Nov 16, 2023

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Sep 13, 2023

Description

Part of #13501. Adds ability to decode DELTA_BYTE_ARRAY encoded pages.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 13, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 13, 2023
@etseidl
Copy link
Contributor Author

etseidl commented Sep 15, 2023

This one cannot be tested with python in same way the DELTA_BINARY_PACKED decoder is because pyarrow doesn't yet support writing DELTA_BYTE_ARRAY (as of version 13.0, it will be in 14.0). The only option at this point is to add yet another small parquet file to the python test data directory, but that's not a very thorough test.

@etseidl etseidl changed the base branch from branch-23.10 to branch-23.12 September 27, 2023 16:49
@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 29, 2023
@etseidl etseidl marked this pull request as ready for review September 29, 2023 22:37
@etseidl etseidl requested a review from a team as a code owner September 29, 2023 22:37
cpp/src/io/parquet/page_data.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Show resolved Hide resolved
cpp/src/io/parquet/page_string_decode.cu Show resolved Hide resolved
cpp/src/io/parquet/page_delta_decode.cu Show resolved Hide resolved
cpp/src/io/parquet/page_string_decode.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/delta_binary.cuh Show resolved Hide resolved
cpp/src/io/parquet/page_delta_decode.cu Show resolved Hide resolved
@etseidl etseidl requested a review from nvdbaranec November 13, 2023 18:10
@vuule
Copy link
Contributor

vuule commented Nov 14, 2023

Added some more tests now that we've moved to arrow 14. Should we think about reverting #11480 so we can test skip rows, or just wait for the encoder? @vuule

We probably should, since we already maintain the whole feature in libcudf. I'll let you know after an internal discussion.

@vuule vuule self-assigned this Nov 14, 2023
@vuule
Copy link
Contributor

vuule commented Nov 14, 2023

/ok to test

cpp/src/io/parquet/page_delta_decode.cu Show resolved Hide resolved
} else { // warp 3
target_pos = min(s->nz_count, src_pos + batch_size);
}
__syncthreads();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this sync necessary? There's one already at the bottom of the loop and there's nothing here that seems to require synchronization. I just looked at the related loop in page_data.cu and we're doing it there as well. Hmm. This feels very vestigial.

Copy link
Contributor Author

@etseidl etseidl Nov 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never thought about it, but I think you're right. Maybe back in the day target_pos was shared or something. I'll take a look at all the decoders.

Edit: Did some archaeology and that sync has been there since the decoder was reworked in 2019, and near as I can tell it wasn't needed then either. I've removed them from the 4 current decoders.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, let's leave them in for now and remove them in a separate PR. I'd rather let it soak for a full milestone (until 24.02) to make sure there's not something subtle here that we're missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I'll put them back

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done...and added some TODOs

cpp/src/io/parquet/page_string_decode.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_string_decode.cu Outdated Show resolved Hide resolved
@vuule
Copy link
Contributor

vuule commented Nov 15, 2023

/ok to test

@vuule
Copy link
Contributor

vuule commented Nov 16, 2023

/ok to test

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, left an optional comment here: https://github.com/rapidsai/cudf/pull/14101/files#r1396338749

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Nov 16, 2023
@vuule
Copy link
Contributor

vuule commented Nov 16, 2023

/merge

@rapids-bot rapids-bot bot merged commit bf63d10 into rapidsai:branch-23.12 Nov 16, 2023
65 checks passed
@etseidl etseidl deleted the decode_delta_byte_arr branch November 16, 2023 23:09
etseidl added a commit to etseidl/cudf that referenced this pull request Dec 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants