
Reading multi-source compressed JSONL files #17161

Merged — 40 commits into rapidsai:branch-24.12, Nov 18, 2024

Conversation

@shrshi (Contributor) commented Oct 23, 2024

Description

Fixes #17068
Fixes #12299

This PR introduces a new datasource for compressed inputs, which enables batching and byte-range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover, instead of using a 4:1 compression-ratio heuristic, the device buffer size is now estimated accurately for the GZIP, ZIP, and SNAPPY compression types. For the remaining types, the files are first decompressed, then batched.

TODO: Reuse existing JSON tests with an additional compression parameter to verify correctness.
Handled by #17219, which implements the compressed JSON writer required for the above tests.
Multi-source compressed input tests added!
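As a rough illustration of the reallocate-and-retry policy described above, here is a host-side sketch with hypothetical names (`try_decompress`, `decompress_with_retry`); the actual datasource works with device memory, but the control flow is the same idea: start from a size estimate and grow the buffer until decompression fits.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in decompressor: reports failure when the output buffer is too
// small, the way a real decompression call would.
bool try_decompress(std::vector<char> const& src, std::vector<char>& dst,
                    std::size_t true_size) {
  (void)src;                                 // unused in this mock
  if (dst.size() < true_size) return false;  // "output buffer too small"
  dst.assign(true_size, 'x');                // pretend decompressed payload
  return true;
}

// Reallocate-and-retry: start from an estimated size and double the
// buffer on each failure, up to a sanity cap.
std::vector<char> decompress_with_retry(std::vector<char> const& src,
                                        std::size_t estimated_size,
                                        std::size_t true_size) {
  std::vector<char> dst(estimated_size);
  while (!try_decompress(src, dst, true_size)) {
    if (dst.size() >= (std::size_t{1} << 40)) {
      throw std::runtime_error("decompression retry limit exceeded");
    }
    dst.resize(std::max<std::size_t>(1, dst.size() * 2));  // grow and retry
  }
  return dst;
}
```

With an accurate size estimate the loop body never runs, which is why replacing the 4:1 heuristic with exact estimates for GZIP/ZIP/SNAPPY matters: retries cost an allocation plus a repeated decompression pass.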

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 23, 2024
@shrshi shrshi added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 24, 2024
@shrshi shrshi marked this pull request as ready for review October 25, 2024 18:21
@shrshi shrshi requested a review from a team as a code owner October 25, 2024 18:21
Review thread on cpp/src/io/comp/uncomp.cpp (outdated, resolved)
@vuule (Contributor) left a comment

I love the approach!
Some suggestions; the main one is the "memoization" in the new source type.

Four review threads on cpp/src/io/json/read_json.cu (outdated, resolved)
@shrshi (Contributor, Author) commented Oct 31, 2024

Status update: implementing a benchmark for compressed JSONL files in #17219.

@shrshi (Contributor, Author) commented Nov 6, 2024

Plot of read_json throughput for GZIP-compressed JSONL inputs against uncompressed input size, generated using the benchmark from #17219:

[Figure: compressed-gzip-io — throughput vs. uncompressed input size]

Benchmark raw output:

| compression_type | io | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|
| GZIP | FILEPATH | 2^20 = 1048576 | 768x | 18.205 ms | 1.88% | 18.196 ms | 1.88% | 2.95E+10 | 13.458 MiB | 386.724 KiB |
| GZIP | FILEPATH | 2^21 = 2097152 | 617x | 24.215 ms | 3.04% | 24.206 ms | 3.04% | 2.22E+10 | 26.707 MiB | 768.517 KiB |
| GZIP | FILEPATH | 2^22 = 4194304 | 418x | 35.741 ms | 2.15% | 35.732 ms | 2.15% | 1.5E+10 | 53.015 MiB | 1.492 MiB |
| GZIP | FILEPATH | 2^23 = 8388608 | 251x | 59.739 ms | 1.60% | 59.731 ms | 1.60% | 8.99E+09 | 106.061 MiB | 2.984 MiB |
| GZIP | FILEPATH | 2^24 = 16777216 | 120x | 125.038 ms | 0.96% | 125.029 ms | 0.96% | 4.29E+09 | 212.568 MiB | 5.981 MiB |
| GZIP | FILEPATH | 2^25 = 33554432 | 5x | 244.733 ms | 0.37% | 244.726 ms | 0.37% | 2.19E+09 | 425.201 MiB | 11.965 MiB |
| GZIP | FILEPATH | 2^26 = 67108864 | 5x | 481.785 ms | 0.39% | 481.781 ms | 0.39% | 1.11E+09 | 849.693 MiB | 23.925 MiB |
| GZIP | FILEPATH | 2^27 = 134217728 | 5x | 939.229 ms | 0.23% | 939.231 ms | 0.23% | 5.72E+08 | 1.658 GiB | 47.797 MiB |
| GZIP | FILEPATH | 2^28 = 268435456 | 5x | 1.923 s | 0.28% | 1.923 s | 0.28% | 2.79E+08 | 3.318 GiB | 95.642 MiB |
| GZIP | FILEPATH | 2^29 = 536870912 | 4x | 3.885 s | inf% | 3.885 s | inf% | 1.38E+08 | 6.638 GiB | 191.359 MiB |

Without the custom datasource for compressed inputs, the parser errors out for all of the above data sizes. I suspect that the compression-ratio heuristic in the previous implementation falls short without the retry policy, but more investigation is needed to understand the failure, especially for the smaller data sizes.
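For context on how the device buffer size can be estimated exactly for GZIP: a GZIP stream ends with a 4-byte little-endian ISIZE field holding the uncompressed size modulo 2^32 (RFC 1952), so for files under 4 GiB the trailer gives the precise output size. A minimal sketch (hypothetical helper name, not cudf code):

```cpp
#include <cstdint>
#include <vector>

// Read the ISIZE trailer of a GZIP file: the last 4 bytes store the
// uncompressed size modulo 2^32, little-endian (RFC 1952).
std::uint32_t gzip_uncompressed_size(std::vector<std::uint8_t> const& file) {
  if (file.size() < 4) return 0;  // not a valid GZIP stream
  auto const n = file.size();
  return static_cast<std::uint32_t>(file[n - 4]) |
         (static_cast<std::uint32_t>(file[n - 3]) << 8) |
         (static_cast<std::uint32_t>(file[n - 2]) << 16) |
         (static_cast<std::uint32_t>(file[n - 1]) << 24);
}
```

For concatenated multi-member GZIP files each member carries its own trailer, so the members have to be walked individually; the sketch above handles only the single-member case.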

@shrshi (Contributor, Author) commented Nov 12, 2024

> Thoughts on comparing the performance of the new feature against reading files individually and concatenating the results?
> Not required before merging, but we should know this before we can recommend that users pass compressed files in batches.

I would expect a performance improvement since we need not run the tokenizer + tree algorithms repeatedly for each compressed file (up to a ~2GB total uncompressed source size limit per batch). I can modify the benchmark in #17219 to measure the exact impact.
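The batching mentioned above can be sketched as a greedy grouping of sources under a per-batch uncompressed-size limit, so the tokenizer and tree algorithms run once per batch rather than once per file. This is an illustrative sketch with made-up names, not the actual cudf implementation:

```cpp
#include <cstddef>
#include <vector>

// Greedily pack source indices into batches whose total (uncompressed)
// size stays within batch_limit; a source larger than the limit gets a
// batch of its own.
std::vector<std::vector<std::size_t>> make_batches(
    std::vector<std::size_t> const& source_sizes, std::size_t batch_limit) {
  std::vector<std::vector<std::size_t>> batches;
  std::size_t current_total = 0;
  for (std::size_t i = 0; i < source_sizes.size(); ++i) {
    if (batches.empty() || current_total + source_sizes[i] > batch_limit) {
      batches.push_back({});  // start a new batch
      current_total = 0;
    }
    batches.back().push_back(i);
    current_total += source_sizes[i];
  }
  return batches;
}
```

With this shape, the per-batch parse amortizes the fixed tokenizer/tree cost across every file in the batch, which is where the expected improvement over file-at-a-time reading plus concatenation comes from.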

@shrshi shrshi requested a review from a team as a code owner November 13, 2024 07:19
@github-actions github-actions bot added the CMake CMake build issue label Nov 13, 2024
@vuule (Contributor) left a comment

Looks good!
Just some mild confusion about the test code changes.

Review threads on cpp/src/io/comp/comp.hpp and cpp/src/io/comp/io_uncomp.hpp (outdated, resolved)
Contributor:

wouldn't mind merging this file and comp.hpp. Definitely not required in this PR :)

Contributor Author:

Will open a follow-on PR for this

Review thread on cpp/tests/io/json/json_utils.cuh (resolved)
```diff
@@ -980,27 +980,27 @@ __device__ int parse_gzip_header(uint8_t const* src, size_t src_size)
 {
   uint8_t flags = src[3];
   hdr_len       = 10;
-  if (flags & GZIPHeaderFlag::fextra)  // Extra fields present
+  if (flags & detail::GZIPHeaderFlag::fextra)  // Extra fields present
```
Contributor:

looks like another file missing namespace detail? Not for this PR :D

Contributor Author:

Ah yep, it looks like all the header files in src/io/comp need to be moved to the detail namespace:

brotli_dict.hpp
unbz2.hpp
nvcomp_adapter.cuh (?)
nvcomp_adapter.hpp (?)
gpuinflate.hpp
brotli_tables.hpp

Haha yes, will defer this to a follow-on PR :D

@KyleFromNVIDIA (Contributor) left a comment

Approved trivial CMake changes

@shrshi (Contributor, Author) commented Nov 18, 2024

/merge

@rapids-bot rapids-bot bot merged commit 03ac845 into rapidsai:branch-24.12 Nov 18, 2024
102 checks passed
rapids-bot bot pushed a commit that referenced this pull request Nov 19, 2024
Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`)

Adds support for writing GZIP- and SNAPPY-compressed JSON to the JSON writer.
Verifies correctness using a parameterized test in `tests/io/json/json_writer.cpp`

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #17323
rapids-bot bot pushed a commit that referenced this pull request Nov 20, 2024
Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`)
Depends on #17323 for compressed JSON writer implementation.

Adds benchmark to measure performance of the JSON reader for compressed inputs.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17219