Reading multi-source compressed JSONL files #17161
Conversation
I love the approach!
Some suggestions; the main one is the "memoization" in the new source type.
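For illustration only, here is a minimal, self-contained sketch of what such memoization in a compressed source type could look like: decompress once on first access and serve later byte-range reads from the cached bytes. The class name and the `decompress_all` callback are made-up placeholders, not the PR's actual API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch: decompress the underlying source once, keep the bytes,
// and serve all later byte-range reads from the cached buffer.
class memoized_compressed_source {
 public:
  // `decompress_all` stands in for whatever routine inflates the raw source.
  explicit memoized_compressed_source(std::function<std::vector<uint8_t>()> decompress_all)
    : decompress_all_{std::move(decompress_all)}
  {
  }

  // Copy up to `size` bytes starting at `offset` into `dst`; decompression
  // runs at most once across all calls.
  size_t read(size_t offset, size_t size, uint8_t* dst)
  {
    auto const& data = cache();
    if (offset >= data.size()) { return 0; }
    size = std::min(size, data.size() - offset);
    std::copy_n(data.begin() + offset, size, dst);
    return size;
  }

  size_t size() { return cache().size(); }

 private:
  std::vector<uint8_t> const& cache()
  {
    if (!cached_) {
      data_   = decompress_all_();
      cached_ = true;
    }
    return data_;
  }

  std::function<std::vector<uint8_t>()> decompress_all_;
  std::vector<uint8_t> data_;
  bool cached_ = false;
};
```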
Status update: Implementing benchmark for compressed JSONL files in #17219

[Plot of benchmark throughput and raw benchmark output omitted.]

Without the custom data source for compressed inputs, the parser errors out for all of the above data sizes. I suspect that the heuristic for compressed inputs in the previous implementation falls short without the retry policy, but some more investigation is required to understand the failure, especially for the smaller data sizes.

I would expect a performance improvement since we need not run the tokenizer + tree algorithms repeatedly for each compressed file (up to a ~2GB total uncompressed source size limit per batch). I can modify the benchmark in #17219 to measure the exact impact.
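For context, this is a rough sketch of the reallocate-and-retry idea discussed above: start from an estimated output size and grow the destination buffer until the decompressor reports that the whole output fits. The `try_decompress` callback and the doubling strategy are assumptions for illustration, not the PR's actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// `try_decompress` stands in for the actual decompressor: it should return
// true (and set `written`) when the whole output fits into `dst`, and false
// when the destination buffer was too small.
std::vector<uint8_t> decompress_with_retry(
  std::vector<uint8_t> const& src,
  size_t estimated_size,
  std::function<bool(std::vector<uint8_t> const&, std::vector<uint8_t>&, size_t&)> const& try_decompress)
{
  std::vector<uint8_t> dst(std::max<size_t>(estimated_size, 1));
  size_t written = 0;
  while (!try_decompress(src, dst, written)) {
    dst.resize(dst.size() * 2);  // reallocate a larger output buffer and retry
  }
  dst.resize(written);  // trim to the actual decompressed size
  return dst;
}
```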
Looks good!
Just some mild confusion about the test code changes.
wouldn't mind merging this file and comp.hpp. Definitely not required in this PR :)
Will open a follow-on PR for this
@@ -980,27 +980,27 @@ __device__ int parse_gzip_header(uint8_t const* src, size_t src_size)
 {
   uint8_t flags = src[3];
   hdr_len = 10;
-  if (flags & GZIPHeaderFlag::fextra) // Extra fields present
+  if (flags & detail::GZIPHeaderFlag::fextra) // Extra fields present
looks like another file missing namespace detail? Not for this PR :D
Ah yep, it looks like all the header files in src/io/comp need to be moved to the detail namespace -
- brotli_dict.hpp
- unbz2.hpp
- nvcomp_adapter.cuh (?)
- nvcomp_adapter.hpp (?)
- gpuinflate.hpp
- brotli_tables.hpp
Haha yes, will defer this to a follow-on PR :D
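For reference, moving those declarations would look roughly like the snippet below, mirroring the `detail::GZIPHeaderFlag` change in the diff above. The flag values follow RFC 1952, but the exact declaration here is illustrative rather than copied from the headers.

```cpp
#include <cstdint>

// Sketch only: wrap the existing declarations in the detail namespace.
namespace cudf::io::detail {

struct GZIPHeaderFlag {
  static constexpr uint8_t ftext    = 0x01;  // file is probably ASCII text
  static constexpr uint8_t fhcrc    = 0x02;  // header CRC16 present
  static constexpr uint8_t fextra   = 0x04;  // extra fields present
  static constexpr uint8_t fname    = 0x08;  // original file name present
  static constexpr uint8_t fcomment = 0x10;  // comment present
};

}  // namespace cudf::io::detail
```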
Approved trivial CMake changes
/merge
Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`).

Adds support for writing GZIP- and SNAPPY-compressed JSON to the JSON writer. Verifies correctness using a parameterized test in `tests/io/json/json_writer.cpp`.

Authors:
- Shruti Shivakumar (https://github.com/shrshi)
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Karthikeyan (https://github.com/karthikeyann)
- Vukasin Milovanovic (https://github.com/vuule)

URL: #17323
Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`). Depends on #17323 for the compressed JSON writer implementation.

Adds a benchmark to measure performance of the JSON reader for compressed inputs.

Authors:
- Shruti Shivakumar (https://github.com/shrshi)
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- MithunR (https://github.com/mythrocks)
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)
- Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17219
Description
Fixes #17068
Fixes #12299
This PR introduces a new datasource for compressed inputs which enables batching and byte range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover, instead of using a 4:1 compression ratio heuristic, the device buffer size is estimated accurately for GZIP, ZIP, and SNAPPY compression types. For the remaining compression types, the files are first decompressed, then batched.
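As a rough illustration of the accurate size estimation mentioned above (not the PR's actual code): GZIP stores the uncompressed size modulo 2^32 in its 4-byte ISIZE footer, and raw SNAPPY stores it as a little-endian varint preamble, so both can be read directly from the compressed bytes instead of guessing with a fixed ratio. Function names below are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Exact uncompressed size of a single-member GZIP stream: the last four bytes
// (ISIZE footer) hold the size modulo 2^32, little-endian.
size_t gzip_uncompressed_size(std::vector<uint8_t> const& src)
{
  if (src.size() < 4) { return 0; }
  auto const* p = src.data() + src.size() - 4;
  return static_cast<size_t>(p[0]) | (static_cast<size_t>(p[1]) << 8) |
         (static_cast<size_t>(p[2]) << 16) | (static_cast<size_t>(p[3]) << 24);
}

// Exact uncompressed size of a raw SNAPPY block: the stream starts with a
// varint preamble, seven size bits per byte, MSB set means "more bytes follow".
size_t snappy_uncompressed_size(std::vector<uint8_t> const& src)
{
  size_t size = 0;
  int shift   = 0;
  for (auto byte : src) {
    size |= static_cast<size_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) { break; }
    shift += 7;
  }
  return size;
}
```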
~~TODO: Reuse existing JSON tests but with an additional compression parameter to verify correctness.~~ Handled by #17219, which implements the compressed JSON writer required for the above test. Multi-source compressed input tests added!
Checklist