
Reading multi-source compressed JSONL files #17161

Merged — 40 commits into rapidsai:branch-24.12, Nov 18, 2024

Conversation

@shrshi (Contributor) commented Oct 23, 2024

Description

Fixes #17068
Fixes #12299

This PR introduces a new datasource for compressed inputs, which enables batching and byte-range reading of multi-source JSONL files using the reallocate-and-retry policy. Moreover, instead of using a 4:1 compression-ratio heuristic, the device buffer size is now estimated accurately for the GZIP, ZIP, and SNAPPY compression types. For the remaining types, the files are first decompressed, then batched.

TODO: Reuse existing JSON tests with an additional compression parameter to verify correctness.
Handled by #17219, which implements the compressed JSON writer required for the above tests.
Multi-source compressed input tests added!
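As a rough illustration of the reallocate-and-retry policy described above, here is a host-side sketch with hypothetical names (`try_decompress`, `decompress_with_retry`); the actual datasource works with device memory, but the control flow is the same idea: start from a size estimate and grow the buffer until decompression fits.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Stand-in decompressor: reports failure when the output buffer is too
// small, the way a real decompression call would.
bool try_decompress(std::vector<char> const& src, std::vector<char>& dst,
                    std::size_t true_size) {
  (void)src;                                 // unused in this mock
  if (dst.size() < true_size) return false;  // "output buffer too small"
  dst.assign(true_size, 'x');                // pretend decompressed payload
  return true;
}

// Reallocate-and-retry: start from an estimated size and double the
// buffer on each failure, up to a sanity cap.
std::vector<char> decompress_with_retry(std::vector<char> const& src,
                                        std::size_t estimated_size,
                                        std::size_t true_size) {
  std::vector<char> dst(estimated_size);
  while (!try_decompress(src, dst, true_size)) {
    if (dst.size() >= (std::size_t{1} << 40)) {
      throw std::runtime_error("decompression retry limit exceeded");
    }
    dst.resize(std::max<std::size_t>(1, dst.size() * 2));  // grow and retry
  }
  return dst;
}
```

With an accurate size estimate the loop body never runs, which is why replacing the 4:1 heuristic with exact estimates for GZIP/ZIP/SNAPPY matters: retries cost an allocation plus a repeated decompression pass.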

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Oct 23, 2024
@shrshi shrshi added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 24, 2024
@shrshi shrshi marked this pull request as ready for review October 25, 2024 18:21
@shrshi shrshi requested a review from a team as a code owner October 25, 2024 18:21
Review thread on cpp/src/io/comp/uncomp.cpp (outdated, resolved)
@vuule (Contributor) left a comment

I love the approach!
Some suggestions; the main one is the "memoization" in the new source type.

Four review threads on cpp/src/io/json/read_json.cu (outdated, resolved)
@shrshi (Contributor, Author) commented Oct 31, 2024

Status update: implementing a benchmark for compressed JSONL files in #17219.

@shrshi (Contributor, Author) commented Nov 6, 2024

Plot of read_json throughput for GZIP-compressed JSONL inputs against uncompressed input size, generated using the benchmark from #17219:

[Figure: compressed-gzip-io — throughput vs. uncompressed input size]

Benchmark raw output:

| compression_type | io | data_size | Samples | CPU Time | Noise | GPU Time | Noise | bytes_per_second | peak_memory_usage | encoded_file_size |
|---|---|---|---|---|---|---|---|---|---|---|
| GZIP | FILEPATH | 2^20 = 1048576 | 768x | 18.205 ms | 1.88% | 18.196 ms | 1.88% | 2.95E+10 | 13.458 MiB | 386.724 KiB |
| GZIP | FILEPATH | 2^21 = 2097152 | 617x | 24.215 ms | 3.04% | 24.206 ms | 3.04% | 2.22E+10 | 26.707 MiB | 768.517 KiB |
| GZIP | FILEPATH | 2^22 = 4194304 | 418x | 35.741 ms | 2.15% | 35.732 ms | 2.15% | 1.5E+10 | 53.015 MiB | 1.492 MiB |
| GZIP | FILEPATH | 2^23 = 8388608 | 251x | 59.739 ms | 1.60% | 59.731 ms | 1.60% | 8.99E+09 | 106.061 MiB | 2.984 MiB |
| GZIP | FILEPATH | 2^24 = 16777216 | 120x | 125.038 ms | 0.96% | 125.029 ms | 0.96% | 4.29E+09 | 212.568 MiB | 5.981 MiB |
| GZIP | FILEPATH | 2^25 = 33554432 | 5x | 244.733 ms | 0.37% | 244.726 ms | 0.37% | 2.19E+09 | 425.201 MiB | 11.965 MiB |
| GZIP | FILEPATH | 2^26 = 67108864 | 5x | 481.785 ms | 0.39% | 481.781 ms | 0.39% | 1.11E+09 | 849.693 MiB | 23.925 MiB |
| GZIP | FILEPATH | 2^27 = 134217728 | 5x | 939.229 ms | 0.23% | 939.231 ms | 0.23% | 5.72E+08 | 1.658 GiB | 47.797 MiB |
| GZIP | FILEPATH | 2^28 = 268435456 | 5x | 1.923 s | 0.28% | 1.923 s | 0.28% | 2.79E+08 | 3.318 GiB | 95.642 MiB |
| GZIP | FILEPATH | 2^29 = 536870912 | 4x | 3.885 s | inf% | 3.885 s | inf% | 1.38E+08 | 6.638 GiB | 191.359 MiB |

Without the custom datasource for compressed inputs, the parser errors out for all of the above data sizes. I suspect that the compression-ratio heuristic in the previous implementation falls short without the retry policy, but more investigation is needed to understand the failure, especially for the smaller data sizes.
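For context on how the device buffer size can be estimated exactly for GZIP: a GZIP stream ends with a 4-byte little-endian ISIZE field holding the uncompressed size modulo 2^32 (RFC 1952), so for files under 4 GiB the trailer gives the precise output size. A minimal sketch (hypothetical helper name, not cudf code):

```cpp
#include <cstdint>
#include <vector>

// Read the ISIZE trailer of a GZIP file: the last 4 bytes store the
// uncompressed size modulo 2^32, little-endian (RFC 1952).
std::uint32_t gzip_uncompressed_size(std::vector<std::uint8_t> const& file) {
  if (file.size() < 4) return 0;  // not a valid GZIP stream
  auto const n = file.size();
  return static_cast<std::uint32_t>(file[n - 4]) |
         (static_cast<std::uint32_t>(file[n - 3]) << 8) |
         (static_cast<std::uint32_t>(file[n - 2]) << 16) |
         (static_cast<std::uint32_t>(file[n - 1]) << 24);
}
```

For concatenated multi-member GZIP files each member carries its own trailer, so the members have to be walked individually; the sketch above handles only the single-member case.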

@shrshi (Contributor, Author) commented Nov 12, 2024

> Thoughts on comparing the performance of the new feature against reading files individually and concatenating the results?
> Not required before merging, but we should know this before we can recommend that users pass compressed files in batches.

I would expect a performance improvement since we need not run the tokenizer + tree algorithms repeatedly for each compressed file (up to a ~2GB total uncompressed source size limit per batch). I can modify the benchmark in #17219 to measure the exact impact.
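The batching mentioned above can be sketched as a greedy grouping of sources under a per-batch uncompressed-size limit, so the tokenizer and tree algorithms run once per batch rather than once per file. This is an illustrative sketch with made-up names, not the actual cudf implementation:

```cpp
#include <cstddef>
#include <vector>

// Greedily pack source indices into batches whose total (uncompressed)
// size stays within batch_limit; a source larger than the limit gets a
// batch of its own.
std::vector<std::vector<std::size_t>> make_batches(
    std::vector<std::size_t> const& source_sizes, std::size_t batch_limit) {
  std::vector<std::vector<std::size_t>> batches;
  std::size_t current_total = 0;
  for (std::size_t i = 0; i < source_sizes.size(); ++i) {
    if (batches.empty() || current_total + source_sizes[i] > batch_limit) {
      batches.push_back({});  // start a new batch
      current_total = 0;
    }
    batches.back().push_back(i);
    current_total += source_sizes[i];
  }
  return batches;
}
```

With this shape, the per-batch parse amortizes the fixed tokenizer/tree cost across every file in the batch, which is where the expected improvement over file-at-a-time reading plus concatenation comes from.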

@shrshi shrshi requested a review from a team as a code owner November 13, 2024 07:19
@github-actions github-actions bot added the CMake CMake build issue label Nov 13, 2024
@vuule (Contributor) left a comment

Looks good!
Just some mild confusion about the test code changes.

Review threads on cpp/src/io/comp/comp.hpp and cpp/src/io/comp/io_uncomp.hpp (outdated, resolved)
Contributor:

wouldn't mind merging this file and comp.hpp. Definitely not required in this PR :)

Contributor Author:

Will open a follow-on PR for this

Review thread on cpp/tests/io/json/json_utils.cuh (resolved)
```diff
@@ -980,27 +980,27 @@ __device__ int parse_gzip_header(uint8_t const* src, size_t src_size)
 {
   uint8_t flags = src[3];
   hdr_len       = 10;
-  if (flags & GZIPHeaderFlag::fextra)  // Extra fields present
+  if (flags & detail::GZIPHeaderFlag::fextra)  // Extra fields present
```
Contributor:

looks like another file missing namespace detail? Not for this PR :D

Contributor Author:

Ah yep, it looks like all the header files in src/io/comp need to be moved to the detail namespace:

brotli_dict.hpp
unbz2.hpp
nvcomp_adapter.cuh (?)
nvcomp_adapter.hpp (?)
gpuinflate.hpp
brotli_tables.hpp

Haha yes, will defer this to a follow-on PR :D

@KyleFromNVIDIA (Contributor) left a comment

Approved trivial CMake changes

@shrshi (Contributor, Author) commented Nov 18, 2024

/merge

@rapids-bot rapids-bot bot merged commit 03ac845 into rapidsai:branch-24.12 Nov 18, 2024
102 checks passed
rapids-bot bot pushed a commit that referenced this pull request Nov 19, 2024
Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`)

Adds support for writing GZIP- and SNAPPY-compressed JSON to the JSON writer.
Verifies correctness using a parameterized test in `tests/io/json/json_writer.cpp`

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Kyle Edwards (https://github.com/KyleFromNVIDIA)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #17323
rapids-bot bot pushed a commit that referenced this pull request Nov 20, 2024
Depends on #17161 for implementations of compression and decompression functions (`io/comp/comp.cu`, `io/comp/comp.hpp`, `io/comp/io_uncomp.hpp` and `io/comp/uncomp.cpp`)
Depends on #17323 for compressed JSON writer implementation.

Adds benchmark to measure performance of the JSON reader for compressed inputs.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - MithunR (https://github.com/mythrocks)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17219