Support for progressive parquet chunked reading. #14079
Conversation
…amount of memory used for decompression and other scratch space when decoding, which causes the reader to make multiple 'passes' over the set of row groups to be read. Signed-off-by: db <[email protected]>
…g a vector instead of referencing it.
a519c12 to 3772e7a
_impl = std::make_unique<impl>(
  chunk_read_limit, pass_read_limit, std::move(sources), options, stream, mr);
Can we move this into the initializer list of this constructor?
You can't initialize a parent class's member in your own member initializer list.
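For context, a constructor's member initializer list can only name its own class's direct members and bases, so a base class's member such as `_impl` has to be assigned in the constructor body. A minimal sketch with made-up names (not the actual reader types):

```cpp
#include <memory>

struct base {
  std::unique_ptr<int> _impl;  // member belongs to the parent (base) class
};

struct derived : base {
  // derived() : _impl{std::make_unique<int>(1)} {}  // error: _impl is not a member
  //                                                 // (or base) of 'derived'
  derived() { _impl = std::make_unique<int>(1); }    // must assign in the body instead
};
```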
void load_global_chunk_info();
void compute_input_pass_row_group_info();
void setup_pass();
I've removed a lot of member functions in our previous work on the chunked reader, because they can just be free functions. So if possible, please remove these declarations and make them free functions. Having them as member functions means you have to maintain their signatures here.
If these were made free functions they would all have to be passed large numbers of parameters, which would make the code a lot harder to read.
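A hypothetical sketch of the trade-off (illustrative names only, not the actual reader internals): as a member function the pass-setup logic reaches the reader's state through `this`, while the free-function form has to take every piece of that state as an explicit parameter:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-ins for the reader's internal state.
struct row_group_info { std::size_t size_bytes; };
struct pass_intermediate_data { std::vector<row_group_info> row_groups; };

struct reader_impl {
  // Member form: implicit access to all reader state via `this`.
  void setup_pass();

 private:
  pass_intermediate_data _pass;
  std::vector<row_group_info> _global_row_groups;
  std::size_t _pass_read_limit = 0;
  std::size_t _next_row_group  = 0;
  // ... many more fields in the real reader
};

// Free-function form: every field the pass logic touches becomes a parameter,
// and the signature grows every time the logic needs something new.
void setup_pass(pass_intermediate_data& pass,
                std::vector<row_group_info> const& global_row_groups,
                std::size_t pass_read_limit,
                std::size_t& next_row_group);
```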
review.flush()
Some comments on a first pass. I'll make another one tomorrow with a clear head.
*
* @param chunk_read_limit Limit on total number of bytes to be returned per read,
* or `0` if there is no limit
* @param pass_read_limit Limit on the amount of memory used for reading and decompressing data or
Above it says decompression limit, but here it says decompressing and reading. I am also concerned by the soft limit. This seems like the kind of thing that should be a hard limit. Should it explode when it goes over the limit, or is the OOM the explosion, and is that why it is considered a soft limit?
This is a soft limit because it will attempt to continue even if it can't meet the limit. For example, if the user specified 1 MB and it can't fit even one row group (say 50 MB) into that size, it will still attempt to read/decompress one row group at a time.
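A rough sketch of that soft-limit behavior (illustrative only, not the actual implementation): row groups are packed greedily into passes, but every pass accepts at least one row group even if that single row group already exceeds the limit.

```cpp
#include <cstddef>
#include <vector>

// Greedily split row-group sizes into passes under a soft memory limit.
// A limit of 0 means "no limit": everything goes into a single pass.
std::vector<std::vector<std::size_t>> split_into_passes(
  std::vector<std::size_t> const& row_group_sizes, std::size_t pass_read_limit)
{
  std::vector<std::vector<std::size_t>> passes;
  std::vector<std::size_t> current;
  std::size_t current_bytes = 0;

  for (auto size : row_group_sizes) {
    if (pass_read_limit != 0 && !current.empty() &&
        current_bytes + size > pass_read_limit) {
      passes.push_back(std::move(current));
      current       = {};
      current_bytes = 0;
    }
    // A pass always takes at least one row group, even one larger than the
    // limit -- which is what makes the limit "soft" rather than hard.
    current.push_back(size);
    current_bytes += size;
  }
  if (!current.empty()) { passes.push_back(std::move(current)); }
  return passes;
}
```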
Another flush, mostly comments coming from difficulty understanding the code.
Still got some parts to dig into.
…hostdevice_vector as it was unnecessary.
A few more small suggestions; looks good overall.
I think the biggest issue for me is the fragility of the test, which we discussed offline and found a solution for.
…encoding to keep the hardcoded uncompressed size predictable.
/merge
Previously, the parquet chunked reader operated by controlling the size of output chunks only. It would still ingest the entire input file and decompress it, which can take up a considerable amount of memory. With this new 'progressive' support, we also 'chunk' at the input level. Specifically, the user can pass a `pass_read_limit` value which controls how much memory is used for storing compressed/decompressed data. The reader will make multiple passes over the file, reading as many row groups as it can to attempt to fit within this limit. Within each pass, chunks are emitted as before.

From the external user's perspective, the chunked read mechanism is the same. You call `has_next()` and `read_chunk()`. If the user has specified a value for `pass_read_limit`, the set of chunks produced might end up being different (although the concatenation of all of them will still be the same).
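To make that concrete, a minimal usage sketch (the numeric limits are arbitrary, and the two-limit constructor used here is the new overload shown at the end of this description):

```cpp
#include <cudf/io/parquet.hpp>

#include <string>

void read_in_chunks(std::string const& filepath)
{
  auto const options =
    cudf::io::parquet_reader_options::builder(cudf::io::source_info{filepath}).build();

  // 512 MB cap on each output chunk; ~1 GB soft cap on the memory used to read
  // and decompress input data within a single pass.
  auto reader =
    cudf::io::chunked_parquet_reader(512 * 1024 * 1024, 1024 * 1024 * 1024, options);

  while (reader.has_next()) {
    auto chunk = reader.read_chunk();  // cudf::io::table_with_metadata
    // ... process chunk.tbl ...
  }
}
```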
The core idea of the code change is to add the notion of an internal `pass`. Previously we had a `file_intermediate_data` which held data across `read_chunk()` calls. There is now a `pass_intermediate_data` struct which holds information specific to a given pass. Many of the invariant things from the file level before (row groups and chunks to process) are now stored in the pass intermediate data. As we begin each pass, we take the subset of global row groups and chunks that we are going to process for this pass, copy them to our intermediate data, and the remainder of the reader references this instead of the file-level data.

In order to avoid breaking pre-existing interfaces, there's a new constructor for the `chunked_parquet_reader` class:
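A rough sketch of what that overload looks like, inferred from the `_impl = std::make_unique<impl>(...)` call reviewed above (the stream/memory-resource default arguments are assumptions):

```cpp
namespace cudf::io {

class chunked_parquet_reader {
 public:
  // New overload: besides the limit on output chunk size, take a soft limit on
  // the memory used for reading and decompressing input data within one pass.
  chunked_parquet_reader(
    std::size_t chunk_read_limit,
    std::size_t pass_read_limit,
    parquet_reader_options const& options,
    rmm::cuda_stream_view stream        = cudf::get_default_stream(),
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

  // ... existing constructors and member functions unchanged ...
};

}  // namespace cudf::io
```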
Checklist