Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for progressive parquet chunked reading. #14079

Merged

Conversation

nvdbaranec
Copy link
Contributor

@nvdbaranec nvdbaranec commented Sep 11, 2023

Previously, the parquet chunked reader operated by controlling the size of output chunks only. It would still ingest the entire input file and decompress it, which can take up a considerable amount of memory. With this new 'progressive' support, we also 'chunk' at the input level. Specifically, the user can pass a pass_read_limit value which controls how much memory is used for storing compressed/decompressed data. The reader will make multiple passes over the file, reading as many row groups as it can to attempt to fit within this limit. Within each pass, chunks are emitted as before.

From the external user's perspective, the chunked read mechanism is the same. You call has_next() and read_chunk(). If the user has specified a value for pass_read_limit the set of chunks produced might end up being different (although the concatenation of all of them will still be the same).

The core idea of the code change is to add the idea of the internal pass. Previously we had a file_intermediate_data which held data across read_chunk() calls. There is now a pass_intermediate_data struct which holds information specific to a given pass. Many of the invariant things from the file level before (row groups and chunks to process) are now stored in the pass intermediate data. As we begin each pass, we take the subset of global row groups and chunks that we are going to process for this pass, copy them to out intermediate data, and the remainder of the reader reference this instead of the file-level data.

In order to avoid breaking pre-existing interfaces, there's a new contructor for the chunked_parquet_reader class:

  chunked_parquet_reader(
    std::size_t chunk_read_limit,
    std::size_t pass_read_limit,
    parquet_reader_options const& options,
    rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

…amount of memory used for decompression and other scratch space when decoding,

which causes the reader to make multiple 'passes' over the set of row groups to be read.

Signed-off-by: db <[email protected]>
@nvdbaranec nvdbaranec added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Sep 11, 2023
@nvdbaranec nvdbaranec requested a review from a team as a code owner September 11, 2023 22:04
@nvdbaranec nvdbaranec marked this pull request as draft September 11, 2023 22:04
@nvdbaranec nvdbaranec marked this pull request as ready for review September 20, 2023 21:10
cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.hpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
@nvdbaranec nvdbaranec force-pushed the progressive_chunked_reader branch from a519c12 to 3772e7a Compare September 26, 2023 18:53
Comment on lines +52 to +53
_impl = std::make_unique<impl>(
chunk_read_limit, pass_read_limit, std::move(sources), options, stream, mr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this into the initializer list of this constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't initialize a parent's member as part of your own initialization list.

Comment on lines +139 to +141
void load_global_chunk_info();
void compute_input_pass_row_group_info();
void setup_pass();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed a lot of member functions in our previous work for chunked reader, because they can be just free functions. So if possible, please remove such declaration and make them free functions. Having them as member functions, you will have to maintain their signatures here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these were made free functions they would all have to be passed large numbers of parameters, which would the code a lot harder to read.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review.flush()

cpp/tests/io/parquet_chunked_reader_test.cpp Outdated Show resolved Hide resolved
cpp/include/cudf/io/parquet.hpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Show resolved Hide resolved
Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on first pass. I'll make another one tomorrow with a clear head.

cpp/include/cudf/io/detail/parquet.hpp Show resolved Hide resolved
*
* @param chunk_read_limit Limit on total number of bytes to be returned per read,
* or `0` if there is no limit
* @param pass_read_limit Limit on the amount of memory used for reading and decompressing data or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above it says decompression limit, but here it says decompress and reading. I am also concerned by the soft limit. This seems like the thing that you have hard limits. Should it explode on over limit or is the OOM the explosion and that is why it is considered a soft limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a soft limit because it will attempt to continue even if it can't meet the limit. For example, if the user specified 1 MB and it can't fit even one row group (say 50 MB) into that size, it will still attempt to read/decompress one row group at a time.

cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another flush, mostly comments coming from difficulty to understand the code.
Still got some parts to dig into.

cpp/src/io/parquet/reader_impl.hpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.hpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/include/cudf/io/detail/parquet.hpp Show resolved Hide resolved
cpp/src/io/parquet/reader_impl.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few more small suggestions, looks good overall.
I think the biggest issue for me is the fragility of the test, which we discussed offline and found a solution for.

cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/reader_impl_preprocess.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
…encoding to keep the hardcoded uncompressed size predictable.
@nvdbaranec nvdbaranec requested a review from vuule September 27, 2023 22:50
@vuule
Copy link
Contributor

vuule commented Sep 28, 2023

/merge

@rapids-bot rapids-bot bot merged commit 53f0f74 into rapidsai:branch-23.10 Sep 28, 2023
54 of 57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants