Support for progressive parquet chunked reading. (#14079)
Previously, the parquet chunked reader operated by controlling the size of output chunks only. It would still ingest the entire input file and decompress it, which can take up a considerable amount of memory. With this new 'progressive' support, we also 'chunk' at the input level. Specifically, the user can pass a `pass_read_limit` value which controls how much memory is used for storing compressed/decompressed data. The reader will make multiple passes over the file, reading as many row groups as it can to attempt to fit within this limit. Within each pass, chunks are emitted as before.

From the external user's perspective, the chunked read mechanism is the same: you call `has_next()` and `read_chunk()`. If the user has specified a value for `pass_read_limit`, the set of chunks produced might end up being different (although the concatenation of all of them will still be the same).

The core idea of the code change is to add the concept of an internal `pass`. Previously we had a `file_intermediate_data` which held data across `read_chunk()` calls. There is now a `pass_intermediate_data` struct which holds information specific to a given pass. Many of the things that were previously invariant at the file level (row groups and chunks to process) are now stored in the pass intermediate data. As we begin each pass, we take the subset of global row groups and chunks that we are going to process for this pass, copy them to our intermediate data, and the remainder of the reader references this instead of the file-level data.
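To make the pass/chunk relationship concrete, here is a minimal, self-contained sketch of a two-level reader. It is a toy model, not the cudf implementation: `row_group`, `pass_data`, `toy_chunked_reader`, `read_all` and `concatenate` are hypothetical names, rows are plain `int`s, and the sketch assumes that a limit of 0 means "no limit" and that a pass always takes at least one row group even if that row group alone exceeds the limit.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a parquet row group: decoded rows plus the memory
// needed to hold its compressed and decompressed data while it is in flight.
struct row_group {
  std::vector<int> rows;
  std::size_t input_bytes;
};

// Per-pass state, loosely mirroring the idea of `pass_intermediate_data`.
struct pass_data {
  std::vector<int> rows;     // rows of the row groups selected for this pass
  std::size_t next_row = 0;  // first row not yet emitted in a chunk
};

class toy_chunked_reader {
 public:
  toy_chunked_reader(std::size_t chunk_read_limit,
                     std::size_t pass_read_limit,
                     std::vector<row_group> file)
    : chunk_read_limit_(chunk_read_limit),
      pass_read_limit_(pass_read_limit),
      file_(std::move(file)) {}

  // More output exists if the current pass has rows left or row groups remain.
  bool has_next() const {
    return pass_.next_row < pass_.rows.size() || next_row_group_ < file_.size();
  }

  // Emits the next output chunk, starting a new pass when the current one is drained.
  std::vector<int> read_chunk() {
    if (pass_.next_row == pass_.rows.size()) { begin_pass(); }
    std::size_t const remaining = pass_.rows.size() - pass_.next_row;
    // Assumption: 0 means "no limit"; otherwise emit at least one row per chunk.
    std::size_t const max_rows =
      chunk_read_limit_ == 0 ? remaining
                             : std::max<std::size_t>(1, chunk_read_limit_ / sizeof(int));
    std::size_t const n = std::min(remaining, max_rows);
    auto const first = pass_.rows.begin() + static_cast<std::ptrdiff_t>(pass_.next_row);
    std::vector<int> chunk(first, first + static_cast<std::ptrdiff_t>(n));
    pass_.next_row += n;
    return chunk;
  }

 private:
  // Selects as many of the remaining row groups as fit within pass_read_limit_.
  void begin_pass() {
    pass_ = pass_data{};
    std::size_t used = 0;
    while (next_row_group_ < file_.size()) {
      row_group const& rg = file_[next_row_group_];
      bool const fits = pass_read_limit_ == 0 || used + rg.input_bytes <= pass_read_limit_;
      if (!fits && used > 0) { break; }  // assumption: always take at least one row group
      used += rg.input_bytes;
      pass_.rows.insert(pass_.rows.end(), rg.rows.begin(), rg.rows.end());
      ++next_row_group_;
    }
  }

  std::size_t chunk_read_limit_;
  std::size_t pass_read_limit_;
  std::vector<row_group> file_;     // file-level data: all row groups
  std::size_t next_row_group_ = 0;  // first row group not yet assigned to a pass
  pass_data pass_;
};

// Drains a reader with the usual has_next()/read_chunk() loop.
std::vector<std::vector<int>> read_all(toy_chunked_reader reader) {
  std::vector<std::vector<int>> chunks;
  while (reader.has_next()) { chunks.push_back(reader.read_chunk()); }
  return chunks;
}

// Joins chunks back into a single sequence of rows.
std::vector<int> concatenate(std::vector<std::vector<int>> const& chunks) {
  std::vector<int> out;
  for (auto const& c : chunks) { out.insert(out.end(), c.begin(), c.end()); }
  return out;
}
```

In this model, reading the same row groups with and without a pass limit can yield a different number of chunks, because a chunk never spans two passes, while `concatenate(read_all(...))` returns the same rows in both cases.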
In order to avoid breaking pre-existing interfaces, there's a new constructor for the `chunked_parquet_reader` class:

```
chunked_parquet_reader(
  std::size_t chunk_read_limit,
  std::size_t pass_read_limit,
  parquet_reader_options const& options,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
```

Authors:
  - https://github.com/nvdbaranec

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #14079
1 parent 2c19bf3, commit 53f0f74
Showing 11 changed files with 561 additions and 203 deletions.