
[FEA] Modernize CSV reader and expand reader options #13916

Open · GregoryKimball opened this issue Aug 18, 2023 · 0 comments

Labels: 0 - Backlog (In queue waiting for assignment) · cuIO (cuIO issue) · feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code)

GregoryKimball (Contributor) commented Aug 18, 2023

Background

The CSV reader in cuDF/libcudf is a common IO interface for ingesting raw data, and is frequently the first IO interface that new users test when getting started with RAPIDS. There have been many improvements to the CSV reader over the years, but much of the implementation has remained the same since its introduction in #3213 and rework in #5024. We see several opportunities to address the CSV reader continuous improvement milestone, and this story associates open issues with particular functions and kernels in the CSV reading process.

Step 1: Decompression and preprocessing

The CSV reader begins with host-side processing in select_data_and_row_offsets. With the exception of decompression, we would like to migrate this processing to the device and refactor this function to use a kvikIO data source. Note that this refactor could also add support for the header parameter and byte_range at the same time (code pointer).
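For context, here is a minimal sketch of how a caller passes the header and byte range settings through the existing public csv_reader_options builder (the file name and sizes are illustrative):

```cpp
#include <cudf/io/csv.hpp>

cudf::io::table_with_metadata read_slice()
{
  auto opts = cudf::io::csv_reader_options::builder(cudf::io::source_info{"data.csv"})
                .header(0)                 // row index to use as column names
                .byte_range_offset(0)      // start of the byte range to parse
                .byte_range_size(1 << 20)  // parse roughly the first 1 MiB
                .build();
  return cudf::io::read_csv(opts);
}
```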

The initial processing interacts with several open issues.

Step 2: Identify row offsets (delimiters)

The next step is identifying record delimiters and computing row offsets in load_data_and_gather_row_offsets (invoked by select_data_and_row_offsets). This algorithm operates in three main steps: gather_row_offsets called with empty data, select_row_context, and gather_row_offsets called with row context data. The row context state machine is difficult to refactor because it uses a custom data representation that packs several logical values into a single 32-bit or 64-bit physical type (code pointer); a simplified sketch of this packing idea follows the list below. The row context tracks whether the content is inside a comment block or a quoted block.

  • gather_row_offsets runs a 4-state "row context" state machine over 16 KB blocks of characters and returns, for each possible initial state, the number of unquoted, uncommented record delimiters in the block
  • select_row_context is invoked in a host-side loop over row_ctr data for each 16 KB block, starting from a state 0 initial context.
  • gather_row_offsets is called in a second pass with a valid all_row_offsets data parameter.
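For illustration, here is a simplified sketch of the packing idea, not the actual libcudf layout: each of the 4 possible initial parser states gets a 16-bit lane of a single 64-bit word, holding a delimiter count and an output state, so a block's whole transition summary travels as one value (field widths are illustrative):

```cpp
#include <cstdint>

// Pack one (delimiter count, output state) pair into a 16-bit lane:
// 14 bits of count, 2 bits of state (illustrative field widths).
constexpr uint64_t pack_lane(uint32_t delim_count, uint32_t out_state)
{
  return ((delim_count & 0x3FFF) << 2) | (out_state & 0x3);
}

// Combine the four lanes (one per possible initial state) into one 64-bit word.
constexpr uint64_t pack_row_context(uint64_t lane0, uint64_t lane1, uint64_t lane2, uint64_t lane3)
{
  return lane0 | (lane1 << 16) | (lane2 << 32) | (lane3 << 48);
}

// Extract the 16-bit lane corresponding to a given initial state.
constexpr uint32_t lane(uint64_t ctx, int initial_state)
{
  return static_cast<uint32_t>((ctx >> (16 * initial_state)) & 0xFFFF);
}
```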

Major design topics:

  • We should consider a larger refactor of the "identify row offsets" code based on a new FST instance (code pointer). An FST instance would make it easy to add states beyond the existing 4-state machine. Please refer to the ParPaRaw paper from Elias Stehle et al. for more information about parallel algorithms for CSV parsing.
  • To unblock Spark-RAPIDS usage of the CSV reader, we may also choose to support a user-provided all_row_offsets parameter to the read function or as a reader option. This would allow Spark to bypass the first gather_row_offsets pass and select_row_context in load_data_and_gather_row_offsets. When calling read_csv on a strings column, Spark already has the row offsets.
  • Also note that refactoring the interface to accept row offsets is relevant to [FEA] read_csv context-passing interface for distributed/segmented parsing #11728, where we would want to provide pre-computed offsets. For this issue we might prefer a new detail API rather than new parameters in the public API (a hypothetical sketch of such an API follows this list); more design work is needed.
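One hypothetical shape for such a detail API; the namespace, function name, and parameters below are illustrative and do not exist in libcudf today:

```cpp
#include <cudf/io/csv.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

// Hypothetical sketch: accept precomputed record offsets so callers such as
// Spark-RAPIDS, which already know the row offsets when reading a strings
// column, can skip the first gather_row_offsets pass and select_row_context.
namespace cudf::io::detail::csv {

cudf::io::table_with_metadata read_csv_with_offsets(
  cudf::io::csv_reader_options const& options,
  cudf::device_span<uint64_t const> row_offsets,  // byte offset of each record start
  rmm::cuda_stream_view stream,
  rmm::mr::device_memory_resource* mr);

}  // namespace cudf::io::detail::csv
```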

The row offsets algorithm interacts with several open issues.

Step 3: Determine column types

The next step is determining the data types for each column that does not map to a user-provided data type. The function determine_column_types completes this work by collecting the user-provided data types and then calling infer_column_types to handle the unspecified ones. infer_column_types invokes the detect_column_types->data_type_detection kernel to collect statistics about the data in each field, and then uses the conventions of the pandas CSV reader to select a column type.
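For reference, a minimal sketch of supplying partial dtypes through the existing builder API so only the remaining columns go through inference (column names and file path are illustrative):

```cpp
#include <cudf/io/csv.hpp>
#include <cudf/types.hpp>

#include <map>
#include <string>

cudf::io::table_with_metadata read_with_partial_types()
{
  // Columns given explicit types skip inference; the rest fall through to
  // infer_column_types and the data_type_detection kernel.
  std::map<std::string, cudf::data_type> dtypes{
    {"id", cudf::data_type{cudf::type_id::INT64}},
    {"name", cudf::data_type{cudf::type_id::STRING}}};

  auto opts = cudf::io::csv_reader_options::builder(cudf::io::source_info{"data.csv"})
                .dtypes(dtypes)
                .build();
  return cudf::io::read_csv(opts);
}
```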

Step 4: Decode data and populate device buffers

The final step, decode_data, makes another pass over the data to decode values according to the determined column types. The kernel is decode_row_column_data->convert_csv_to_cudf.
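For completeness, a minimal sketch of consuming the decoded output, which read_csv returns as a table_with_metadata holding one decoded column per CSV field:

```cpp
#include <cudf/io/csv.hpp>
#include <cudf/table/table.hpp>

void consume(cudf::io::table_with_metadata result)
{
  // The decoded columns live in device memory behind the owning table.
  cudf::table_view view = result.tbl->view();
  auto num_columns = view.num_columns();  // one column per CSV field
  auto num_rows    = view.num_rows();     // one row per record
  (void)num_columns;
  (void)num_rows;
}
```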
