
[FEA] Modernize CSV reader and expand reader options #13916

Open · GregoryKimball opened this issue Aug 18, 2023 · 0 comments

Labels: 0 - Backlog (In queue waiting for assignment) · cuIO (cuIO issue) · feature request (New feature or request) · libcudf (Affects libcudf (C++/CUDA) code)

GregoryKimball (Contributor) commented Aug 18, 2023

Background

The CSV reader in cuDF/libcudf is a common IO interface for ingesting raw data, and is frequently the first IO interface that new users test when getting started with RAPIDS. There have been many improvements to the CSV reader over the years, but much of the implementation has remained the same since its introduction in #3213 and rework in #5024. We see several opportunities to address the CSV reader continuous improvement milestone, and this story associates open issues with particular functions and kernels in the CSV reading process.

Step 1: Decompression and preprocessing

The CSV reader begins with host-side processing in select_data_and_row_offsets. With the exception of decompression, we would like to migrate this processing to the device and refactor this function to use a kvikIO data source. Note that this refactor could also add support for the header parameter and byte_range at the same time (code pointer).
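For context, here is a minimal sketch of how a caller passes the header and byte range settings through the existing public csv_reader_options builder (the file name and sizes are illustrative):

```cpp
#include <cudf/io/csv.hpp>

cudf::io::table_with_metadata read_slice()
{
  auto opts = cudf::io::csv_reader_options::builder(cudf::io::source_info{"data.csv"})
                .header(0)                 // row index to use as column names
                .byte_range_offset(0)      // start of the byte range to parse
                .byte_range_size(1 << 20)  // parse roughly the first 1 MiB
                .build();
  return cudf::io::read_csv(opts);
}
```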

The initial processing interacts with several open issues.

Step 2: Identify row offsets (delimiters)

The next step is identifying record delimiters and computing row offsets in load_data_and_gather_row_offsets (invoked by select_data_and_row_offsets). This algorithm operates in three main steps: gather_row_offsets called with empty data, select_row_context, and gather_row_offsets called with row context data. The row context state machine is difficult to refactor because it uses a custom data representation that packs several logical values into a single 32-bit or 64-bit physical type (code pointer); a simplified sketch of this packing idea follows the list below. The row context tracks whether the content is inside a comment block or a quoted block.

  • gather_row_offsets runs a 4-state "row context" state machine over 16 KB blocks of characters and returns, for each possible initial state, the number of unquoted, uncommented record delimiters in the block
  • select_row_context is invoked in a host-side loop over row_ctr data for each 16 KB block, starting from a state 0 initial context.
  • gather_row_offsets is called in a second pass with a valid all_row_offsets data parameter.
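For illustration, here is a simplified sketch of the packing idea, not the actual libcudf layout: each of the 4 possible initial parser states gets a 16-bit lane of a single 64-bit word, holding a delimiter count and an output state, so a block's whole transition summary travels as one value (field widths are illustrative):

```cpp
#include <cstdint>

// Pack one (delimiter count, output state) pair into a 16-bit lane:
// 14 bits of count, 2 bits of state (illustrative field widths).
constexpr uint64_t pack_lane(uint32_t delim_count, uint32_t out_state)
{
  return ((delim_count & 0x3FFF) << 2) | (out_state & 0x3);
}

// Combine the four lanes (one per possible initial state) into one 64-bit word.
constexpr uint64_t pack_row_context(uint64_t lane0, uint64_t lane1, uint64_t lane2, uint64_t lane3)
{
  return lane0 | (lane1 << 16) | (lane2 << 32) | (lane3 << 48);
}

// Extract the 16-bit lane corresponding to a given initial state.
constexpr uint32_t lane(uint64_t ctx, int initial_state)
{
  return static_cast<uint32_t>((ctx >> (16 * initial_state)) & 0xFFFF);
}
```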

Major design topics:

  • We should consider a larger refactor of the "identify row offsets" code based on a new FST instance (code pointer). An FST instance would make it easy to add states beyond the existing 4-state machine. Please refer to the ParPaRaw paper from Elias Stehle et al. for more information about parallel algorithms for CSV parsing.
  • To unblock Spark-RAPIDS usage of the CSV reader, we may also choose to support a user-provided all_row_offsets parameter to the read function or as a reader option. This would allow Spark to bypass the first gather_row_offsets pass and select_row_context in load_data_and_gather_row_offsets. When calling read_csv on a strings column, Spark already has the row offsets.
  • Also note that refactoring the interface to accept row offsets is relevant to [FEA] read_csv context-passing interface for distributed/segmented parsing #11728, where we would want to provide pre-computed offsets. For this issue we might prefer a new detail API rather than new parameters in the public API (a hypothetical sketch of such an API follows this list); more design work is needed.
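One hypothetical shape for such a detail API; the namespace, function name, and parameters below are illustrative and do not exist in libcudf today:

```cpp
#include <cudf/io/csv.hpp>
#include <cudf/utilities/span.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

// Hypothetical sketch: accept precomputed record offsets so callers such as
// Spark-RAPIDS, which already know the row offsets when reading a strings
// column, can skip the first gather_row_offsets pass and select_row_context.
namespace cudf::io::detail::csv {

cudf::io::table_with_metadata read_csv_with_offsets(
  cudf::io::csv_reader_options const& options,
  cudf::device_span<uint64_t const> row_offsets,  // byte offset of each record start
  rmm::cuda_stream_view stream,
  rmm::mr::device_memory_resource* mr);

}  // namespace cudf::io::detail::csv
```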

The row offsets algorithm interacts with several open issues.

Step 3: Determine column types

The next step is determining the data types for each column that does not map to a user-provided data type. The function determine_column_types completes this work by collecting the user-provided data types and then calling infer_column_types to handle the unspecified ones. infer_column_types invokes the detect_column_types->data_type_detection kernel to collect statistics about the data in each field, and then uses the conventions of the pandas CSV reader to select a column type.
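For reference, a minimal sketch of supplying partial dtypes through the existing builder API so only the remaining columns go through inference (column names and file path are illustrative):

```cpp
#include <cudf/io/csv.hpp>
#include <cudf/types.hpp>

#include <map>
#include <string>

cudf::io::table_with_metadata read_with_partial_types()
{
  // Columns given explicit types skip inference; the rest fall through to
  // infer_column_types and the data_type_detection kernel.
  std::map<std::string, cudf::data_type> dtypes{
    {"id", cudf::data_type{cudf::type_id::INT64}},
    {"name", cudf::data_type{cudf::type_id::STRING}}};

  auto opts = cudf::io::csv_reader_options::builder(cudf::io::source_info{"data.csv"})
                .dtypes(dtypes)
                .build();
  return cudf::io::read_csv(opts);
}
```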

Step 4: Decode data and populate device buffers

The final step, decode_data, makes another pass over the data to decode values according to the determined column types. The kernel is decode_row_column_data->convert_csv_to_cudf.
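For completeness, a minimal sketch of consuming the decoded output, which read_csv returns as a table_with_metadata holding one decoded column per CSV field:

```cpp
#include <cudf/io/csv.hpp>
#include <cudf/table/table.hpp>

void consume(cudf::io::table_with_metadata result)
{
  // The decoded columns live in device memory behind the owning table.
  cudf::table_view view = result.tbl->view();
  auto num_columns = view.num_columns();  // one column per CSV field
  auto num_rows    = view.num_rows();     // one row per record
  (void)num_columns;
  (void)num_rows;
}
```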
