[FEA] Modernize CSV reader and expand reader options #13916
Labels: 0 - Backlog, cuIO, feature request, libcudf
Background
The CSV reader in cuDF/libcudf is a common IO interface for ingesting raw data, and is frequently the first IO interface that new users test when getting started with RAPIDS. There have been many improvements to the CSV reader over the years, but much of the implementation has remained the same from its introduction in #3213 and rework in #5024. We see several opportunities to address the CSV reader continuous improvement milestone, and this story associates open issues with particular functions and kernels in the CSV reading process.
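For reference, the quoting semantics that the GPU implementation must reproduce are the same ones implemented by Python's standard `csv` module: a quoted field may contain the record delimiter, so rows cannot be split on newlines alone. A minimal illustration (not cuDF code):

```python
import csv
import io

# A record delimiter ('\n') inside a quoted field does not end the row;
# any CSV reader, including cuDF's, must honor this when splitting rows.
data = 'a,"x\ny",b\nc,d\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['a', 'x\ny', 'b'], ['c', 'd']]
```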
Step 1: Decompression and preprocessing
The CSV reader begins with host-side processing in
select_data_and_row_offsets
. With the exception of decompression, we would like to migrate this processing to be done device side and refactor this function to use a kvikIO data source. Note, this refactor could also include adding support for theheader
parameter andbyte_range
at the same time (code pointer).The initial processing interacts with several issues:
read_csv
context-passing interface for distributed/segmented parsing #11728 describes how the initial byte range parsing to find the first row assumes the byte_range starts in an unquoted state. If a user provides a byte_range that starts in a quoted field, then the reader will fail! The solution described in this issue interacts the next step "identify row offsets".metadata.schema_info
when column names are autogeneratedStep 2: Identify row offsets (delimiters)
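As background for the machinery below: a block of CSV bytes cannot be parsed in isolation, because the number of real record delimiters depends on whether the block starts inside a quoted field. A pure-Python sketch of the multi-pass idea follows, simplified to a 2-state quote/no-quote context (the real kernel also tracks comment state); `scan_block` and `row_offsets` are illustrative names, not libcudf functions:

```python
def scan_block(block, start_in_quote):
    """Count record delimiters in one block and report the end state,
    under an assumed start state (inside or outside a quoted field)."""
    in_quote = start_in_quote
    count = 0
    for ch in block:
        if ch == '"':
            in_quote = not in_quote
        elif ch == '\n' and not in_quote:
            count += 1
    return count, in_quote

def row_offsets(data, block_size=4):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Pass 1 (parallel on GPU): evaluate each block under every possible
    # start state, since the true start state is not yet known.
    per_block = [(scan_block(b, False), scan_block(b, True)) for b in blocks]
    # Pass 2 (host-side loop): chain the per-block transitions starting
    # from the unquoted "state 0" initial context.
    state = False
    start_states = []
    for res_unquoted, res_quoted in per_block:
        start_states.append(state)
        state = (res_quoted if state else res_unquoted)[1]
    # Pass 3 (parallel on GPU): gather actual offsets with resolved states.
    offsets, pos = [], 0
    for block, in_quote in zip(blocks, start_states):
        for ch in block:
            if ch == '"':
                in_quote = not in_quote
            elif ch == '\n' and not in_quote:
                offsets.append(pos)
            pos += 1
    return offsets

print(row_offsets('a,"x\ny",b\nc,d\n'))  # [9, 13]; the '\n' at offset 4 is quoted
```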
The next step is identifying record delimiters and computing row offsets in `load_data_and_gather_row_offsets` (invoked by `select_data_and_row_offsets`). This algorithm operates in three main steps: `gather_row_offsets` called with empty data, `select_row_context`, and `gather_row_offsets` called with row context data. The row context state machine is difficult to refactor because it uses a custom data representation that stores several logical values within a single 32-bit or 64-bit physical type (code pointer). The row context tracks whether the content is in a comment block or in a quoted block.

- `gather_row_offsets` runs a 4-state "row context" state machine over 16 KB blocks of characters and returns the number of un-quoted, un-commented record delimiters from the block, given each possible initial state.
- `select_row_context` is invoked in a host-side loop over `row_ctr` data for each 16 KB block, starting from a `state 0` initial context.
- `gather_row_offsets` is called in a second pass with a valid `all_row_offsets` data parameter.

Major design topics:

- An `all_row_offsets` parameter to the read function or as a reader option. This would allow Spark to bypass the first `gather_row_offsets` pass and `select_row_context` in `load_data_and_gather_row_offsets`. When calling `read_csv` on a strings column, Spark already has the row offsets. This interacts with `read_csv` context-passing interface for distributed/segmented parsing #11728, where we would want to provide pre-computed offsets. For this issue we might prefer a new detail API rather than new parameters in the public API; more design work is needed.

The row offsets algorithm interacts with several open issues:

- the `gather_row_offsets` kernel

Step 3: Determine column types
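As context for the type-determination step below, inference driven by per-field statistics can be sketched in pure Python. The helper names and the exact promotion rules here are illustrative only, not the actual `data_type_detection` logic or pandas' full convention set:

```python
from collections import Counter

def classify_field(s):
    """Classify one raw CSV field; '' is treated as a null entry."""
    if s == '':
        return 'null'
    for kind, caster in (('int', int), ('float', float)):
        try:
            caster(s)
            return kind
        except ValueError:
            pass
    return 'str'

def infer_column_type(fields):
    """Pick a column type from field statistics; the widest class wins."""
    stats = Counter(classify_field(f) for f in fields)
    if stats['str']:
        return 'str'
    if stats['float']:
        return 'float64'
    if stats['int']:
        return 'int64'
    return 'str'  # all-null column; a real reader needs a policy here

print(infer_column_type(['1', '2', '3']))   # int64
print(infer_column_type(['1', '2.5', '']))  # float64
```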
The next step is determining the data types for each column that does not map to a user-provided data type. The function `determine_column_types` completes this work by collecting the user-provided data types, and then calling `infer_column_types` to handle the unspecified data types. `infer_column_types` invokes the `detect_column_types` -> `data_type_detection` kernel to collect statistics about the data in each field, and then uses the conventions of the pandas CSV reader to select a column type.

- `seek_field_end` supports escape characters within data fields; perhaps field traversal is already Spark-compatible.
- `cudf.read_csv` should not cast to floating types if there are null entries in csv #6313: pandas doesn't infer as `float` if there are any nulls. This touches `seek_field_end`, maybe not much else.
- the `seek_field_end` kernel

Step 4: Decode data and populate device buffers
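As a reference point for the decoding options discussed below, the trimming and empty-vs-null choices can be sketched per field. Parameter names such as `empty_as_null` are hypothetical, not existing `csv_reader_options` fields:

```python
def decode_field(raw, quotechar='"', skipinitialspace=False, empty_as_null=True):
    """Decode one raw field: optional leading-space trim (cf. #6659's
    skipinitialspace), surrounding-quote removal (like trim_whitespaces_quotes),
    and a hypothetical empty-vs-null policy (cf. #12145)."""
    s = raw
    if skipinitialspace:
        s = s.lstrip(' ')
    # Strip one pair of surrounding quotes, if present.
    if len(s) >= 2 and s[0] == quotechar and s[-1] == quotechar:
        s = s[1:-1]
    if s == '':
        return None if empty_as_null else ''
    return s

print(decode_field(' "x"', skipinitialspace=True))    # x
print(decode_field('""'))                             # None
print(decode_field('""', empty_as_null=False) == '')  # True
```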
The final step, `decode_data`, does another pass over the data to decode values according to the determined column types. The kernel is `decode_row_column_data` -> `convert_csv_to_cudf`.

- `trim_whitespaces_quotes`. Related to [FEA] Implement skipinitialspace read_csv parameter #6659.
- `csv_reader_options` to read empty strings as blank (i.e. `""`), not `null` #12145: add an option to decode `""` as empty strings or `null`. Probably an additional parsing option.
- `convert_csv_to_cudf` also uses `seek_field_end`, which nominally supports escape characters.
- `nanValue` options
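The `nanValue` bullet above refers to Spark-style configurable NaN tokens. A hedged sketch of how such an option could shape float decoding (`nan_tokens` is an illustrative name, not an existing reader option):

```python
import math

def parse_float(s, nan_tokens=('NaN', 'nan', 'null')):
    # Tokens listed in nan_tokens decode to NaN instead of raising an error.
    if s in nan_tokens:
        return math.nan
    return float(s)

print(parse_float('1.5'))               # 1.5
print(math.isnan(parse_float('null')))  # True
```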