Rework `read_csv` IO to avoid reading whole input with a single `host_read` #16826

vuule · 2024-09-18T00:58:22Z

Description

The CSV reader ingests all input data with a single call to host_read.
This is a problem for a few reasons:

- With cudaHostRegister, we cannot reliably copy from the mapped region to the GPU without running into issues caused by mixing registered and unregistered areas. The reader can't know the datasource implementation details needed to avoid this issue.
- Currently the reader performs the H2D copies manually, so there are no multi-threaded or pinned-memory optimizations. Using device_read has the potential to outperform manual copies.

This PR changes read_csv IO to perform small host_reads to get data such as the BOM and the first row. Most of the data is then read in chunks using device_read calls. We can remove more host_reads in the future by moving some of the host processing to the GPU.
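As a rough illustration of the read pattern described above, here is a minimal sketch. `toy_source`, `bom_length`, and `read_in_chunks` are hypothetical names made up for this example; they are not the cudf datasource API, and the "device" read is only simulated with a host-side copy.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a datasource; NOT the cudf::io::datasource API.
struct toy_source {
  std::string bytes;

  std::size_t size() const { return bytes.size(); }

  // Small host-side read, clamped to the end of the input.
  std::string host_read(std::size_t offset, std::size_t len) const
  {
    if (offset >= bytes.size()) { return {}; }
    return bytes.substr(offset, len);
  }

  // Simulated "device" read: appends the bytes to dst and returns how many were copied.
  std::size_t device_read(std::size_t offset, std::size_t len, std::vector<char>& dst) const
  {
    auto const chunk = host_read(offset, len);
    dst.insert(dst.end(), chunk.begin(), chunk.end());
    return chunk.size();
  }
};

// Returns the number of leading bytes taken by a UTF-8 BOM (0 or 3),
// using only a tiny host read instead of loading the whole input.
inline std::size_t bom_length(toy_source const& src)
{
  static constexpr std::string_view utf8_bom{"\xEF\xBB\xBF", 3};
  auto const head = src.host_read(0, utf8_bom.size());
  return std::string_view{head} == utf8_bom ? utf8_bom.size() : 0;
}

// Reads everything after the BOM in fixed-size chunks.
inline std::vector<char> read_in_chunks(toy_source const& src, std::size_t chunk_size)
{
  std::vector<char> out;
  if (chunk_size == 0) { return out; }
  for (auto pos = bom_length(src); pos < src.size();) {
    pos += src.device_read(pos, chunk_size, out);
  }
  return out;
}
```

The point of the split is that only the small probing reads go through the host path, while the bulk of the data goes through the chunked path that a datasource can optimize.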

No significant changes in performance. We are likely to get performance improvements from future changes like increasing the kvikIO thread pool size.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

mythrocks

A couple of questions.

I'm not familiar with this code. I'm not sure I'll do this justice on my first read of it.

mythrocks · 2024-09-23T19:21:33Z

cpp/src/io/csv/reader_impl.cu

-  rmm::device_uvector<char> d_data{
-    (load_whole_file) ? data.size() : std::min(buffer_size * 2, data.size()), stream};
-  d_data.resize(0, stream);
+  auto pos = range_begin;


I think I'm not reading this correctly: Where is pos modified?

vuule · 2024-09-24

Way down in line 393. I messed with the control flow here as little as I could; this code is very fragile and not very well documented.
The core change is the addition of the byte_range_offset parameter, to be able to read from a source or a host buffer that contains the whole file.

cpp/src/io/csv/reader_impl.cu

mythrocks · 2024-09-23T20:00:56Z

As a complete aside, I was wondering if there is any value in making the constructor of cudf::io::detail::csv::selected_rows_offsets explicit. I realize that it wasn't modified here.

vuule · 2024-09-25T18:13:43Z

> As a complete aside, I was wondering if there is any value in making the constructor of cudf::io::detail::csv::selected_rows_offsets explicit. I realize that it wasn't modified here.

Made it explicit 👍

mythrocks · 2024-09-26T18:13:29Z

cpp/src/io/csv/reader_impl.cu

+
+  // None of the parameters for row selection is used, we are parsing the entire file
+  bool const load_whole_file =
+    range_offset == 0 && range_size == 0 && skip_rows <= 0 && skip_end_rows <= 0 && num_rows == -1;


Curious why these aren't all equality checks. Under what scenario would skip_rows < 0 || skip_end_rows < 0?

vuule · 2024-09-26

Negative values mean "no value". We could modernize this with std::optional.
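For illustration only, the std::optional idea could look roughly like the sketch below. `row_selection` is a hypothetical struct invented for this example; its field names mirror the variables in the diff above, but this is not how the reader options are actually declared.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical option bundle; an empty optional replaces the negative "no value" sentinel.
struct row_selection {
  std::size_t range_offset = 0;
  std::size_t range_size   = 0;
  std::optional<std::size_t> skip_rows;
  std::optional<std::size_t> skip_end_rows;
  std::optional<std::size_t> num_rows;
};

// True when no row-selection parameter is in effect, i.e. the whole input is parsed.
// Skipping zero rows is treated the same as not skipping (matches the `<= 0` checks),
// while any explicit num_rows, including 0, counts as a selection (matches `== -1`).
inline bool load_whole_file(row_selection const& sel)
{
  return sel.range_offset == 0 && sel.range_size == 0 && sel.skip_rows.value_or(0) == 0 &&
         sel.skip_end_rows.value_or(0) == 0 && !sel.num_rows.has_value();
}
```

With this shape the mixed `<= 0` and `== -1` comparisons disappear, and the "unset" state is visible in the type.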

mythrocks

LGTM!

karthikeyann

Looks good to me.
minor nit.

karthikeyann · 2024-09-27T19:59:48Z

cpp/src/io/csv/reader_impl.cu

+                                              bom_buffer->size()};
+      if (has_utf8_bom(bom_chars)) { data_start_offset += sizeof(UTF8_BOM); }
+    } else {
+      constexpr auto find_data_start_chunk_size = 4ul * 1024;


Suggestion for the future:
For wide CSVs, if this turns out to take a lot of time, we could double find_data_start_chunk_size after a couple of loop iterations in which the terminator is not found.

vuule · 2024-09-28T06:21:49Z

/merge

…ource` (#16865)

Depends on #16826

Set of fixes that improve robustness on the non-GDS file input:
1. Avoid registering beyond the byte range - addresses problems when reading adjacent byte ranges from multiple threads (GH only).
2. Allow reading data outside of the memory mapped region. This prevents issues with very long rows in CSV or JSON input.
3. Copy host data when the range being read is only partially registered. This avoids errors when trying to copy the host data range to the device (GH only).

Modifies the datasource class hierarchy to avoid reuse of direct file `host_read`s

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Basit Ayantunde (https://github.com/lamarrr)
- Mads R. B. Kristensen (https://github.com/madsbk)
- Bradley Dice (https://github.com/bdice)

URL: #16865
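To make the third fix more concrete, here is a small sketch of the range check it implies. `byte_span`, `overlap`, `classify`, and `needs_host_copy` are hypothetical names used only for this illustration; the follow-up PR's actual implementation may look quite different.

```cpp
#include <cstddef>

// Half-open byte interval [begin, end) within a file.
struct byte_span {
  std::size_t begin;
  std::size_t end;
};

enum class overlap { none, partial, full };

// Classifies how much of the `read` range lies inside the `registered` region.
inline overlap classify(byte_span read, byte_span registered)
{
  if (read.end <= registered.begin || read.begin >= registered.end) { return overlap::none; }
  if (read.begin >= registered.begin && read.end <= registered.end) { return overlap::full; }
  return overlap::partial;
}

// A read that straddles the boundary of the registered region mixes registered and
// unregistered memory, so in this sketch the data is staged through a host copy first.
inline bool needs_host_copy(byte_span read, byte_span registered)
{
  return classify(read, registered) == overlap::partial;
}
```

Only the partially registered case needs special handling here; fully registered and fully unregistered ranges can each be copied in the usual way.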

vuule added 4 commits September 17, 2024 14:28

works

cc4ab53

mild clean up

879d450

bit more clean up

b7b4935

well.. there it is

63abe6a

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 18, 2024

vuule self-assigned this Sep 18, 2024

vuule added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 18, 2024

vuule added 4 commits September 18, 2024 13:06

well.. there it is

9a26102

Merge branch 'rework-read_csv-ingest' of https://github.com/vuule/cudf into rework-read_csv-ingest

a1f487d

Merge branch 'branch-24.10' into rework-read_csv-ingest

f5e5bae

Merge branch 'branch-24.10' of https://github.com/rapidsai/cudf into rework-read_csv-ingest

315119f

vuule changed the base branch from branch-24.10 to branch-24.12 September 19, 2024 22:04

Merge branch 'branch-24.12' into rework-read_csv-ingest

42d560a

vuule commented Sep 19, 2024

cpp/src/io/csv/reader_impl.cu (outdated, resolved)

vuule marked this pull request as ready for review September 19, 2024 22:07

vuule requested a review from a team as a code owner September 19, 2024 22:07

vuule requested review from mythrocks and karthikeyann September 19, 2024 22:07

vuule added 3 commits September 20, 2024 11:54

Merge branch 'rework-read_csv-ingest' of https://github.com/vuule/cudf into rework-read_csv-ingest

ad93cec

Merge branch 'branch-24.10' of https://github.com/rapidsai/cudf into rework-read_csv-ingest

0b33a4c

Merge branch 'branch-24.12' of https://github.com/rapidsai/cudf into rework-read_csv-ingest

f490c30

vuule added the Performance Performance related issue label Sep 20, 2024

vuule mentioned this pull request Sep 20, 2024

Properly handle the mapped and registered regions in memory_mapped_source #16865

Merged

3 tasks

Merge branch 'branch-24.12' into rework-read_csv-ingest

7fea0fd

mythrocks reviewed Sep 23, 2024

vuule and others added 3 commits September 24, 2024 09:48

Merge branch 'branch-24.12' into rework-read_csv-ingest

bbc1b6c

Merge branch 'branch-24.12' into rework-read_csv-ingest

96c8b64

Apply suggestions from code review

4e19252

Co-authored-by: MithunR <[email protected]>

galipremsagar and others added 4 commits September 24, 2024 18:46

Merge branch 'branch-24.12' into rework-read_csv-ingest

b67cfe2

comment

e8dc274

Co-authored-by: MithunR <[email protected]>

Merge branch 'branch-24.12' into rework-read_csv-ingest

51432b1

explicit ctor

02f8876

vuule requested a review from mythrocks September 25, 2024 18:14

mythrocks reviewed Sep 26, 2024

cpp/src/io/csv/reader_impl.cu (resolved)

mythrocks reviewed Sep 26, 2024

mythrocks approved these changes Sep 26, 2024

Merge branch 'branch-24.12' into rework-read_csv-ingest

e90c188

karthikeyann approved these changes Sep 27, 2024

vuule added 3 commits September 27, 2024 20:09

code review suggestions

98a22ae

Merge branch 'rework-read_csv-ingest' of https://github.com/vuule/cudf into rework-read_csv-ingest

9d1dd6d

Merge branch 'branch-24.12' into rework-read_csv-ingest

51b50a1

rapids-bot bot merged commit e2bcbb8 into rapidsai:branch-24.12 Sep 28, 2024
100 checks passed

vuule deleted the rework-read_csv-ingest branch December 10, 2024 19:25
