From 13f028f01ad043b0d24f3e4a28f4267c02806390 Mon Sep 17 00:00:00 2001
From: David Wendt <45795991+davidwendt@users.noreply.github.com>
Date: Mon, 13 May 2024 11:39:50 -0400
Subject: [PATCH] Update libcudf developer guide for strings offsets column (#15661)

Updates the libcudf Developer Guide to better describe the strings offsets child column and include the offsetalator.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: https://github.com/rapidsai/cudf/pull/15661
---
 .../developer_guide/DEVELOPER_GUIDE.md        | 96 ++++++++++++++-----
 1 file changed, 71 insertions(+), 25 deletions(-)

diff --git a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
index 05f8e4585cc..ff80c2daab8 100644
--- a/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
+++ b/cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -1,4 +1,4 @@
-# libcudf C++ Developer Guide {#DEVELOPER_GUIDE}
+# libcudf C++ Developer Guide
 
 This document serves as a guide for contributors to libcudf C++ code. Developers should also refer
 to these additional files for further documentation of libcudf best practices.
@@ -828,7 +828,7 @@ This iterator returns the validity of the underlying element (`true` or `false`)
 
 The proliferation of data types supported by libcudf can result in long compile times. One area
 where compile time was a problem is in types used to store indices, which can be any integer type.
-The "Indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
+The "indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
 used for index types (integers) without requiring a type-specific instance. It can be used for any
 iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`, `int64`,
 `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always returns a
@@ -856,6 +856,41 @@ thrust::lower_bound(rmm::exec_policy(stream),
                     thrust::less());
 ```
 
+### Offset-normalizing iterators
+
+Like the [indexalator](#index-normalizing-iterators),
+the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
+used for offset column types (`INT32` or `INT64` only) without requiring a type-specific instance.
+This is helpful when reading or building [strings columns](#strings-columns).
+The normalized type is `int64` which means an `input_offsetalator` will return `int64` type values
+for both `INT32` and `INT64` offsets columns.
+Likewise, an `output_offsetalator` can accept `int64` type values to store into either an
+`INT32` or `INT64` output offsets column created appropriately.
+
+Use the `cudf::detail::offsetalator_factory` to create an appropriate input or output iterator from an offsets column_view.
+Example input iterator usage:
+
+```c++
+  // convert the sizes to offsets
+  auto [offsets, char_bytes] = cudf::strings::detail::make_offsets_child_column(
+    output_sizes.begin(), output_sizes.end(), stream, mr);
+  auto d_offsets =
+    cudf::detail::offsetalator_factory::make_input_iterator(offsets->view());
+  // use d_offsets to address the output row bytes
+```
+
+Example output iterator usage:
+
+```c++
+  // create offsets column as either INT32 or INT64 depending on the number of bytes
+  auto offsets_column = cudf::strings::detail::create_offsets_child_column(total_bytes,
+                                                                           offsets_count,
+                                                                           stream, mr);
+  auto d_offsets =
+    cudf::detail::offsetalator_factory::make_output_iterator(offsets_column->mutable_view());
+  // write appropriate offset values to d_offsets
+```
+
 ## Namespaces
 
 ### External
@@ -1241,18 +1276,20 @@ This is related to [Arrow's "Variable-Size List" memory layout](https://arrow.ap
 
 Strings are represented as a column with a data device buffer and a child offsets column.
 The parent column's type is `STRING` and its data holds all the characters across all the strings packed together
-but its size represents the number of strings in the column, and its null mask represents the
-validity of each string. To summarize, the strings column children are:
-
-1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
-   string in a dense data buffer of all characters.
+but its size represents the number of strings in the column and its null mask represents the
+validity of each string.
 
-With this representation, `data[offsets[i]]` is the first character of string `i`, and the
-size of string `i` is given by `offsets[i+1] - offsets[i]`. The following image shows an example of
-this compound column representation of strings.
+The strings column contains a single, non-nullable child column
+of offset elements that indicates the byte position offset to the beginning of each
+string in the dense data buffer of all characters. With this representation, `data[offsets[i]]` is the
+first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`.
+The following image shows an example of this compound column representation of strings.
 
 ![strings](strings.png)
 
+The type of the offsets column is either `INT32` or `INT64` depending on the number of bytes in the data buffer.
+See [`cudf::string_view`](#cudfstrings_column_view-and-cudfstring_view) for more information on processing individual string rows.
+
 ## Structs columns
 
 A struct is a nested data type with a set of child columns each representing an individual field
@@ -1295,7 +1332,7 @@ struct column's layout is as follows. (Note that null masks should be read from
 }
 ```
 
-The last struct row (index 3) is not null, but has a null value in the INT32 field. Also, row 2 of
+The last struct row (index 3) is not null, but has a null value in the `INT32` field. Also, row 2 of
 the struct column is null, making its corresponding fields also null. Therefore, bit 2 is unset in
 the null masks of both struct fields.
 
@@ -1351,18 +1388,27 @@ libcudf provides view types for nested column types as well as for the data elem
 
 ### cudf::strings_column_view and cudf::string_view
 
-`cudf::strings_column_view` is a view of a strings column, like `cudf::column_view` is a view of
-any `cudf::column`. `cudf::string_view` is a view of a single string, and therefore
-`cudf::string_view` is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
-data type for a `cudf::column` of type [`size_type`](#cudfsize_type). As its name implies, this is a
-read-only object instance that points to device memory inside the strings column. It's lifespan is
-the same (or less) as the column it views.
+A `cudf::strings_column_view` wraps a strings column and contains a parent
+`cudf::column_view` as a view of the strings column and an offsets `cudf::column_view`
+which is a child of the parent.
+The parent view contains the offset, size, and validity mask for the strings column.
+The offsets view is non-nullable with `offset()==0` and its own size.
+Since the offset column type can be either `INT32` or `INT64` it is useful to use the
+offset normalizing iterators [offsetalator](#offset-normalizing-iterators) to access individual offset values.
+
+A `cudf::string_view` is a view of a single string and therefore
+is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
+data type for a `cudf::column` of type `INT32`. As its name implies, this is a
+read-only object instance that points to device memory inside the strings column.
+Its lifespan is the same (or less) as the column it views.
+An individual strings column row and a `cudf::string_view` are each limited to [`size_type`](#cudfsize_type) bytes.
 
 Use the `column_device_view::element` method to access an individual row element. Like any other
 column, do not call `element()` on a row that is null.
 
 ```c++
-  cudf::column_device_view d_strings;
+  cudf::strings_column_view scv;
+  auto d_strings = cudf::column_device_view::create(scv.parent(), stream);
   ...
   if( d_strings.is_valid(row_index) ) {
     string_view d_str = d_strings.element(row_index);
@@ -1370,27 +1416,27 @@ column, do not call `element()` on a row that is null.
   }
 ```
 
-A null string is not the same as an empty string. Use the `string_scalar` class if you need an
+A null string is not the same as an empty string. Use the `cudf::string_scalar` class if you need an
 instance of a class object to represent a null string.
 
-The `string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf
-functions like `sort` without string-specific code. The data for a `string_view` instance is
+The `cudf::string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf
+functions like `sort` without string-specific code. The data for a `cudf::string_view` instance is
 required to be [UTF-8](#utf-8) and all operators and methods expect this encoding. Unless documented
 otherwise, position and length parameters are specified in characters and not bytes. The class also
-includes a `string_view::const_iterator` which can be used to navigate through individual characters
+includes a `cudf::string_view::const_iterator` which can be used to navigate through individual characters
 within the string.
 
-`cudf::type_dispatcher` dispatches to the `string_view` data type when invoked on a `STRING` column.
+`cudf::type_dispatcher` dispatches to the `cudf::string_view` data type when invoked on a `STRING` column.
 
 #### UTF-8
 
 The libcudf strings column only supports UTF-8 encoding for strings data.
 [UTF-8](https://en.wikipedia.org/wiki/UTF-8) is a variable-length character encoding wherein each
 character can be 1-4 bytes. This means the length of a string is not the same as its size in bytes.
-For this reason, it is recommended to use the `string_view` class to access these characters for
+For this reason, it is recommended to use the `cudf::string_view` class to access these characters for
 most operations.
 
-The `string_view.cuh` header also includes some utility methods for reading and writing
+The `cudf/strings/detail/utf8.hpp` header also includes some utility methods for reading and writing
 (`to_char_utf8/from_char_utf8`) individual UTF-8 characters to/from byte arrays.
 
 ### cudf::lists_column_view and cudf::lists_view
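
The offsets layout and the character/byte distinction described in the hunks above can be sketched on the host in plain C++. This is only an illustration and uses no libcudf code: `host_strings_column`, `normalized_offset`, `row_bytes`, `row_view`, and `count_characters` are hypothetical names invented here, and the `std::variant` of 32-bit or 64-bit vectors merely stands in for an `INT32` or `INT64` offsets child column. Reading every offset as `int64`, whatever its stored width, mimics what the patch says the offsetalator does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <variant>
#include <vector>

// Hypothetical host-side stand-in for a strings column: one packed byte
// buffer plus an offsets array stored as either 32-bit or 64-bit integers.
struct host_strings_column {
  std::string chars;  // bytes of all strings packed together
  std::variant<std::vector<std::int32_t>, std::vector<std::int64_t>> offsets;
};

// Read offset `i` as int64 regardless of the stored offset width.
inline std::int64_t normalized_offset(host_strings_column const& col, std::size_t i)
{
  return std::visit(
    [i](auto const& v) {
      assert(i < v.size());
      return static_cast<std::int64_t>(v[i]);
    },
    col.offsets);
}

// Size in bytes of row `row`: offsets[row + 1] - offsets[row].
inline std::int64_t row_bytes(host_strings_column const& col, std::size_t row)
{
  return normalized_offset(col, row + 1) - normalized_offset(col, row);
}

// View of the bytes of row `row`; chars[offsets[row]] is its first byte.
inline std::string_view row_view(host_strings_column const& col, std::size_t row)
{
  auto const begin = static_cast<std::size_t>(normalized_offset(col, row));
  return std::string_view(col.chars.data() + begin,
                          static_cast<std::size_t>(row_bytes(col, row)));
}

// Count UTF-8 characters by skipping continuation bytes (bit pattern 10xxxxxx).
inline std::int64_t count_characters(std::string_view s)
{
  std::int64_t count = 0;
  for (unsigned char b : s) {
    if ((b & 0xC0) != 0x80) { ++count; }
  }
  return count;
}
```

With rows `"ab"`, `""`, and `"héllo"` packed into one buffer, the offsets are `{0, 2, 2, 8}`; the last row occupies 6 bytes but holds 5 characters, which is why position and length parameters need to state whether they mean characters or bytes.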