Update libcudf developer guide for strings offsets column (#15661)
Updates the libcudf Developer Guide to better describe the strings offsets child column and include the offsetalator.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #15661
davidwendt authored May 13, 2024
1 parent 149253b commit 13f028f
Showing 1 changed file with 71 additions and 25 deletions.
96 changes: 71 additions & 25 deletions cpp/doxygen/developer_guide/DEVELOPER_GUIDE.md
@@ -1,4 +1,4 @@
# libcudf C++ Developer Guide {#DEVELOPER_GUIDE}
# libcudf C++ Developer Guide

This document serves as a guide for contributors to libcudf C++ code. Developers should also refer
to these additional files for further documentation of libcudf best practices.
@@ -828,7 +828,7 @@ This iterator returns the validity of the underlying element (`true` or `false`)
The proliferation of data types supported by libcudf can result in long compile times. One area
where compile time was a problem is in types used to store indices, which can be any integer type.
The "Indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
The "indexalator", or index-normalizing iterator (`include/cudf/detail/indexalator.cuh`), can be
used for index types (integers) without requiring a type-specific instance. It can be used for any
iterator interface for reading an array of integer values of type `int8`, `int16`, `int32`,
`int64`, `uint8`, `uint16`, `uint32`, or `uint64`. Reading specific elements always returns a
@@ -856,6 +856,41 @@ thrust::lower_bound(rmm::exec_policy(stream),
thrust::less<Element>());
```
### Offset-normalizing iterators
Like the [indexalator](#index-normalizing-iterators),
the "offsetalator", or offset-normalizing iterator (`include/cudf/detail/offsetalator.cuh`), can be
used for offset column types (`INT32` or `INT64` only) without requiring a type-specific instance.
This is helpful when reading or building [strings columns](#strings-columns).
The normalized type is `int64`, which means an `input_offsetalator` returns `int64` type values
for both `INT32` and `INT64` offsets columns.
Likewise, an `output_offsetalator` accepts `int64` type values and stores them into either an
`INT32` or `INT64` output offsets column, whichever type the column was created with.
Use the `cudf::detail::offsetalator_factory` to create an appropriate input or output iterator from an offsets `column_view`.
Example input iterator usage:
```c++
// convert the sizes to offsets
auto [offsets, char_bytes] = cudf::strings::detail::make_offsets_child_column(
output_sizes.begin(), output_sizes.end(), stream, mr);
auto d_offsets =
cudf::detail::offsetalator_factory::make_input_iterator(offsets->view());
// use d_offsets to address the output row bytes
```

Example output iterator usage:

```c++
// create offsets column as either INT32 or INT64 depending on the number of bytes
auto offsets_column = cudf::strings::detail::create_offsets_child_column(total_bytes,
offsets_count,
stream, mr);
auto d_offsets =
cudf::detail::offsetalator_factory::make_output_iterator(offsets_column->mutable_view());
// write appropriate offset values to d_offsets
```
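The normalization idea can be illustrated without libcudf. The following is a minimal host-only sketch, not the actual `offsetalator` implementation: `offset_width`, `normalized_offset_reader`, and `row_size_bytes` are hypothetical names invented for this illustration. It assumes only what is described above, namely that offsets are stored as either `INT32` or `INT64` and are always read back as `int64`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the offsets column type id; not a libcudf type.
enum class offset_width { INT32, INT64 };

// Hypothetical host-only analogue of an input offsetalator: it reads INT32 or
// INT64 offsets from untyped storage and always returns int64_t.
class normalized_offset_reader {
 public:
  normalized_offset_reader(void const* data, offset_width width, std::size_t size)
    : data_{data}, width_{width}, size_{size}
  {
  }

  std::int64_t operator[](std::size_t idx) const
  {
    assert(idx < size_);
    if (width_ == offset_width::INT32) {
      // Widen the stored 32-bit value to the normalized 64-bit type.
      return static_cast<std::int64_t>(static_cast<std::int32_t const*>(data_)[idx]);
    }
    return static_cast<std::int64_t const*>(data_)[idx];
  }

  std::size_t size() const { return size_; }

 private:
  void const* data_;
  offset_width width_;
  std::size_t size_;
};

// Calling code is written once against int64_t, whatever the stored width.
inline std::int64_t row_size_bytes(normalized_offset_reader const& offsets, std::size_t row)
{
  return offsets[row + 1] - offsets[row];
}
```

In this sketch the type check happens at run time on each read, so code such as `row_size_bytes` does not need a separate template instantiation for each offset type.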
## Namespaces
### External
@@ -1241,18 +1276,20 @@ This is related to [Arrow's "Variable-Size List" memory layout](https://arrow.ap
Strings are represented as a column with a data device buffer and a child offsets column.
The parent column's type is `STRING` and its data holds all the characters across all the strings packed together
but its size represents the number of strings in the column, and its null mask represents the
validity of each string. To summarize, the strings column children are:
1. A non-nullable column of [`size_type`](#cudfsize_type) elements that indicates the offset to the beginning of each
string in a dense data buffer of all characters.
but its size represents the number of strings in the column and its null mask represents the
validity of each string.
With this representation, `data[offsets[i]]` is the first character of string `i`, and the
size of string `i` is given by `offsets[i+1] - offsets[i]`. The following image shows an example of
this compound column representation of strings.
The strings column contains a single, non-nullable child column
of offset elements that indicates the byte position offset to the beginning of each
string in the dense data buffer of all characters. With this representation, `data[offsets[i]]` is the
first character of string `i`, and the size of string `i` is given by `offsets[i+1] - offsets[i]`.
The following image shows an example of this compound column representation of strings.
![strings](strings.png)
The type of the offsets column is either `INT32` or `INT64` depending on the number of bytes in the data buffer.
See [`cudf::string_view`](#cudfstrings_column_view-and-cudfstring_view) for more information on processing individual string rows.
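The layout can be modeled on the host as a rough sketch. The names `host_strings_column` and `make_host_strings_column` are hypothetical and exist only for this illustration. The sketch omits the null mask and always stores offsets as 64-bit values, whereas the real offsets child is `INT32` or `INT64` and the data lives in device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical host-only model of the layout; not a libcudf type.
struct host_strings_column {
  std::string chars;                  // all characters packed together
  std::vector<std::int64_t> offsets;  // size() + 1 byte positions into chars

  std::size_t size() const { return offsets.empty() ? 0 : offsets.size() - 1; }

  // Row i spans bytes [offsets[i], offsets[i+1]) of chars.
  std::string_view row(std::size_t i) const
  {
    assert(i < size());
    auto const begin = static_cast<std::size_t>(offsets[i]);
    auto const bytes = static_cast<std::size_t>(offsets[i + 1] - offsets[i]);
    return std::string_view{chars}.substr(begin, bytes);
  }
};

// Builds the packed representation from individual rows.
inline host_strings_column make_host_strings_column(std::vector<std::string> const& rows)
{
  host_strings_column result;
  result.offsets.reserve(rows.size() + 1);
  result.offsets.push_back(0);
  for (auto const& r : rows) {
    result.chars += r;
    // Each new offset is the running total of bytes written so far.
    result.offsets.push_back(static_cast<std::int64_t>(result.chars.size()));
  }
  return result;
}
```

For the rows `"hello"`, `""`, and `"cudf"`, this model stores the characters `hellocudf` and the offsets `0, 5, 5, 9`. The empty row has equal adjacent offsets.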
## Structs columns
A struct is a nested data type with a set of child columns each representing an individual field
@@ -1295,7 +1332,7 @@ struct column's layout is as follows. (Note that null masks should be read from
}
```
The last struct row (index 3) is not null, but has a null value in the INT32 field. Also, row 2 of
The last struct row (index 3) is not null, but has a null value in the `INT32` field. Also, row 2 of
the struct column is null, making its corresponding fields also null. Therefore, bit 2 is unset in
the null masks of both struct fields.
@@ -1351,46 +1388,55 @@ libcudf provides view types for nested column types as well as for the data elem
### cudf::strings_column_view and cudf::string_view
`cudf::strings_column_view` is a view of a strings column, like `cudf::column_view` is a view of
any `cudf::column`. `cudf::string_view` is a view of a single string, and therefore
`cudf::string_view` is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
data type for a `cudf::column` of type [`size_type`](#cudfsize_type). As its name implies, this is a
read-only object instance that points to device memory inside the strings column. It's lifespan is
the same (or less) as the column it views.
A `cudf::strings_column_view` wraps a strings column and contains a parent
`cudf::column_view` as a view of the strings column and an offsets `cudf::column_view`
which is a child of the parent.
The parent view contains the offset, size, and validity mask for the strings column.
The offsets view is non-nullable with `offset()==0` and its own size.
Since the offsets column type can be either `INT32` or `INT64`, use the
offset-normalizing iterator ([offsetalator](#offset-normalizing-iterators)) to access individual offset values.
A `cudf::string_view` is a view of a single string and therefore
is the data type of a `cudf::column` of type `STRING` just like `int32_t` is the
data type for a `cudf::column` of type `INT32`. As its name implies, this is a
read-only object instance that points to device memory inside the strings column.
Its lifespan is the same (or less) as the column it views.
An individual strings column row and a `cudf::string_view` are each limited to the maximum value of [`size_type`](#cudfsize_type) in bytes.
Use the `column_device_view::element` method to access an individual row element. Like any other
column, do not call `element()` on a row that is null.
```c++
cudf::column_device_view d_strings;
cudf::strings_column_view scv;
auto d_strings = cudf::column_device_view::create(scv.parent(), stream);
...
if( d_strings.is_valid(row_index) ) {
string_view d_str = d_strings.element<string_view>(row_index);
...
}
```

A null string is not the same as an empty string. Use the `string_scalar` class if you need an
A null string is not the same as an empty string. Use the `cudf::string_scalar` class if you need an
instance of a class object to represent a null string.

The `string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf
functions like `sort` without string-specific code. The data for a `string_view` instance is
The `cudf::string_view` contains comparison operators `<,>,==,<=,>=` that can be used in many cudf
functions like `sort` without string-specific code. The data for a `cudf::string_view` instance is
required to be [UTF-8](#utf-8) and all operators and methods expect this encoding. Unless documented
otherwise, position and length parameters are specified in characters and not bytes. The class also
includes a `string_view::const_iterator` which can be used to navigate through individual characters
includes a `cudf::string_view::const_iterator` which can be used to navigate through individual characters
within the string.

`cudf::type_dispatcher` dispatches to the `string_view` data type when invoked on a `STRING` column.
`cudf::type_dispatcher` dispatches to the `cudf::string_view` data type when invoked on a `STRING` column.

#### UTF-8

The libcudf strings column only supports UTF-8 encoding for strings data.
[UTF-8](https://en.wikipedia.org/wiki/UTF-8) is a variable-length character encoding wherein each
character can be 1-4 bytes. This means the length of a string is not the same as its size in bytes.
For this reason, it is recommended to use the `string_view` class to access these characters for
For this reason, it is recommended to use the `cudf::string_view` class to access these characters for
most operations.

The `string_view.cuh` header also includes some utility methods for reading and writing
The `cudf/strings/detail/utf8.hpp` header also includes some utility methods for reading and writing
(`to_char_utf8/from_char_utf8`) individual UTF-8 characters to/from byte arrays.
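The difference between length in characters and size in bytes can be shown with a small standalone sketch. The function `count_utf8_characters` is a hypothetical helper written for this illustration, not a libcudf utility, and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string_view>

// Counts UTF-8 characters by skipping continuation bytes, which always have
// the bit pattern 10xxxxxx. Assumes the input is valid UTF-8.
inline std::size_t count_utf8_characters(std::string_view bytes)
{
  std::size_t count = 0;
  for (unsigned char const b : bytes) {
    if ((b & 0xC0) != 0x80) { ++count; }
  }
  return count;
}
```

For example, the string `héllo` occupies 6 bytes because `é` is encoded in 2 bytes, while its length is 5 characters.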

### cudf::lists_column_view and cudf::lists_view