-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
strings::contains() for multiple scalar search targets #16641
strings::contains() for multiple scalar search targets #16641
Conversation
This commit adds a new `strings::contains()` overload that allows for the search of multiple scalar search targets in the same call. The trick here is that a new kernel has been introduced, to extend the "string-per-warp" approach to search for multiple search keys in the same kernel. This approach allows CUDF to potentially reduce the number of kernels launched for `string::contains()` by a factor of `N`, if all the search keys can be specified in the same call. This helps reduce the kernel-launch overheads for processes that do large numbers of calls to `string::contains()`. Signed-off-by: MithunR <[email protected]> Changed iteration order, for better cache performance. More optimizations: 1. Removed calls to `thrust::fill()`. The bool values are now explicitly written in the kernel. 2. Switched host-device copy to use async. Revert "More optimizations:" This reverts commit c0e355c. This commit was wrong: The thrust::fill() checks for empty target strings. If removed, we'll need to check for empty target strings for every input string row. This was better done the old way. More improvements: 1. Removed thrust::fill call. Setting values explicitly in the kernel. 2. Switched from using io::hostdevice_vector to rmm::device_uvector. The string_view allocation is tiny. This has helped reduce the time spent in strings::contains(). For small strings, delegate to thread-per-string algo.
Signed-off-by: Chong Gao <[email protected]>
I got this result for benchmark test:
origin: call We get about 1.2x ~ 2x speed. |
/ok to test |
/ok to test |
I'd better cc @davidwendt, whom I consulted as part of #15536. I'd feel a little better about 👍-ing my own work if he reviewed as well. :] Thank you for taking this forward, @res-life. |
Will the comments from #15536 be addressed here? |
Signed-off-by: Chong Gao <[email protected]>
20 targets:
When |
cpp/src/strings/search/find.cu
Outdated
auto const idx = static_cast<size_type>(threadIdx.x + blockIdx.x * blockDim.x); | ||
if (idx >= (num_rows * cudf::detail::warp_size)) { return; } | ||
|
||
auto const lane_idx = idx % cudf::detail::warp_size; | ||
auto const str_idx = idx / cudf::detail::warp_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The num_rows * cudf::detail::warp_size
can overflow size_type
auto const idx = static_cast<size_type>(threadIdx.x + blockIdx.x * blockDim.x); | |
if (idx >= (num_rows * cudf::detail::warp_size)) { return; } | |
auto const lane_idx = idx % cudf::detail::warp_size; | |
auto const str_idx = idx / cudf::detail::warp_size; | |
auto const idx = cudf::detail::grid_1d::global_thread_id(); | |
auto const str_idx = idx / cudf::detail::warp_size; | |
if (str_idx >= num_rows) { return; } | |
auto const lane_idx = idx % cudf::detail::warp_size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
The benchmark run fails with
|
/ok to test |
TODO: test perf again. |
/ok to test |
Still have bugs, will fix ASAP. |
/ok to test |
/ok to test |
10 targets:
17 targets, will trigger splitting targets into groups for long strings.
|
Replaced with #16900, so close this. |
Add new `cudf::strings::contains_multiple` API to search multiple targets within a strings column. Output is a table where the number of columns is the number of targets and each row is a boolean indicating that target was found at the row or not. This PR is to help in collaboration with #16641 Authors: - David Wendt (https://github.com/davidwendt) - GALI PREM SAGAR (https://github.com/galipremsagar) - Chong Gao (https://github.com/res-life) - Bradley Dice (https://github.com/bdice) Approvers: - Chong Gao (https://github.com/res-life) - Yunsong Wang (https://github.com/PointKernel) - MithunR (https://github.com/mythrocks) - Tianyu Liu (https://github.com/kingcrimsontianyu) - Bradley Dice (https://github.com/bdice) URL: #16900
Description
This is based on #15536
Added three optimizations:
For each index of the string, sequentially search each target.
This makes the searching for short strings(<=64) very fast.
In this way, when checking the first char in a string for all targets, previously we need to compare
n
times.After this change, we only need
log(n)
times by using binary search.Original:
Now:
Checklist