-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of hash_character_ngrams using warp-per-string kernel #16212
Improve performance of hash_character_ngrams using warp-per-string kernel #16212
Conversation
Benchmark for
|
cpp/src/text/generate_ngrams.cu
Outdated
return; | ||
} | ||
|
||
namespace cg = cooperative_groups; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for accommodating cg changes @davidwendt. Just to confirm, do the new changes affect the performance results from before in any way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It looks like the performance is the same with cooperative groups as with the cub warp-reduce.
cpp/src/text/generate_ngrams.cu
Outdated
} | ||
|
||
namespace cg = cooperative_groups; | ||
auto warp = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor:
auto warp = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block()); | |
auto const warp = cg::tiled_partition<cudf::detail::warp_size>(cg::this_thread_block()); |
cpp/src/text/generate_ngrams.cu
Outdated
} | ||
} | ||
auto const char_count = cg::reduce(warp, count, cg::plus<int>()); | ||
if (lane_idx == 0) { d_counts[str_idx] = std::max(0, char_count - ngrams + 1); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (lane_idx == 0) { d_counts[str_idx] = std::max(0, char_count - ngrams + 1); } | |
if (lane_idx == 0) { d_counts[str_idx] = max(0, char_count - ngrams + 1); } |
non-blocking nit: use CUDA native functions instead of STL ones in device code since the former is hardware accelerated. cuda::std::max
also works.
})); | ||
auto [offsets, total_ngrams] = | ||
cudf::detail::make_offsets_child_column(sizes_itr, sizes_itr + input.size(), stream, mr); | ||
auto [offsets, total_ngrams] = [&] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat!
/merge |
Fixes performance regression in `nvtext::generate_character_ngrams` introduced in #16212. Thread-per-row kernel is faster for smaller strings. Authors: - David Wendt (https://github.com/davidwendt) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Vukasin Milovanovic (https://github.com/vuule) - Bradley Dice (https://github.com/bdice) - Nghia Truong (https://github.com/ttnghia) URL: #16849
Description
Improves the performance of
nvtext::hash_character_ngrams
using a warp-per-string kernel instead of a string per thread.Checklist