-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor contains_table
with cuco::static_set
#14064
Refactor contains_table
with cuco::static_set
#14064
Conversation
cpp/src/search/contains_table.cu
Outdated
* @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or | ||
needles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or | |
needles | |
* @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or | |
needles |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure it's worth the effort. I assume we shouldn't rely on any manual alignment for code formatting/linting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manual alignment is for dev reading, to enhance aesthetics/readability, not for the generated docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regardless of alignment, I think all lines should start with a single space and an asterisk.
https://github.com/rapidsai/cudf/blob/branch-23.10/cpp/doxygen/developer_guide/DOCUMENTATION.md#block-comments
I think this should apply to blank lines within the block comment as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all lines should start with a single space and an asterisk.
Oh, good catch. Definitely yes, something went wrong with my IDE formatting. Fixed
The peak memory usage drops about 50% after this PR in all test cases. As for performance, there is a trade-off between nested and flat types. The legacy map uses a hardcoded CG size = 4 and there is no way to change it on the user side. With the new set data structure and CG size = 4, the new code outperforms the legacy implementation for both int column and list column:
If we reduce the CG size to 1, the performance for int column can be further improved to about 3.5x speedups while the performance for list column can be as much as 30% slower than the current implementation:
|
Can we selectively create a map (using separate code path) with different CG if the input is flat/nested? |
That's exactly what we are doing in |
The latest performance with CG tuning based on haystack data type:
|
cpp/src/search/contains_table.cu
Outdated
helper_func); | ||
} else { | ||
if (cudf::detail::has_nested_columns(needles)) { | ||
dispatch_nan_comparator<false, true>(compare_nulls, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the case I'm interested in investigating in the other thread. Is the <false, true>
dispatch providing special functionality or performance advantages that can't be achieved with 2 dispatches (<true, true>
or <false, false>
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I see your point.
Ideally, users should always check nested against nested. Having the <false, true>
dispatch allows users to check nested needles against a flat haystack (though I'm not sure either if it's a legit use case), and the haystack set insertion would be faster than the <true, true>
dispatch.
For my education, in this particular case, I assume <true, true>
would work fine but less efficiently since we do redundant nested-related checks for a flat column. But would <false, false>
produce proper results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But would <false, false> produce proper results?
Checked with @divyegala, this will cause runtime errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having the
<false, true>
dispatch allows users to check nested needles against a flat haystack (though I'm not sure either if it's a legit use case)
Maybe I'm misunderstanding the problem -- but it seems like a nested needle could never be found in a haystack without nesting. Shouldn't that fail because the types of the needle and haystack don't match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't that fail because the types of the needle and haystack don't match?
I'm not sure. The current contains_table
allows this combination and returns 0 in this case. I'm trying to align with the existing behavior. Maybe @ttnghia knows more about what is the expected behavior here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the thing: when hashing the column with nulls, the validity (true/false) of the column will be hashed into the output hash:
if (_check_nulls) {
auto validity_it = detail::make_validity_iterator<true>(curr_col);
hash = detail::accumulate(..., hash,...);
When using hash table (map/set), we must ensure that the hash values for rows in both tables are consistent. That means, if one side table has nulls and hashed with nulls, the other side must also be hashed with the same has_nulls
. Otherwise, one table is hashed differently than the other and you will not find the correct "contains" output. That's why you see the hashers are constructed using has_any_nulls
.
And that is only relevant to hashing.
For equality comparator only (without hashing), we can use different has_nulls
for tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a different issue. Null handling is properly propagated to both self and two-table comparators.
The focus was whether it's legit to have mismatched column types between haystack and needles. After checking the public contains
API, I think the answer is clearer now: I've updated the code so now it will throw if column types mismatch. The implementation is therefore simplified as @bdice suggested. Nice catch, thanks! 🙏
…o cuco-contains-table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing the dispatch logic here. I knew that seemed fishy! Everything looks good here to me. I haven't looked at benchmarks or compile times yet. Anything noteworthy there to consider before merging?
edit: compile time for this PR is 9:10. I don't know how this compares to previous implementations.
A snapshot of the compile times near the beginning of 23.10 shows |
@davidwendt Thanks for double-checking the build time. Otherwise, as for benchmark results, we get about 30% to 3.5x speedups for flat columns and about the same performance (slightly better in most cases) for nested ones. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving CMake changes.
/merge |
Description
Contributes to #12261
This PR refactors
contains_table
to use the newcuco::static_set
data structure. It also adds acontains_table
benchmark to track the performance before and after this work.Checklist