Refactor `contains_table` with cuco::static_set #14064

PointKernel · 2023-09-08T00:02:32Z

Description

Contributes to #12261

This PR refactors `contains_table` to use the new `cuco::static_set` data structure. It also adds a `contains_table` benchmark to track performance before and after this work.
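As a rough host-side picture of the operation being refactored, a contains-style lookup builds a set from the haystack rows and then probes it once per needle row. The sketch below is only an analogy: it uses `std::unordered_set` as a stand-in and does not reflect the `cuco::static_set` API or the libcudf implementation. The name `contains_rows` and the use of plain `int` rows are assumptions made for the example.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical host-side analogue of a "contains" lookup.
// For each needle, report whether an equal value exists in the haystack.
std::vector<bool> contains_rows(std::vector<int> const& haystack,
                                std::vector<int> const& needles)
{
  // Build phase: insert every haystack row into a hash set (duplicates collapse).
  std::unordered_set<int> const set(haystack.begin(), haystack.end());

  // Probe phase: one membership query per needle row.
  std::vector<bool> result;
  result.reserve(needles.size());
  for (int const needle : needles) {
    result.push_back(set.count(needle) > 0);
  }
  assert(result.size() == needles.size());
  return result;
}
```

Because only membership is needed, a set (keys only) is enough here; no value has to be stored alongside each haystack row.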

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

ttnghia · 2023-09-18T17:03:06Z

cpp/src/search/contains_table.cu

+ * @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or
+ needles


Suggested change

* @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or
needles

PointKernel · 2023-09-18

Not sure it's worth the effort. I assume we shouldn't rely on any manual alignment for code formatting/linting.

ttnghia · 2023-09-18

Manual alignment is for developers reading the source, to improve aesthetics and readability, not for the generated docs.

davidwendt · 2023-09-19

Regardless of alignment, I think all lines should start with a single space and an asterisk.
https://github.com/rapidsai/cudf/blob/branch-23.10/cpp/doxygen/developer_guide/DOCUMENTATION.md#block-comments
I think this should apply to blank lines within the block comment as well.

PointKernel · 2023-09-19

> all lines should start with a single space and an asterisk.

Oh, good catch. Definitely yes, something went wrong with my IDE formatting. Fixed.
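For illustration, a block comment in the style discussed in this thread, where every line (including blank ones) starts with a single space and an asterisk, might look like the following. The function `uses_nested_path` is hypothetical and exists only to carry the comment; the aligned continuation line is one possible layout, not necessarily the one that was committed.

```cpp
#include <cassert>

/**
 * @brief Report which code path a dispatch would take (hypothetical helper).
 *
 * @tparam HasAnyNested Flag indicating whether there are nested columns in either haystack or
 *                      needles
 *
 * @return `true` if the nested-column path is selected, `false` otherwise
 */
template <bool HasAnyNested>
constexpr bool uses_nested_path()
{
  return HasAnyNested;
}

// Compile-time sanity check of the helper above.
static_assert(uses_nested_path<true>(), "nested flag should select the nested path");
```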

PointKernel · 2023-09-19T02:07:58Z

Peak memory usage drops by about 50% with this PR in all test cases.

As for performance, there is a trade-off between nested and flat types. The legacy map uses a hardcoded cooperative group (CG) size of 4, and users cannot change it. With the new set data structure and CG size = 4, the new code outperforms the legacy implementation for int columns (16% to 29% less time) and is about on par for list columns (within roughly 3% either way):

| type | null_probability | table_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|
| I32 | 0 | 10000 | 38.075 us | 14.80% | 26.917 us | 15.44% | -11.158 us | -29.31% | FAIL |
| I32 | 0.1 | 10000 | 35.490 us | 6.37% | 25.740 us | 7.00% | -9.750 us | -27.47% | FAIL |
| I32 | 0 | 100000 | 38.219 us | 16.96% | 28.168 us | 6.23% | -10.051 us | -26.30% | FAIL |
| I32 | 0.1 | 100000 | 40.250 us | 10.13% | 30.142 us | 6.66% | -10.108 us | -25.11% | FAIL |
| I32 | 0 | 1000000 | 89.140 us | 2.51% | 67.346 us | 2.73% | -21.794 us | -24.45% | FAIL |
| I32 | 0.1 | 1000000 | 105.566 us | 2.34% | 83.149 us | 4.47% | -22.417 us | -21.23% | FAIL |
| I32 | 0 | 10000000 | 613.279 us | 0.93% | 478.395 us | 0.83% | -134.883 us | -21.99% | FAIL |
| I32 | 0.1 | 10000000 | 756.407 us | 0.70% | 634.450 us | 0.68% | -121.957 us | -16.12% | FAIL |
| cudf::list_view | 0 | 10000 | 500.887 us | 0.51% | 488.441 us | 0.65% | -12.447 us | -2.48% | FAIL |
| cudf::list_view | 0.1 | 10000 | 624.067 us | 0.44% | 619.780 us | 0.36% | -4.287 us | -0.69% | FAIL |
| cudf::list_view | 0 | 100000 | 527.369 us | 0.54% | 513.879 us | 0.34% | -13.490 us | -2.56% | FAIL |
| cudf::list_view | 0.1 | 100000 | 557.109 us | 0.66% | 539.606 us | 0.60% | -17.503 us | -3.14% | FAIL |
| cudf::list_view | 0 | 1000000 | 3.363 ms | 0.14% | 3.348 ms | 0.06% | -15.632 us | -0.46% | FAIL |
| cudf::list_view | 0.1 | 1000000 | 4.276 ms | 0.15% | 4.317 ms | 0.07% | 41.135 us | 0.96% | FAIL |
| cudf::list_view | 0 | 10000000 | 2.257 ms | 0.20% | 2.249 ms | 0.13% | -8.445 us | -0.37% | FAIL |
| cudf::list_view | 0.1 | 10000000 | 2.862 ms | 0.14% | 2.886 ms | 0.12% | 24.101 us | 0.84% | FAIL |

If we reduce the CG size to 1, int column performance improves further, to about a 3.5x speedup, while list column performance can be more than 30% slower than the current implementation:

| type | null_probability | table_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|
| I32 | 0 | 10000 | 37.817 us | 15.00% | 26.301 us | 9.96% | -11.516 us | -30.45% | FAIL |
| I32 | 0.1 | 10000 | 35.636 us | 6.63% | 25.913 us | 6.84% | -9.722 us | -27.28% | FAIL |
| I32 | 0 | 100000 | 36.421 us | 6.37% | 26.731 us | 6.48% | -9.690 us | -26.61% | FAIL |
| I32 | 0.1 | 100000 | 38.886 us | 8.68% | 28.775 us | 6.04% | -10.111 us | -26.00% | FAIL |
| I32 | 0 | 1000000 | 89.063 us | 2.69% | 38.619 us | 4.26% | -50.444 us | -56.64% | FAIL |
| I32 | 0.1 | 1000000 | 105.316 us | 2.32% | 46.594 us | 3.75% | -58.723 us | -55.76% | FAIL |
| I32 | 0 | 10000000 | 612.065 us | 0.75% | 168.778 us | 1.13% | -443.287 us | -72.42% | FAIL |
| I32 | 0.1 | 10000000 | 757.799 us | 0.77% | 217.097 us | 0.81% | -540.702 us | -71.35% | FAIL |
| cudf::list_view | 0 | 10000 | 487.529 us | 0.59% | 488.617 us | 0.56% | 1.087 us | 0.22% | PASS |
| cudf::list_view | 0.1 | 10000 | 608.179 us | 0.48% | 621.207 us | 0.32% | 13.028 us | 2.14% | FAIL |
| cudf::list_view | 0 | 100000 | 526.683 us | 0.52% | 630.362 us | 0.32% | 103.680 us | 19.69% | FAIL |
| cudf::list_view | 0.1 | 100000 | 557.078 us | 0.69% | 745.262 us | 0.50% | 188.184 us | 33.78% | FAIL |
| cudf::list_view | 0 | 1000000 | 3.355 ms | 0.08% | 3.605 ms | 0.06% | 250.828 us | 7.48% | FAIL |
| cudf::list_view | 0.1 | 1000000 | 4.267 ms | 0.08% | 4.548 ms | 0.33% | 281.113 us | 6.59% | FAIL |
| cudf::list_view | 0 | 10000000 | 2.252 ms | 0.17% | 2.426 ms | 0.10% | 174.065 us | 7.73% | FAIL |
| cudf::list_view | 0.1 | 10000000 | 2.861 ms | 0.19% | 3.014 ms | 0.10% | 153.529 us | 5.37% | FAIL |
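To make the "about 3.5x" figure concrete, the speedup and the %Diff column can be recomputed from the Ref Time and Cmp Time values. The sketch below uses the largest I32 case without nulls from the CG size = 1 run (612.065 us vs. 168.778 us); the helper names are made up for the example.

```cpp
#include <cassert>

// Speedup of the new code relative to the reference: ref_time / cmp_time.
constexpr double speedup(double ref_time, double cmp_time) { return ref_time / cmp_time; }

// Relative change in run time in percent; negative means the new code is faster.
constexpr double percent_diff(double ref_time, double cmp_time)
{
  return (cmp_time - ref_time) / ref_time * 100.0;
}

// I32, null_probability 0, table_size 10000000 (times in microseconds).
constexpr double ref_us = 612.065;
constexpr double cmp_us = 168.778;
```

Here `speedup(ref_us, cmp_us)` comes out to roughly 3.6 and `percent_diff(ref_us, cmp_us)` to roughly -72.4, in line with the %Diff column for that row.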

ttnghia · 2023-09-19T02:49:55Z

> If we reduce the CG size to 1 or 2, the performance for int column can be further improved to about 3.5x speedups while the performance for list column can be as much as 30% slower than the current implementation:

Can we selectively create a map (using a separate code path) with a different CG size depending on whether the input is flat or nested?

PointKernel · 2023-09-19T16:42:20Z

> Can we selectively create a map (using separate code path) with different CG if the input is flat/nested?

That's exactly what we do in contains_column: it uses the cudf multiset for flat types and invokes contains_table for nested types. The cudf multiset has no CG concept, so its performance should be similar to the CG size = 1 implementation with the new cuco set.

PointKernel · 2023-09-20T01:01:29Z

The latest performance, with the CG size tuned based on the haystack data type:

| type | null_probability | table_size | Ref Time | Ref Noise | Cmp Time | Cmp Noise | Diff | %Diff | Status |
|---|---|---|---|---|---|---|---|---|---|
| I32 | 0 | 10000 | 38.075 us | 14.80% | 26.776 us | 14.09% | -11.299 us | -29.68% | FAIL |
| I32 | 0.1 | 10000 | 35.490 us | 6.37% | 25.960 us | 6.55% | -9.530 us | -26.85% | FAIL |
| I32 | 0 | 100000 | 38.219 us | 16.96% | 26.678 us | 6.56% | -11.541 us | -30.20% | FAIL |
| I32 | 0.1 | 100000 | 40.250 us | 10.13% | 28.822 us | 6.07% | -11.427 us | -28.39% | FAIL |
| I32 | 0 | 1000000 | 89.140 us | 2.51% | 38.517 us | 4.52% | -50.623 us | -56.79% | FAIL |
| I32 | 0.1 | 1000000 | 105.566 us | 2.34% | 46.443 us | 3.72% | -59.123 us | -56.01% | FAIL |
| I32 | 0 | 10000000 | 613.279 us | 0.93% | 169.362 us | 2.67% | -443.916 us | -72.38% | FAIL |
| I32 | 0.1 | 10000000 | 756.407 us | 0.70% | 215.990 us | 0.82% | -540.417 us | -71.45% | FAIL |
| cudf::list_view | 0 | 10000 | 500.887 us | 0.51% | 487.216 us | 0.46% | -13.671 us | -2.73% | FAIL |
| cudf::list_view | 0.1 | 10000 | 624.067 us | 0.44% | 618.258 us | 0.26% | -5.809 us | -0.93% | FAIL |
| cudf::list_view | 0 | 100000 | 527.369 us | 0.54% | 498.654 us | 0.44% | -28.715 us | -5.44% | FAIL |
| cudf::list_view | 0.1 | 100000 | 557.109 us | 0.66% | 539.606 us | 0.59% | -17.503 us | -3.14% | FAIL |
| cudf::list_view | 0 | 1000000 | 3.363 ms | 0.14% | 3.340 ms | 0.07% | -23.658 us | -0.70% | FAIL |
| cudf::list_view | 0.1 | 1000000 | 4.276 ms | 0.15% | 4.307 ms | 0.06% | 31.480 us | 0.74% | FAIL |
| cudf::list_view | 0 | 10000000 | 2.257 ms | 0.20% | 2.233 ms | 0.09% | -24.332 us | -1.08% | FAIL |
| cudf::list_view | 0.1 | 10000000 | 2.862 ms | 0.14% | 2.867 ms | 0.08% | 4.870 us | 0.17% | FAIL |
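The tuning can be pictured as a compile-time choice of CG size keyed on whether the haystack has nested columns. The sketch below is not the PR's code: the names `cg_size_for`, `probing_config`, and `selected_cg_size` are hypothetical, and the values 1 (flat) and 4 (nested) are an assumption inferred from the benchmark runs in this thread rather than confirmed from the diff.

```cpp
#include <cassert>

// Hypothetical compile-time selection of the cooperative-group (CG) size.
// Assumption: flat types use CG size 1, nested types use CG size 4.
template <bool HasNested>
constexpr int cg_size_for()
{
  return HasNested ? 4 : 1;
}

// Stand-in for a probing-scheme type parameterized on CG size.
template <int CGSize>
struct probing_config {
  static constexpr int cg_size = CGSize;
};

// Runtime flag -> compile-time configuration, mirroring a two-way dispatch.
inline int selected_cg_size(bool haystack_has_nested)
{
  if (haystack_has_nested) { return probing_config<cg_size_for<true>()>::cg_size; }
  return probing_config<cg_size_for<false>()>::cg_size;
}

static_assert(cg_size_for<true>() == 4, "assumed nested CG size");
static_assert(cg_size_for<false>() == 1, "assumed flat CG size");
```

Keeping the CG size a template parameter means each branch instantiates its own container type, which is why the choice is made once at dispatch time rather than per row.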

bdice · 2023-09-21T12:19:04Z

cpp/src/search/contains_table.cu

+                                        helper_func);
+  } else {
+    if (cudf::detail::has_nested_columns(needles)) {
+      dispatch_nan_comparator<false, true>(compare_nulls,


This is the case I'm interested in investigating in the other thread. Is the <false, true> dispatch providing special functionality or performance advantages that can't be achieved with 2 dispatches (<true, true> or <false, false>)?

PointKernel · 2023-09-21

Yeah, I see your point.

Ideally, users should always check nested against nested. Having the <false, true> dispatch allows users to check nested needles against a flat haystack (though I'm not sure either if it's a legit use case), and the haystack set insertion would be faster than with the <true, true> dispatch.

For my education: in this particular case, I assume <true, true> would work fine but less efficiently, since we would do redundant nested-related checks for a flat column. But would <false, false> produce proper results?

PointKernel · 2023-09-21

> But would <false, false> produce proper results?

Checked with @divyegala: this will cause runtime errors.

bdice · 2023-09-22

> Having the <false, true> dispatch allows users to check nested needles against a flat haystack (though I'm not sure either if it's a legit use case)

Maybe I'm misunderstanding the problem, but it seems like a nested needle could never be found in a haystack without nesting. Shouldn't that fail because the types of the needle and haystack don't match?

PointKernel · 2023-09-22

> Shouldn't that fail because the types of the needle and haystack don't match?

I'm not sure. The current contains_table allows this combination and returns 0 in this case. I'm trying to align with the existing behavior. Maybe @ttnghia knows more about the expected behavior here.

ttnghia · 2023-09-22

Here is the thing: when hashing a column with nulls, the validity (true/false) of each element is hashed into the output hash:

    if (_check_nulls) {
      auto validity_it = detail::make_validity_iterator<true>(curr_col);
      hash = detail::accumulate(..., hash,...);
    }

When using a hash table (map/set), we must ensure that the hash values for rows in both tables are consistent. That means that if one table has nulls and is hashed with null checking, the other table must be hashed with the same has_nulls setting. Otherwise, one table is hashed differently from the other and you will not get the correct "contains" output. That's why the hashers are constructed using has_any_nulls.

This is only relevant to hashing. For the equality comparator alone (without hashing), we can use different has_nulls values for the two tables.
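A toy host-side model of the point above: if validity is folded into the row hash only when null checking is enabled, the same row hashes differently under the two settings, so both tables have to be hashed with one shared flag. Everything here (`element`, `combine`, `row_hash`, `has_any_nulls`, and the mixing function itself) is made up for illustration and is not libcudf's row hasher.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One cell of a row: a value plus its validity (false means null).
struct element {
  std::int32_t value;
  bool valid;
};

// Toy hash mixer, for illustration only.
inline std::uint64_t combine(std::uint64_t seed, std::uint64_t v) { return seed * 31u + v + 1u; }

// Hash a row; when check_nulls is true, each cell's validity is hashed as well.
inline std::uint64_t row_hash(std::vector<element> const& row, bool check_nulls)
{
  std::uint64_t hash = 0;
  for (auto const& e : row) {
    if (check_nulls) { hash = combine(hash, e.valid ? 1u : 0u); }
    if (e.valid) { hash = combine(hash, static_cast<std::uint32_t>(e.value)); }
  }
  return hash;
}

// Both tables must share one flag: nulls on either side enable null hashing for both.
inline bool has_any_nulls(bool haystack_has_nulls, bool needles_has_nulls)
{
  return haystack_has_nulls || needles_has_nulls;
}
```

In this model, hashing the row `{5, 7}` (both valid) with `check_nulls = true` gives a different value than with `check_nulls = false`, even though the row contains no nulls. A needle hashed one way would therefore probe the wrong slot of a set built the other way.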

PointKernel · 2023-09-22

I think that's a different issue. Null handling is properly propagated to both the self and two-table comparators.

The question was whether it's legitimate to have mismatched column types between haystack and needles. After checking the public contains API, I think the answer is clearer now: I've updated the code so that it throws if the column types mismatch. The implementation is therefore simplified as @bdice suggested. Nice catch, thanks! 🙏
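The type check described here can be sketched on the host as follows. This is not the libcudf code: `type_id`, `check_types_match`, `has_nested`, the error messages, and the use of `std::invalid_argument` are stand-ins chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a column's data type.
enum class type_id { INT32, FLOAT64, LIST, STRUCT };

// Throw if the haystack and needles schemas differ in column count or column types.
inline void check_types_match(std::vector<type_id> const& haystack,
                              std::vector<type_id> const& needles)
{
  if (haystack.size() != needles.size()) {
    throw std::invalid_argument("Number of columns mismatch");
  }
  for (std::size_t i = 0; i < haystack.size(); ++i) {
    if (haystack[i] != needles[i]) { throw std::invalid_argument("Column types mismatch"); }
  }
}

// With matching types guaranteed, a single flag describes both tables,
// so a mixed flat/nested dispatch is no longer needed.
inline bool has_nested(std::vector<type_id> const& schema)
{
  for (auto const t : schema) {
    if (t == type_id::LIST || t == type_id::STRUCT) { return true; }
  }
  return false;
}
```

Once the check has passed, `has_nested(haystack)` and `has_nested(needles)` always agree, which is what allows the mixed dispatch case to be dropped.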

bdice

Thanks for fixing the dispatch logic here. I knew that seemed fishy! Everything looks good here to me. I haven't looked at benchmarks or compile times yet. Anything noteworthy there to consider before merging?

edit: compile time for this PR is 9:10. I don't know how this compares to previous implementations.

davidwendt · 2023-09-26T19:24:13Z

A snapshot of the compile times near the beginning of 23.10 shows src/search/contains_table.cu.o in the 10.5 to 12 minute range for a full build. So I think the build time here is about on par or a little better.

PointKernel · 2023-09-26T19:34:10Z

> I haven't looked at benchmarks or compile times yet. Anything noteworthy there to consider before merging?

@davidwendt Thanks for double-checking the build time.

As for benchmark results, we see speedups ranging from about 30% to 3.5x for flat columns and about the same performance (slightly better in most cases) for nested ones.

bdice

Approving CMake changes.

PointKernel · 2023-09-26T19:42:08Z

/merge

PointKernel added 3 commits September 7, 2023 16:58

Refactor contains_table with cuco::static_set

8f2294f

Refactor contains_table with cuco::static_set

f2fd994

Merge remote-tracking branch 'upstream/branch-23.10' into cuco-contains-table

6c85572

PointKernel added the `2 - In Progress`, `libcudf`, `tech debt`, `improvement`, and `non-breaking` labels Sep 8, 2023

PointKernel added this to the Refactor using cuco containers milestone Sep 8, 2023

PointKernel self-assigned this Sep 8, 2023

PointKernel mentioned this pull request Sep 8, 2023

[FEA] Refactor hash-based algorithms with new cuco data structures #12261

Open

PointKernel added 5 commits September 12, 2023 09:08

Merge remote-tracking branch 'upstream/branch-23.10' into cuco-contains-table

40dfead

Fix logic issues with hashset

bec3cd6

Get rid of build_row_bitmask function

5ada788

Minor cleanups: renaming

204bd45

Merge branch 'branch-23.10' into cuco-contains-table

6ae6f2a

PointKernel marked this pull request as ready for review September 13, 2023 22:29

PointKernel requested a review from a team as a code owner September 13, 2023 22:29

PointKernel requested review from robertmaynard and ttnghia September 13, 2023 22:29

PointKernel added the `3 - Ready for Review` label and removed the `2 - In Progress` label Sep 13, 2023

ttnghia reviewed Sep 13, 2023

PointKernel added 2 commits September 14, 2023 13:46

Use build_row_bitmmask instead of bitmask_and

9c7f4f6

Code formatting

237fd70

PointKernel requested a review from ttnghia September 15, 2023 00:10

Merge branch 'branch-23.10' into cuco-contains-table

7283dd3

ttnghia reviewed Sep 18, 2023

Add peak memory usage in contains_table benchmark

b2934e5

github-actions bot added the `CMake` label Sep 19, 2023

Minor doc cleanups

218ab4f

PointKernel added 2 commits September 19, 2023 17:33

Merge remote-tracking branch 'upstream/branch-23.10' into cuco-contains-table

6fcaa46

Distinguish probing scheme CG sizes between nested and flat types for better performance

9553104

PointKernel requested a review from ttnghia September 20, 2023 01:01

ttnghia approved these changes Sep 20, 2023

Merge branch 'branch-23.10' into cuco-contains-table

0b67f9b

bdice reviewed Sep 21, 2023

PointKernel added 8 commits September 21, 2023 16:41

Merge branch 'branch-23.10' into cuco-contains-table

71a8793

Merge remote-tracking branch 'upstream/branch-23.10' into cuco-contains-table

cd75057

Remove redundant docs

e1125c3

Throw if needles and haystack column types mismatch

37f7048

Simplify nested column handling

4f6af5d

Merge branch 'cuco-contains-table' of github.com:PointKernel/cudf into cuco-contains-table

de99c48

Merge branch 'branch-23.10' into cuco-contains-table

cb9614d

Merge branch 'branch-23.10' into cuco-contains-table

b848ada

PointKernel requested review from bdice and davidwendt September 26, 2023 01:23

bdice approved these changes Sep 26, 2023

rapids-bot bot merged commit 030c0f4 into rapidsai:branch-23.10 Sep 26, 2023
54 checks passed

PointKernel deleted the cuco-contains-table branch September 26, 2023 19:43
