Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve distinct_count with cuco::static_set #13343

Merged
merged 12 commits into from
May 15, 2023

Conversation

PointKernel
Copy link
Member

@PointKernel PointKernel commented May 12, 2023

Description

This PR improves distinct_count with the new cuco::static_set data structure.

Related to #12261

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@PointKernel PointKernel added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 12, 2023
@PointKernel PointKernel self-assigned this May 12, 2023
@PointKernel
Copy link
Member Author

The new implementation can be as much as 3x faster by using half of the memory:

distinct_count

[0] Quadro RTX 8000

T num_rows null_probability Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I32 10000 0 26.441 us 7.60% 26.579 us 7.41% 0.138 us 0.52% PASS
I32 100000 0 53.281 us 4.15% 34.462 us 5.66% -18.819 us -35.32% FAIL
I32 1000000 0 297.742 us 1.00% 95.308 us 3.32% -202.434 us -67.99% FAIL
I32 10000000 0 2.789 ms 0.59% 748.410 us 0.96% -2040.396 us -73.16% FAIL
I32 100000000 0 27.743 ms 0.50% 7.298 ms 0.49% -20445.184 us -73.70% FAIL
I32 10000 0.5 44.349 us 9.68% 45.404 us 8.59% 1.055 us 2.38% PASS
I32 100000 0.5 65.833 us 4.86% 47.609 us 5.54% -18.224 us -27.68% FAIL
I32 1000000 0.5 314.300 us 1.60% 126.808 us 3.22% -187.493 us -59.65% FAIL
I32 10000000 0.5 2.876 ms 0.51% 949.728 us 0.66% -1926.144 us -66.98% FAIL
I32 100000000 0.5 28.575 ms 0.41% 9.233 ms 0.42% -19342.702 us -67.69% FAIL
I64 10000 0 26.324 us 7.22% 26.933 us 8.09% 0.609 us 2.31% PASS
I64 100000 0 52.108 us 3.65% 34.782 us 5.57% -17.326 us -33.25% FAIL
I64 1000000 0 322.172 us 0.93% 102.809 us 1.92% -219.363 us -68.09% FAIL
I64 10000000 0 3.025 ms 0.64% 815.905 us 1.11% -2208.860 us -73.03% FAIL
I64 100000000 0 30.171 ms 0.45% 7.987 ms 0.63% -22184.207 us -73.53% FAIL
I64 10000 0.5 45.450 us 8.97% 44.915 us 9.01% -0.535 us -1.18% PASS
I64 100000 0.5 66.510 us 5.39% 47.582 us 5.12% -18.928 us -28.46% FAIL
I64 1000000 0.5 336.464 us 1.74% 132.266 us 2.31% -204.199 us -60.69% FAIL
I64 10000000 0.5 3.086 ms 0.59% 1.014 ms 0.96% -2072.865 us -67.16% FAIL
I64 100000000 0.5 30.677 ms 0.53% 9.871 ms 0.57% -20805.908 us -67.82% FAIL
F32 10000 0 26.681 us 8.84% 26.246 us 6.77% -0.434 us -1.63% PASS
F32 100000 0 52.426 us 4.44% 34.452 us 5.31% -17.974 us -34.28% FAIL
F32 1000000 0 312.283 us 1.41% 101.746 us 2.15% -210.538 us -67.42% FAIL
F32 10000000 0 2.937 ms 0.60% 784.710 us 1.18% -2152.461 us -73.28% FAIL
F32 100000000 0 29.247 ms 0.39% 7.661 ms 0.54% -21585.667 us -73.80% FAIL
F32 10000 0.5 45.674 us 8.53% 45.368 us 8.66% -0.305 us -0.67% PASS
F32 100000 0.5 65.289 us 3.84% 47.702 us 5.07% -17.587 us -26.94% FAIL
F32 1000000 0.5 331.814 us 1.62% 131.857 us 1.82% -199.956 us -60.26% FAIL
F32 10000000 0.5 3.030 ms 0.55% 991.022 us 0.91% -2039.041 us -67.29% FAIL
F32 100000000 0.5 30.078 ms 0.49% 9.651 ms 0.46% -20426.849 us -67.91% FAIL
F64 10000 0 31.647 us 12.87% 26.924 us 8.73% -4.723 us -14.92% FAIL
F64 100000 0 60.255 us 3.06% 34.770 us 5.81% -25.484 us -42.29% FAIL
F64 1000000 0 380.911 us 1.27% 117.725 us 2.22% -263.186 us -69.09% FAIL
F64 10000000 0 3.625 ms 0.64% 968.808 us 1.76% -2656.116 us -73.27% FAIL
F64 100000000 0 36.106 ms 0.40% 9.475 ms 0.50% -26631.622 us -73.76% FAIL
F64 10000 0.5 49.599 us 10.88% 49.040 us 10.84% -0.560 us -1.13% PASS
F64 100000 0.5 74.627 us 5.03% 47.851 us 5.36% -26.776 us -35.88% FAIL
F64 1000000 0.5 374.257 us 1.14% 147.539 us 2.00% -226.718 us -60.58% FAIL
F64 10000000 0.5 3.461 ms 0.62% 1.136 ms 0.87% -2324.853 us -67.18% FAIL
F64 100000000 0.5 34.322 ms 0.34% 11.059 ms 0.52% -23263.270 us -67.78% FAIL

@bdice
Copy link
Contributor

bdice commented May 12, 2023

@PointKernel These speedups are amazing! Great work here. 😍

rapids-bot bot pushed a commit to rapidsai/rapids-cmake that referenced this pull request May 12, 2023
This is a follow-up PR of #407. It fetches the latest cuco that fixes several bugs uncovered during libcudf integration.

rapidsai/cudf#13343 verified this won't break cudf.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #412
@PointKernel PointKernel added Performance Performance related issue 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 12, 2023
@PointKernel PointKernel marked this pull request as ready for review May 12, 2023 21:40
@PointKernel PointKernel requested a review from a team as a code owner May 12, 2023 21:40
@ttnghia
Copy link
Contributor

ttnghia commented May 12, 2023

Yeah, the speedup is fantastic!

@@ -34,6 +34,8 @@
#include <rmm/cuda_stream_view.hpp>
#include <rmm/exec_policy.hpp>

#include <cuco/static_set.cuh>
Copy link
Contributor

@ttnghia ttnghia May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect to see some header removed from this but don't see any?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for now, some common utilities will no longer be needed but I will leave it to the distinct refactoring with cuco::static_reduction_map to avoid back-and-forth effort.

auto key_set = cuco::experimental::static_set{
cuco::experimental::extent{static_cast<cudf::size_type>(compute_hash_table_size(num_rows))},
cuco::empty_key<cudf::size_type>{COMPACTION_EMPTY_KEY_SENTINEL},
row_equal,
Copy link
Contributor

@ttnghia ttnghia May 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the static_set is now constructed with a comparator? Is this a requirement? And insert_if doesn't require a comparator?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the static_set is now constructed with a comparator?

Yes, we are trying to align with std::unordered_set when possible. Similar to allocator and hasher, equality should be passed only once during construction (see reference).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see potential issue with adopting static_set having such requirement into cudf contains APIs. In such places:

  • The haystack table is firstly fed into a static_set, when we need to use self_comparator for key_equal.
  • Then, a needles table is used to check if its rows exist in the set. This time, we need to use two_table_comparator for key_equal.

Copy link
Contributor

@ttnghia ttnghia May 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to steal the show. Here is my suggestion to overcome the aforementioned issue: Combine both comparators into a thin proxy wrapper:

template<typename SelfComparator, typename TwoTableComparator>
struct comparator {
  SelfComparator self_comp;
  TwoTableComparator two_table_comp;

  bool operator()(lhs_index_type i, lhs_index_type j) const { return self_comp(static_cast<size_type>(i), static_cast<size_type>(j)); } 
  bool operator()(lhs_index_type i, rhs_index_type j) const { return two_table_comp(i, j); } 
  bool operator()(rhs_index_type i, lhs_index_type j) const { return two_table_comp(i, j); } 
};

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we make sure the desired operator will be used with the equal wrapper?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When inserting, we should insert all keys of type lhs_index_type while when checking contains we will check keys of type rhs_index_type against the lhs_index_type keys in the set.

Theoretically, my proposal should work but that depends on the detail implementation of static_set. If there is compile error with it then we may also need additional modification to static_set.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, my proposal should work but that depends on the detail implementation of static_set.

Good point. cuco has an additional indirection layer to facilitate sentinel comparison thus the wrapper may not work as it is. To be tested.

Copy link
Member

@divyegala divyegala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a fantastic PR!

@PointKernel
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit d18fd4e into rapidsai:branch-23.06 May 15, 2023
@PointKernel PointKernel deleted the refactor-distinct-count branch May 15, 2023 16:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants