Optimization of tdigest merge aggregation. #16780

nvdbaranec · 2024-09-09T21:26:10Z

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously it did a linear march over the individual tdigests in each group, merging them one by one, which led to terrible performance for large numbers of groups. In principle, though, all this was really doing was a segmented sort of centroid values, so that is what this PR changes it to. Speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x:

Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8

New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
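
To illustrate why the per-group merge can be expressed as a segmented sort, here is a minimal host-side C++ sketch. It is not the libcudf code: the `Centroid` struct, the function names, and the flat array plus offsets layout are assumptions made for illustration, and the real implementation runs on the GPU. It only shows that folding sorted tdigests together one at a time yields the same ordering as sorting each group's segment of a flat centroid array by mean.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical centroid type for illustration; not the libcudf representation.
struct Centroid {
  double mean;
  double weight;
};

// Old approach, as described above: for a single group, fold each input
// tdigest (already sorted by mean) into the running result, one at a time.
std::vector<Centroid> merge_one_by_one(std::vector<std::vector<Centroid>> const& digests)
{
  auto by_mean = [](Centroid const& a, Centroid const& b) { return a.mean < b.mean; };
  std::vector<Centroid> merged;
  for (auto const& d : digests) {
    std::vector<Centroid> next;
    next.reserve(merged.size() + d.size());
    std::merge(merged.begin(), merged.end(), d.begin(), d.end(),
               std::back_inserter(next), by_mean);
    merged = std::move(next);
  }
  return merged;
}

// New approach, sketched on the host: all centroids live in one flat array
// and offsets[g] .. offsets[g + 1] delimits group g. Sorting each segment by
// mean produces the same ordering as the repeated merges above.
void segmented_sort_by_mean(std::vector<Centroid>& centroids,
                            std::vector<std::size_t> const& offsets)
{
  auto by_mean = [](Centroid const& a, Centroid const& b) { return a.mean < b.mean; };
  for (std::size_t g = 0; g + 1 < offsets.size(); ++g) {
    auto first = centroids.begin() + static_cast<std::ptrdiff_t>(offsets[g]);
    auto last  = centroids.begin() + static_cast<std::ptrdiff_t>(offsets[g + 1]);
    std::stable_sort(first, last, by_mean);
  }
}
```

In this sketch the segments are still visited in a host loop. The point of the reformulation is that every group is handled by one bulk operation over the flat array, which is the kind of work a GPU segmented sort can do in parallel.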

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

revans2

I ran a simple performance test against the CPU: an A6000 vs. 16 CPU cores, doing an approximate percentile on 1,000,000,000 rows with 1,000,000 unique keys in the group by. The GPU was 48x faster than the CPU, so it looks good. I still get errors related to #16675 not being in, and I am not sure if there are any merge conflicts that would show up between the two. I have not looked deeply at the C++ code, so I am not approving this, but from the results I am +1 on merging it in.

revans2 · 2024-09-11T14:44:54Z

I did some more testing against the CPU and it looks really good. The improvements range from 5x faster to 32x faster than 16 CPU cores. A lot of the slowness on the CPU comes from spilling/shuffle when there are lots of groups, which we don't appear to suffer from as badly.

mhaseeb123 · 2024-09-17T20:32:26Z

Looks like we also need a style fix for CI to run.

hyperbolic2346

Couple of nits related to recent changes in rmm/cudf. This is an amazing speed up.

mhaseeb123

Style fix needed for CI. LGTM otherwise!

ttnghia · 2024-09-24T16:27:54Z

If there are just a few CI pipelines broken, don't rerun everything. Instead, click on the "Details" link then rerun only the failed jobs.

nvdbaranec · 2024-09-25T18:25:31Z

> If there are just a few CI pipelines broken, don't rerun everything. Instead, click on the "Details" link then rerun only the failed jobs.

That didn't work in this case.

hyperbolic2346

Looks good to me, thanks for making that benchmark change.

ttnghia · 2024-09-25T18:27:15Z

Yeah, it seems the CI workers are not available. Maybe you need to contact devops.

nvdbaranec · 2024-09-25T19:16:10Z

/merge

Fixes rapidsai#16625

This PR fixes a slow implementation of the centroid merging step during the tdigest merge aggregation. Previously it was doing a linear march over the individual tdigests per group and merging them one by one. This led to terrible performance for large numbers of groups. In principle though, all this really was doing was a segmented sort of centroid values. So that's what this PR changes it to. Speedup for 1,000,000 input tdigests with 1,000,000 individual groups is ~1000x:

```
Old
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        7473 ms         7472 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        7433 ms         7431 ms            8
```

```
New
---------------------------------------------------------------------------------------------------------------
Benchmark                                                                     Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------
TDigest/many_tiny_groups/1000000/1/1/10000/iterations:8/manual_time        6.72 ms         6.79 ms            8
TDigest/many_tiny_groups2/1000000/1/1/1000/iterations:8/manual_time        1.24 ms         1.32 ms            8
```

Authors:
- https://github.com/nvdbaranec
- Muhammad Haseeb (https://github.com/mhaseeb123)
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Nghia Truong (https://github.com/ttnghia)
- Mike Wilson (https://github.com/hyperbolic2346)

URL: rapidsai#16780

Removes unused variable that contains host copy of the group_offsets data. This host variable appears to have been made obsolete by a combination of #16897 and #16780.

Found while working on #17149

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Muhammad Haseeb (https://github.com/mhaseeb123)
- Nghia Truong (https://github.com/ttnghia)

URL: #17151

nvdbaranec added 2 commits September 9, 2024 15:57

Optimize the merging of tdigest groups in the tdigest merge aggregation.

b6aea93

Formatting.

73a6360

nvdbaranec added the libcudf, Spark, improvement, and non-breaking labels Sep 9, 2024

nvdbaranec requested a review from a team as a code owner September 9, 2024 21:26

nvdbaranec requested review from ttnghia and mhaseeb123 September 9, 2024 21:26

nvdbaranec marked this pull request as draft September 9, 2024 21:26

revans2 reviewed Sep 10, 2024

nvdbaranec added 3 commits September 11, 2024 16:54

Add tdigest merge benchmark.

996b0cc

Merge branch 'branch-24.10' into tdigest_merge_opt

b02d780

Formatting.

00edf5b

github-actions bot added the CMake label Sep 11, 2024

Merge branch 'branch-24.10' into tdigest_merge_opt

e80154d

nvdbaranec marked this pull request as ready for review September 17, 2024 19:37

davidwendt reviewed Sep 17, 2024

cpp/benchmarks/quantiles/tdigest.cu (outdated)

ttnghia reviewed Sep 17, 2024

cpp/src/quantiles/tdigest/tdigest_aggregation.cu (outdated)

ttnghia reviewed Sep 17, 2024

cpp/src/quantiles/tdigest/tdigest_aggregation.cu (outdated)

ttnghia reviewed Sep 17, 2024

cpp/src/quantiles/tdigest/tdigest_aggregation.cu (outdated)

mhaseeb123 reviewed Sep 17, 2024

cpp/src/quantiles/tdigest/tdigest_aggregation.cu (outdated)

hyperbolic2346 requested changes Sep 18, 2024

cpp/src/quantiles/tdigest/tdigest_aggregation.cu (outdated)

cpp/src/quantiles/tdigest/tdigest_aggregation.cu (outdated)

nvdbaranec added 4 commits September 18, 2024 15:15

Merge branch 'branch-24.10' into tdigest_merge_opt

46dd1f0

Switch to using NVBench for the benchmarks. Added an axis for larger individual tdigests in the tiny and small group benchmarks.

c19bab6

Use device_uvectors instead of full columns in several places. Use cudf::get_current_device_resource_ref() where appropriate.

84d71da

Formatting

72192b8

nvdbaranec requested a review from hyperbolic2346 September 19, 2024 19:03

nvdbaranec requested a review from davidwendt September 19, 2024 19:03

mhaseeb123 reviewed Sep 19, 2024

cpp/src/quantiles/tdigest/tdigest_aggregation.cu

cpp/benchmarks/quantiles/tdigest.cu (outdated)

cpp/benchmarks/quantiles/tdigest.cu (outdated)

nvdbaranec added 2 commits September 20, 2024 12:42

Add some static casts to the state reading code in the benchmark.

eb345ca

Merge branch 'branch-24.10' into tdigest_merge_opt

cc8324f

nvdbaranec requested a review from mhaseeb123 September 20, 2024 17:46

davidwendt reviewed Sep 20, 2024

cpp/benchmarks/quantiles/tdigest.cu (outdated)

Update cpp/benchmarks/quantiles/tdigest.cu

68cd231

Co-authored-by: David Wendt <[email protected]>

mhaseeb123 approved these changes Sep 20, 2024

mhaseeb123 added 3 commits September 20, 2024 16:30

Style fix

6c5bc4b

Minor style fix

c3180a4

Merge branch 'branch-24.10' into tdigest_merge_opt

88a092a

ttnghia approved these changes Sep 21, 2024

nvdbaranec and others added 6 commits September 23, 2024 11:07

Merge branch 'branch-24.10' into tdigest_merge_opt

e0bfb37

Merge branch 'tdigest_merge_opt' of github.com:nvdbaranec/cudf into tdigest_merge_opt

7cc7570

Merge branch 'branch-24.10' into tdigest_merge_opt

6e21bd7

Merge branch 'branch-24.10' into tdigest_merge_opt

cc62cb6

Merge branch 'branch-24.10' into tdigest_merge_opt

77512e8

Merge branch 'tdigest_merge_opt' of github.com:nvdbaranec/cudf into tdigest_merge_opt

7d21093

Merge branch 'branch-24.10' into tdigest_merge_opt

743da41

hyperbolic2346 approved these changes Sep 25, 2024

rapids-bot bot merged commit 8e78424 into rapidsai:branch-24.10 Sep 25, 2024
100 checks passed

nvdbaranec mentioned this pull request Sep 25, 2024

Add a shortcut for when the input clusters are all empty for the tdigest merge #16897

Merged

3 tasks

davidwendt mentioned this pull request Oct 23, 2024

Remove unused variable in internal merge_tdigests utility #17151

Merged

3 tasks
