Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add groupby_max multi-threaded benchmark #16154

Merged

Conversation

srinivasyadav18
Copy link
Contributor

Description

This PR adds groupby_max multi-threaded benchmark. The benchmark runs multiple max groupby aggregations concurrently using one CUDA stream per host thread.

Closes #16134

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 srinivasyadav18 requested a review from a team as a code owner July 1, 2024 22:44
@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Jul 1, 2024
@srinivasyadav18 srinivasyadav18 added feature request New feature or request Performance Performance related issue non-breaking Non-breaking change labels Jul 1, 2024
@GregoryKimball
Copy link
Contributor

GregoryKimball commented Jul 2, 2024

Thank you @srinivasyadav18 for constructing this!

I ran the benchmark and I believe it is working as expected. With an increased thread and stream count, we are seeing higher throughput for smaller batch sizes (perhaps 7 to 27 GB/s for 4M row batches). For larger batches we see saturation around ~60 GB/s for various thread counts.
image

Two items I noticed:

  • Would you please enable I64 and F64 as optional values for T? We don't need to run them by default, but it would be nice to be able to choose them.
  • This is the first time we are profiling a multi-threaded hash algorithm - congrats! I suppose we don't see a huge benefit because SM utilization is 100% but warp utilization is 34%. Adding more threads doesn't add more warps because the SMs are already active. Adding more threads does improve pipelining and give some boost to throughput (as we were hoping).

This is what an 8-thread groupby_max looks like with 100M row batches:
image

Some commands I was using:
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=/nfs/20240627_databricks/prof_multigroupby ./GROUPBY_NVBENCH -d 0 --profile -b 2 -a T=I32 -a num_rows[pow2]=[15,20,25] -a cardinality=1000000 --timeout 0.02 -a null_probability=0.1

./GROUPBY_NVBENCH -d 0 --profile -b 2 -a T=I32 -a num_rows[pow2]=[10:28:3] -a cardinality=1000000 --timeout 0.02 -a null_probability=0.1 -a num_threads=[1,2,4,8,16,32]

For the intermediate utilization of 4M rows per batch, you can see how 8 streams increases the SM utilization.
image

Zooming in to the 4M row batch case, I think we are seeing copy engine contention even here in groupby_max. Lots of 8byte copies to and from host pageable, and sometimes with no kernel running! Hopefully the work in support of #15620 will also improve pipelining here.

image

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Jul 2, 2024

Something about the multithread benchmark is giving lower throughput. At first I thought it was the I64 versus I32 type, but now I think its a different root cause. Are these two commands running the same thing? Why is the timing so much longer in the multithreaded case?

./GROUPBY_NVBENCH -d 0 -b 2 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1 -a num_threads=1

## groupby_max_multithreaded

### [0] NVIDIA H100 80GB HBM3

|  T  | cardinality |    num_rows    | null_probability | num_threads | Samples | CPU Time | Noise  | GPU Time | Noise  | Mrows/s | peak_memory_usage |
|-----|-------------|----------------|------------------|-------------|---------|----------|--------|----------|--------|---------|-------------------|
| I32 |     1000000 | 2^22 = 4194304 |              0.1 |           1 |    304x | 1.759 ms | 31.12% | 1.746 ms | 31.03% |    2401 |        72.014 MiB |
(all_cuda-122_arch-x86_64) root@4f8b355cf5d1:/nfs/repo/gk24.08ah100/cpp/build/benchmarks# ./GROUPBY_NVBENCH -d 0 -b 0 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1

./GROUPBY_NVBENCH -d 0 -b 0 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1

## groupby_max

### [0] NVIDIA H100 80GB HBM3

|  T  | cardinality |    num_rows    | null_probability | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  | Mrows/s | peak_memory_usage |
|-----|-------------|----------------|------------------|---------|------------|--------|------------|--------|---------|-------------------|
| I32 |     1000000 | 2^22 = 4194304 |              0.1 |   3232x | 534.603 us | 17.77% | 527.480 us | 17.69% |    7951 |        72.014 MiB |

Looking for more information... the original groupby max profile shows tight grouping of aggregate
image

Whereas the multithread groupby max shows a gap of nanosleep between each aggregate invocation. Is this timing difference just due to thread launching and synchronization overhead?
image

Perhaps a single-thread, multi-stream benchmark would show higher throughput!!

@srinivasyadav18
Copy link
Contributor Author

@GregoryKimball Suprisingly, I see same results for groupby_max_multithreaded with num_threads=1 and groupby_max on T4. But, yes it might be very different on new GPUs (H100 etc.).

./GROUPBY_NVBENCH -d 0 -b 0 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1

## groupby_max

### [0] Tesla T4

|  T  | cardinality |    num_rows    | null_probability | Samples | CPU Time | Noise | GPU Time | Noise | Mrows/s | peak_memory_usage |
|-----|-------------|----------------|------------------|---------|----------|-------|----------|-------|---------|-------------------|
| I32 |     1000000 | 2^22 = 4194304 |              0.1 |   3168x | 3.528 ms | 5.77% | 3.519 ms | 5.09% |    1191 |        72.014 MiB |

./GROUPBY_NVBENCH -d 0 -b 2 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1 -a num_threads=1

## groupby_max_multithreaded

### [0] Tesla T4

|  T  | cardinality |    num_rows    | null_probability | num_threads | Samples | CPU Time | Noise  | GPU Time | Noise  | Mrows/s | peak_memory_usage |
|-----|-------------|----------------|------------------|-------------|---------|----------|--------|----------|--------|---------|-------------------|
| I32 |     1000000 | 2^22 = 4194304 |              0.1 |           1 |    944x | 4.079 ms | 11.47% | 4.072 ms | 11.45% |    1030 |        72.014 MiB |

@PointKernel
Copy link
Member

Why is the timing so much longer in the multithreaded case?

We explicitly include threads.wait_for_tasks(); when counting the timer and this can be expensive. For the parquet case it doesn't matter much since kernels are large but with groupby's performance being around several milliseconds, the synchronization cost becomes innegligible IMO. Out of curiosity, what is the CPU in this case?

@GregoryKimball
Copy link
Contributor

Thanks guys, I ran the profiles and benchmarks above on NVIDIA H100 80GB HBM3 + Intel(R) Xeon(R) Platinum 8480CL

cpp/benchmarks/groupby/group_max_multistream.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/groupby/group_max_multithreaded.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/groupby/group_max_multithreaded.cpp Outdated Show resolved Hide resolved
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@GregoryKimball
Copy link
Contributor

GregoryKimball commented Jul 4, 2024

Thanks @srinivasyadav18 for building these new benchmarks!

  1. As far as groupby_max_multistream, I think you and @PointKernel were right to doubt this idea. After testing and profiling I think we can drop the single-thread multi-stream benchmark. Would you please remove the groupby_max_multistream?
  2. On the other hand, I also experimented with multiple batches on the same thread. My goal was to increase the amount of work per thread and better amortize per-thread overhead. I used the following pattern to perform 50 aggregations with input size num_rows
auto perform_agg = [&](int64_t index) { 
        for (int64_t i = 0; i < 50 ; i++) {
        gb_obj.aggregate(requests[index], streams[index]); 
        }
      };

The results show higher throughput and the profiles show clearer pipelining behavior:
image

Would you please consider adding an axis that lets us control "num_batches" (or something similar) to groupby_max_multithreaded?

@GregoryKimball
Copy link
Contributor

@vuule would you please take a look? I would like to merge this new benchmark as soon as it is ready

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just a few nitpicks and questions

cpp/benchmarks/groupby/group_max_multithreaded.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/groupby/group_max_multithreaded.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/groupby/group_max_multithreaded.cpp Outdated Show resolved Hide resolved
cpp/benchmarks/groupby/group_max_multithreaded.cpp Outdated Show resolved Hide resolved
@srinivasyadav18
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit f592e9c into rapidsai:branch-24.08 Jul 10, 2024
79 checks passed
rapids-bot bot pushed a commit that referenced this pull request Aug 23, 2024
…#16630)

This PR fixes a minor bug where the `num_aggregations` axis was missed when working on #16154.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #16630
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Create a multi-threaded nvbenchmark for groupby_max
5 participants