Add groupby_max multi-threaded benchmark #16154
Conversation
Thank you @srinivasyadav18 for constructing this! I ran the benchmark and I believe it is working as expected. With an increased thread and stream count, we are seeing higher throughput for smaller batch sizes (perhaps 7 to 27 GB/s for 4M row batches). For larger batches we see saturation around ~60 GB/s for various thread counts. Two items I noticed:
This is what an 8-thread groupby_max looks like with 100M row batches: Some commands I was using:
For the intermediate utilization of 4M rows per batch, you can see how 8 streams increase the SM utilization. Zooming in to the 4M row batch case, I think we are seeing copy engine contention even here in
@GregoryKimball Surprisingly, I see the same results for
We explicitly include
Thanks guys, I ran the profiles and benchmarks above on
LGTM
Thanks @srinivasyadav18 for building these new benchmarks!
The results show higher throughput and the profiles show clearer pipelining behavior: Would you please consider adding an axis that lets us control "num_batches" (or something similar) to
@vuule would you please take a look? I would like to merge this new benchmark as soon as it is ready.
Looks great, just a few nitpicks and questions
…srinivasyadav18/cudf into groupby_max_multithread_nvbench
/merge
…#16630)
This PR fixes a minor bug where the `num_aggregations` axis was missed when working on #16154.
Authors:
- Yunsong Wang (https://github.com/PointKernel)
Approvers:
- Bradley Dice (https://github.com/bdice)
- David Wendt (https://github.com/davidwendt)
URL: #16630
Description
This PR adds groupby_max multi-threaded benchmark. The benchmark runs multiple max groupby aggregations concurrently using one CUDA stream per host thread.
Closes #16134
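For readers who want a feel for the pattern described above, here is a minimal CPU-only sketch, not the benchmark code itself. It assumes that each host thread owns one batch and runs one max groupby on it; in the actual benchmark each thread would also own a CUDA stream and call into libcudf, which is omitted here. The names `Batch`, `groupby_max`, and `run_concurrent` are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <unordered_map>
#include <vector>

// One batch of rows: keys[i] is the group that values[i] belongs to.
struct Batch {
  std::vector<std::int32_t> keys;
  std::vector<std::int64_t> values;
};

using GroupMax = std::unordered_map<std::int32_t, std::int64_t>;

// Hypothetical helper: maximum value per key for a single batch (CPU stand-in
// for a GPU groupby max aggregation).
GroupMax groupby_max(const Batch& batch) {
  assert(batch.keys.size() == batch.values.size());
  GroupMax result;
  for (std::size_t i = 0; i < batch.keys.size(); ++i) {
    auto [it, inserted] = result.try_emplace(batch.keys[i], batch.values[i]);
    if (!inserted) { it->second = std::max(it->second, batch.values[i]); }
  }
  return result;
}

// Hypothetical driver: one host thread per batch. Each thread writes only to
// its own result slot, so no locking is needed. In a GPU version, this is
// where each thread would use its own stream (assumption, not shown).
std::vector<GroupMax> run_concurrent(const std::vector<Batch>& batches) {
  std::vector<GroupMax> results(batches.size());
  std::vector<std::thread> workers;
  workers.reserve(batches.size());
  for (std::size_t t = 0; t < batches.size(); ++t) {
    workers.emplace_back(
      [&batches, &results, t] { results[t] = groupby_max(batches[t]); });
  }
  for (auto& w : workers) { w.join(); }
  return results;
}
```

Calling `run_concurrent` with N batches launches N threads and returns N independent per-batch results, which mirrors the one-aggregation-per-thread layout. Throughput scaling on a GPU depends on stream concurrency and copy engine behavior, which this sketch does not model.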
Checklist