
Add bytes_per_second to groupby max benchmark. #13984

Draft · wants to merge 1 commit into base: branch-23.10

Conversation

@Blonck (Contributor) commented Aug 28, 2023

This patch adds memory statistics for the GROUPBY_NVBENCH benchmark using the max aggregation.

For this purpose, helper functions are introduced to compute the payload size of:

  • Column
  • Table
  • Groupby execution results

This patch relates to #13735.
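
The helpers themselves aren't quoted in this thread; as a rough illustration, here is a minimal sketch of what such payload-size helpers could look like. The name `required_bytes` comes from the discussion below, but the bodies are assumptions, not the PR's actual code:

```
#include <cudf/column/column_view.hpp>
#include <cudf/null_mask.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_dispatcher.hpp>

#include <cstdint>

// Payload of a single column: fixed-width data, the null mask if present,
// and all child columns (e.g. the offsets/chars children of a strings
// column, or the children of nested types).
int64_t required_bytes(cudf::column_view const& view)
{
  int64_t size = 0;
  // Only fixed-width types carry data in the parent; nested and
  // variable-width types store their data in child columns.
  if (cudf::is_fixed_width(view.type())) {
    size = static_cast<int64_t>(view.size()) * cudf::size_of(view.type());
  }
  if (view.nullable()) {
    size += static_cast<int64_t>(cudf::bitmask_allocation_size_bytes(view.size()));
  }
  for (auto it = view.child_begin(); it != view.child_end(); ++it) {
    size += required_bytes(*it);  // recurse into children
  }
  return size;
}

// Payload of a table: sum over its columns.
int64_t required_bytes(cudf::table_view const& table)
{
  int64_t size = 0;
  for (auto const& col : table) { size += required_bytes(col); }
  return size;
}
```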

@rapids-bot (bot) commented Aug 28, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Aug 28, 2023
@Blonck (Contributor, Author) commented Aug 28, 2023

Hi, I added a few helper functions that I would also use for the other groupby benchmarks.

PS: Should I add the other groupby benchmarks to this PR to reduce the number of PRs for #13735 a bit?

@PointKernel (Member) commented:

Would other benchmarks benefit from those helper functions as well? If not, we can move them to `benchmarks/groupby/group_common.hpp`.

> Should I add the other groupby benchmarks to this PR to reduce the number of PRs for #13735 a bit?

Sounds reasonable to me.

@PointKernel PointKernel added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function feature request New feature or request and removed improvement Improvement / enhancement to an existing function labels Aug 28, 2023
@Blonck (Contributor, Author) commented Aug 28, 2023

> Would other benchmarks benefit from those helper functions as well? If not, we can move them to `benchmarks/groupby/group_common.hpp`.

I think so. They could be used in a few of the benchmarks to simplify the code.

@davidwendt (Contributor) commented:

/ok to test

@Blonck Blonck force-pushed the processed_bytes_groupby_max_bench branch from 5879d40 to 405ca60 Compare August 30, 2023 19:16
@Blonck (Contributor, Author) commented Aug 30, 2023

Hi, the PR is not ready; I haven't had enough time over the last two days, so there's no need to review it yet 😃.

@copy-pr-bot (bot) commented Aug 30, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@Blonck Blonck force-pushed the processed_bytes_groupby_max_bench branch from 405ca60 to 9a54ab1 Compare August 31, 2023 15:50
This patch adds memory statistics for the GROUPBY_NVBENCH benchmarks.

For this purpose helper functions are introduced to compute the payload
size for:
  - Column
  - Table
  - Groupby execution results

This patch relates to rapidsai#13735.
@Blonck Blonck force-pushed the processed_bytes_groupby_max_bench branch from 9a54ab1 to c5f49ee Compare August 31, 2023 15:53
@Blonck (Contributor, Author) commented Aug 31, 2023

Hi @PointKernel, @davidwendt,

in the current state of the PR, every NVBENCH benchmark has bytes_per_second added. However, I'm a bit unhappy with it: specifically, I'd like to introduce test code to validate the calculations for required_bytes, especially when dealing with nested column types.
I'm considering adding these utility functions to the test/utilities directory and creating corresponding tests there. I've noticed that similar utility functions used in the tests are also employed within the benchmarks. However, I think that housing benchmark-specific helper functions under the test directories is not optimal. Do you have any better ideas?

PS: Does something like a nested column iterator exist, i.e. one that iterates over all columns of a table, including child columns? I've not found anything like this, and I'm also unsure whether it's feasible to implement, but it would be great for required_bytes 😅.

PPS: benchmark output perf_log.txt

@jihoonson (Contributor) commented:

Hi @Blonck, this is nice work and I'd like to get this change merged. However, it seems that you haven't had a chance to finish it up. Would you mind if I take it from where you left off and finish it?

@davidwendt (Contributor) commented:

You may want to look into using cudf::row_bit_count to get the column/table sizes: https://docs.rapids.ai/api/libcudf/stable/group__transformation__transform.html#gaa78354f8eda093519182149710215d6f
It already handles all cudf types. The output is the number of bits per row, but you can use cudf::reduce to get the total size and divide by 8 to get the number of bytes.
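
A minimal sketch of that suggestion, assuming a hypothetical helper named `table_size_bytes` (not code from this PR):

```
#include <cudf/aggregation.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/transform.hpp>

#include <climits>
#include <cstdint>

// Total payload of a table in bytes: row_bit_count() yields one INT32
// bit count per row; a sum reduction totals them, then bits -> bytes.
int64_t table_size_bytes(cudf::table_view const& input)
{
  auto const bit_counts = cudf::row_bit_count(input);
  auto const total_bits =
    cudf::reduce(bit_counts->view(),
                 *cudf::make_sum_aggregation<cudf::reduce_aggregation>(),
                 cudf::data_type{cudf::type_id::INT64});
  auto const& scalar = static_cast<cudf::numeric_scalar<int64_t> const&>(*total_bits);
  return scalar.value() / CHAR_BIT;
}
```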

@jihoonson (Contributor) commented:

Thanks @davidwendt, this is very useful! I will check out the function you suggested.

rapids-bot bot pushed a commit that referenced this pull request Jul 2, 2024
…ks (#16126)

This PR addresses #13735 for reduction benchmarks. There are 3 new utils added.

- `int64_t estimate_size(cudf::table_view)` returns a size estimate for the given table. #13984 was a previous attempt to add a similar utility, but this implementation uses `cudf::row_bit_count()` as suggested in #13984 (comment) instead of manually estimating the size.
- `void set_items_processed(State& state, int64_t items_processed_per_iteration)` is a thin wrapper around `State.SetItemsProcessed()`. This wrapper takes `items_processed_per_iteration` as a parameter instead of `total_items_processed`, which avoids repeating `State.iterations() * items_processed_per_iteration` in each benchmark class (a sketch follows this list).
- `void set_throughputs(nvbench::state& state)` is added as a workaround for NVIDIA/nvbench#175. We sometimes want to set throughput statistics after `state.exec()` calls especially when it is hard to estimate the result size upfront.
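
As an illustration, a minimal sketch of the `set_items_processed` wrapper; the signature is from the description above, but the body is an assumption:

```
#include <benchmark/benchmark.h>

#include <cstdint>

// Thin wrapper over State::SetItemsProcessed(): callers pass the
// per-iteration item count, and the multiplication by the iteration
// count happens in exactly one place.
void set_items_processed(benchmark::State& state, int64_t items_processed_per_iteration)
{
  state.SetItemsProcessed(state.iterations() * items_processed_per_iteration);
}
```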

Here are snippets of reduction benchmarks after this change.

```
$ cpp/build/benchmarks/REDUCTION_BENCH
...
-----------------------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------
Reduction/bool_all/10000/manual_time                        10257 ns        26845 ns        68185 bytes_per_second=929.907M/s items_per_second=975.078M/s
Reduction/bool_all/100000/manual_time                       11000 ns        27454 ns        63634 bytes_per_second=8.46642G/s items_per_second=9.09075G/s
Reduction/bool_all/1000000/manual_time                      12671 ns        28658 ns        55261 bytes_per_second=73.5018G/s items_per_second=78.922G/s
...

$ cpp/build/benchmarks/REDUCTION_NVBENCH
...
## rank_scan

### [0] NVIDIA RTX A5500

|        T        | null_probability | data_size | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |  Elem/s  | GlobalMem BW |  BWUtil   |
|-----------------|------------------|-----------|---------|------------|--------|------------|-------|----------|--------------|-----------|
|             I32 |                0 |     10000 |  16992x |  33.544 us | 14.95% |  29.446 us | 5.58% |  82.321M |   5.596 TB/s |   728.54% |
|             I32 |              0.1 |     10000 |  16512x |  34.358 us | 13.66% |  30.292 us | 2.87% |  80.020M |   5.286 TB/s |   688.17% |
|             I32 |              0.5 |     10000 |  16736x |  34.058 us | 14.31% |  29.890 us | 3.40% |  81.097M |   5.430 TB/s |   706.89% |
...
```

Note that, when the data type is a 1-byte-width type in the google benchmark result summary, `bytes_per_second` appears to be smaller than `items_per_second`. This is because the former is displayed in multiples of 1024 whereas the latter is displayed in multiples of 1000; for example, 929.907 × 1024² ≈ 975.078 × 1000², so both counters report the same underlying rate.

Implementation-wise, there are a couple of decisions I'm not sure were the best:
- Each of new utils above is declared and defined in different files. I did this because I could not find a good place to have them all, and they seem to belong to different utilities. Please let me know if there is a better place for them.
- All the new utils are defined in the global namespace since other util functions seem to have been defined in the same way. Please let me know if this is not the convention.

Authors:
  - Jihoon Son (https://github.com/jihoonson)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - David Wendt (https://github.com/davidwendt)

URL: #16126