
Add bytes_per_second to groupby max benchmark. #13984

Draft · wants to merge 1 commit into base: branch-23.10

Conversation

@Blonck (Contributor) commented Aug 28, 2023

This patch adds memory statistics for the GROUPBY_NVBENCH benchmark using the max aggregation.

For this purpose, helper functions are introduced to compute the payload size of:

  • Column
  • Table
  • Groupby execution results

This patch relates to #13735.
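
The helpers themselves aren't quoted in this thread; as a rough illustration, here is a minimal sketch of what such payload-size helpers could look like. The name `required_bytes` comes from the discussion below, but the bodies are assumptions, not the PR's actual code:

```
#include <cudf/column/column_view.hpp>
#include <cudf/null_mask.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/utilities/traits.hpp>
#include <cudf/utilities/type_dispatcher.hpp>

#include <cstdint>

// Payload of a single column: fixed-width data, the null mask if present,
// and all child columns (e.g. the offsets/chars children of a strings
// column, or the children of nested types).
int64_t required_bytes(cudf::column_view const& view)
{
  int64_t size = 0;
  // Only fixed-width types carry data in the parent; nested and
  // variable-width types store their data in child columns.
  if (cudf::is_fixed_width(view.type())) {
    size = static_cast<int64_t>(view.size()) * cudf::size_of(view.type());
  }
  if (view.nullable()) {
    size += static_cast<int64_t>(cudf::bitmask_allocation_size_bytes(view.size()));
  }
  for (auto it = view.child_begin(); it != view.child_end(); ++it) {
    size += required_bytes(*it);  // recurse into children
  }
  return size;
}

// Payload of a table: sum over its columns.
int64_t required_bytes(cudf::table_view const& table)
{
  int64_t size = 0;
  for (auto const& col : table) { size += required_bytes(col); }
  return size;
}
```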

@rapids-bot (bot) commented Aug 28, 2023

Pull requests from external contributors require approval from a rapidsai organization member with write permissions or greater before CI can begin.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Aug 28, 2023
@Blonck (Contributor, Author) commented Aug 28, 2023

Hi, I added a few helper functions that I would also use for the other groupby benchmarks.

PS: Should I add the other groupby benchmarks to this PR to reduce the number of PRs for #13735 a bit?

@PointKernel (Member) commented:

Would other benchmarks benefit from those helper functions as well? If not, we can move them to `benchmarks/groupby/group_common.hpp`.

> Should I add the other groupby benchmarks to this PR to reduce the number of PRs for #13735 a bit?

Sounds reasonable to me.

@PointKernel PointKernel added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function feature request New feature or request and removed improvement Improvement / enhancement to an existing function labels Aug 28, 2023
@Blonck (Contributor, Author) commented Aug 28, 2023

> Would other benchmarks benefit from those helper functions as well? If not, we can move them to `benchmarks/groupby/group_common.hpp`.

I think so. They could be used in a few of the benchmarks to simplify the code.

@davidwendt (Contributor) commented:

/ok to test

@Blonck Blonck force-pushed the processed_bytes_groupby_max_bench branch from 5879d40 to 405ca60 Compare August 30, 2023 19:16
@Blonck (Contributor, Author) commented Aug 30, 2023

Hi, the PR is not ready; I haven't had enough time over the last two days, so there's no need to review it yet 😃.

@copy-pr-bot (bot) commented Aug 30, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@Blonck Blonck force-pushed the processed_bytes_groupby_max_bench branch from 405ca60 to 9a54ab1 Compare August 31, 2023 15:50
This patch adds memory statistics for the GROUPBY_NVBENCH benchmarks.

For this purpose helper functions are introduced to compute the payload
size for:
  - Column
  - Table
  - Groupby execution results

This patch relates to rapidsai#13735.
@Blonck Blonck force-pushed the processed_bytes_groupby_max_bench branch from 9a54ab1 to c5f49ee Compare August 31, 2023 15:53
@Blonck (Contributor, Author) commented Aug 31, 2023

Hi @PointKernel, @davidwendt,

in the current state of the PR, every NVBENCH benchmark has bytes_per_second added. However, I'm a bit unhappy with it: specifically, I'd like to introduce test code to validate the calculations for required_bytes, especially when dealing with nested column types.
I'm considering adding these utility functions to the test/utilities directory and creating corresponding tests there. I've noticed that similar utility functions used in the tests are also employed within the benchmarks. However, I think that housing benchmark-specific helper functions under the test directories is not optimal. Do you have any better ideas?

PS: Does something like a nested column iterator exist, i.e. one that iterates over all columns of a table, including child columns? I've not found anything like this, and I'm also unsure whether it's feasible to implement, but it would be great for required_bytes 😅.

PPS: benchmark output perf_log.txt

@jihoonson (Contributor) commented:

Hi @Blonck, this is nice work and I'd like to get this change merged. However, it seems that you haven't had a chance to finish it up. Would you mind if I take it from where you left off and finish it?

@davidwendt (Contributor) commented:

You may want to look into using cudf::row_bit_count to get the column/table sizes: https://docs.rapids.ai/api/libcudf/stable/group__transformation__transform.html#gaa78354f8eda093519182149710215d6f
It already handles all cudf types. The output is the number of bits per row, but you can use cudf::reduce to get the total size and divide by 8 to get the number of bytes.
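
A minimal sketch of that suggestion, assuming a hypothetical helper named `table_size_bytes` (not code from this PR):

```
#include <cudf/aggregation.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/table/table_view.hpp>
#include <cudf/transform.hpp>

#include <climits>
#include <cstdint>

// Total payload of a table in bytes: row_bit_count() yields one INT32
// bit count per row; a sum reduction totals them, then bits -> bytes.
int64_t table_size_bytes(cudf::table_view const& input)
{
  auto const bit_counts = cudf::row_bit_count(input);
  auto const total_bits =
    cudf::reduce(bit_counts->view(),
                 *cudf::make_sum_aggregation<cudf::reduce_aggregation>(),
                 cudf::data_type{cudf::type_id::INT64});
  auto const& scalar = static_cast<cudf::numeric_scalar<int64_t> const&>(*total_bits);
  return scalar.value() / CHAR_BIT;
}
```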

@jihoonson (Contributor) commented:

Thanks @davidwendt, this is very useful! I will check out the function you suggested.

rapids-bot bot pushed a commit that referenced this pull request Jul 2, 2024
…ks (#16126)

This PR addresses #13735 for reduction benchmarks. There are 3 new utils added.

- `int64_t estimate_size(cudf::table_view)` returns a size estimate for the given table. #13984 was a previous attempt to add a similar utility, but this implementation uses `cudf::row_bit_count()` as suggested in #13984 (comment) instead of manually estimating the size.
- `void set_items_processed(State& state, int64_t items_processed_per_iteration)` is a thin wrapper around `State.SetItemsProcessed()`. This wrapper takes `items_processed_per_iteration` as a parameter instead of `total_items_processed`, which avoids repeating `State.iterations() * items_processed_per_iteration` in each benchmark class (a sketch follows this list).
- `void set_throughputs(nvbench::state& state)` is added as a workaround for NVIDIA/nvbench#175. We sometimes want to set throughput statistics after `state.exec()` calls especially when it is hard to estimate the result size upfront.
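
As an illustration, a minimal sketch of the `set_items_processed` wrapper; the signature is from the description above, but the body is an assumption:

```
#include <benchmark/benchmark.h>

#include <cstdint>

// Thin wrapper over State::SetItemsProcessed(): callers pass the
// per-iteration item count, and the multiplication by the iteration
// count happens in exactly one place.
void set_items_processed(benchmark::State& state, int64_t items_processed_per_iteration)
{
  state.SetItemsProcessed(state.iterations() * items_processed_per_iteration);
}
```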

Here are snippets of reduction benchmarks after this change.

```
$ cpp/build/benchmarks/REDUCTION_BENCH
...
-----------------------------------------------------------------------------------------------------------------
Benchmark                                                       Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------
Reduction/bool_all/10000/manual_time                        10257 ns        26845 ns        68185 bytes_per_second=929.907M/s items_per_second=975.078M/s
Reduction/bool_all/100000/manual_time                       11000 ns        27454 ns        63634 bytes_per_second=8.46642G/s items_per_second=9.09075G/s
Reduction/bool_all/1000000/manual_time                      12671 ns        28658 ns        55261 bytes_per_second=73.5018G/s items_per_second=78.922G/s
...

$ cpp/build/benchmarks/REDUCTION_NVBENCH
...
## rank_scan

### [0] NVIDIA RTX A5500

|        T        | null_probability | data_size | Samples |  CPU Time  | Noise  |  GPU Time  | Noise |  Elem/s  | GlobalMem BW |  BWUtil   |
|-----------------|------------------|-----------|---------|------------|--------|------------|-------|----------|--------------|-----------|
|             I32 |                0 |     10000 |  16992x |  33.544 us | 14.95% |  29.446 us | 5.58% |  82.321M |   5.596 TB/s |   728.54% |
|             I32 |              0.1 |     10000 |  16512x |  34.358 us | 13.66% |  30.292 us | 2.87% |  80.020M |   5.286 TB/s |   688.17% |
|             I32 |              0.5 |     10000 |  16736x |  34.058 us | 14.31% |  29.890 us | 3.40% |  81.097M |   5.430 TB/s |   706.89% |
...
```

Note that, when the data type is a 1-byte-width type in the google benchmark result summary, `bytes_per_second` appears to be smaller than `items_per_second`. This is because the former is displayed in multiples of 1024 whereas the latter is displayed in multiples of 1000; for example, 929.907 × 1024² ≈ 975.078 × 1000², so both counters report the same underlying rate.

Implementation-wise, there are a couple of decisions I'm not sure were the best:
- Each of new utils above is declared and defined in different files. I did this because I could not find a good place to have them all, and they seem to belong to different utilities. Please let me know if there is a better place for them.
- All the new utils are defined in the global namespace since other util functions seem to have been defined in the same way. Please let me know if this is not the convention.

Authors:
  - Jihoon Son (https://github.com/jihoonson)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - David Wendt (https://github.com/davidwendt)

URL: #16126