Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bytes_per_second to transpose benchmark #14170

Merged
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 19 additions & 6 deletions cpp/benchmarks/transpose/transpose.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -40,16 +40,29 @@ static void BM_transpose(benchmark::State& state)
cuda_event_timer raii(state, true);
auto output = cudf::transpose(input);
}

// Collect memory statistics.
auto const bytes_read = input.num_columns() * input.num_rows() * (sizeof(int32_t));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to avoid potential future type mismatches that result in wrong bytes/s reports. So I think you should stash the type_id in a variable above:

constexpr auto column_type = cudf::type_id::INT32;

And then here use CUDF's id_to_type utility:

Suggested change
auto const bytes_read = input.num_columns() * input.num_rows() * (sizeof(int32_t));
auto const bytes_read = input.num_columns() * input.num_rows() * (sizeof(cudf::id_to_type(column_type)));

See https://docs.rapids.ai/api/libcudf/stable/group__utility__dispatcher#gad7e12b8accf60e7c0e500294e1ee8536

auto const bytes_written = bytes_read;
// Account for nullability in input and output.
auto const null_bytes =
2 * input.num_columns() * cudf::bitmask_allocation_size_bytes(input.num_rows());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: ‏This one could also overflow, I think, perhaps:

Suggested change
2 * input.num_columns() * cudf::bitmask_allocation_size_bytes(input.num_rows());
2 * static_cast<uint64_t>(input.num_columns()) * cudf::bitmask_allocation_size_bytes(input.num_rows());

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about this one? Since the return type of cudf::bitmask_allocation_size_bytes is std::size_t which is either unsigned long or unsigned long long so for reasonable input sizes the integer type promotion will avoid the overflow (https://cppinsights.io/s/26f977cb).

That said, just having this discussion indicates I should've included an explicit cast upfront to clear up any potential confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left-to-right associativity means that this is evaluated as (2 * ncol) * nrow, the first multiplication is performed in size_type (AKA, int32_t), so that could overflow, no? Although I think these benchmarks are generally run with fewer than $2^{30}$ rows, there's in general no reason why they couldn't be (although the transpose performance will be terrible I grant you).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just always put the thing that returns size_t (sizeof or bitmask_allocation_size_bytes) first in the arithmetic in all of these PRs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I would keep the cast explicit to make visible what is happening, but I don't have a strong stance on it.


state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) *
(bytes_read + bytes_written + null_bytes));
}

class Transpose : public cudf::benchmark {};

#define TRANSPOSE_BM_BENCHMARK_DEFINE(name) \
BENCHMARK_DEFINE_F(Transpose, name)(::benchmark::State & state) { BM_transpose(state); } \
BENCHMARK_REGISTER_F(Transpose, name) \
->RangeMultiplier(4) \
->Range(4, 4 << 13) \
->UseManualTime() \
#define TRANSPOSE_BM_BENCHMARK_DEFINE(name) \
BENCHMARK_DEFINE_F(Transpose, name)(::benchmark::State & state) \
{ \
BM_transpose(state); \
} \
BENCHMARK_REGISTER_F(Transpose, name) \
->RangeMultiplier(4) \
->Range(4, 4 << 13) \
->UseManualTime() \
->Unit(benchmark::kMillisecond);

TRANSPOSE_BM_BENCHMARK_DEFINE(transpose_simple);