Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark all cuco kernels as static so they have hidden visibility #422

Merged
merged 4 commits into from
Jan 19, 2024

Conversation

robertmaynard
Copy link
Collaborator

This marks all kernels in CUCO as static so that they have internal linkage and won't conflict when used by multiple DSOs.

I didn't see a single shared/common header in cuco where I could place a CUCO_KERNEL macro so I modified each instance instead.
While cccl went with a __attribute__ ((visibility ("hidden"))) approach to help reduce RDC size, this approach seemed very invasive for cuco. This is due to the fact that we would need to pragma push and pop both gcc warnings and nvcc warnings in each cuco header so that we don't introduce any warnings. This is needed as the compiler incorrectly state that the __attribute__ ((visibility ("hidden"))) has no side-effect.

Context:
rapidsai/cudf#14726
NVIDIA/cccl#166
rapidsai/raft#1722

Copy link

copy-pr-bot bot commented Jan 9, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@PointKernel PointKernel added the type: improvement Improvement / enhancement to an existing function label Jan 9, 2024
@PointKernel
Copy link
Member

/ok to test

@PointKernel
Copy link
Member

PointKernel commented Jan 9, 2024

I didn't see a single shared/common header in cuco where I could place a CUCO_KERNEL macro so I modified each instance instead.

https://github.com/NVIDIA/cuCollections/blob/dev/include/cuco/detail/utility/cuda.hpp probably the best place for such a macro

@robertmaynard robertmaynard force-pushed the bug/mark_kernels_as_hidden branch from a8e41b5 to b520b69 Compare January 10, 2024 15:26
@robertmaynard
Copy link
Collaborator Author

I didn't see a single shared/common header in cuco where I could place a CUCO_KERNEL macro so I modified each instance instead.

https://github.com/NVIDIA/cuCollections/blob/dev/include/cuco/detail/utility/cuda.hpp probably the best place for such a macro

Updated to introduce a CUCO_KERNEL macro to cuda.hpp.

Copy link
Collaborator

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since cuco is header-only and may be included in projects that are building with rdc=true, I think it is more important to preserve the __attribute__ ((visibility ("hidden")) functionality when __CUDA_RDC__ is defined.

As you mentioned, you will need to universally silence the -Wattributes warning as seen here: https://github.com/NVIDIA/cccl/blob/2771c61545eff4ec3cede24ba7963c4eebc4bbaf/cub/cub/util_macro.cuh#L128-L140

@robertmaynard
Copy link
Collaborator Author

robertmaynard commented Jan 10, 2024

The issue I have with CCCL approach is that it has side effects on all functions inside the TU and not just the CCCL functions since the suppressions aren't push/popped.

This could hide real issues in user code, and why I don't recommend universally silencing the warnings. But I will implement what ever approach the CuCo team wants, which by the sound of it is universal suppression

@sleeepyjack
Copy link
Collaborator

sleeepyjack commented Jan 10, 2024

I'm trying to wrap my head around this. So static is still correct, but we have an increased binary size in case rdc=true?

Using the hidden attribute would also help with the binary size, but adding pragma push/pop to every header (does it need to be every header or just those that directly emit the warning?) is very invasive. Can we prefix the CUCO_KERNEL macro with #pragma warning ( suppress: FooWarning) instead?

@robertmaynard
Copy link
Collaborator Author

I'm trying to wrap my head around this. So static is still correct, but we have an increased binary size in case rdc=true?

Correct.

Using the hidden attribute would also help with the binary size, but adding pragma push/pop to every header (does it need to be every header or just those that directly emit the warning?) is very invasive. Can we prefix the CUCO_KERNEL macro with #pragma warning ( suppress: FooWarning) instead?

It would help with binary size only in case of rdc=true. You are correct it will only need to be the headers that have kernels.

Can we prefix the CUCO_KERNEL macro with #pragma warning ( suppress: FooWarning) instead?
You can but that will effect all following code, which would include user code inside any translation unit that included the header.

@sleeepyjack
Copy link
Collaborator

You can but that will effect all following code, which would include user code inside any translation unit that included the header.

Ah, ok, I thought #pragma warning ( suppress: would only affect the line that directly follows, not the entire TU.

I would prefer the hidden annotation solution over static, but if it's too much work I'm fine with the latter too.

Most of the kernels in cuco are already isolated in separate header files, so we would need to add the push/pop pragmas to 8 files in total (ignoring any kernels in test/benchmark directories):

  • include/cuco/detail/open_addressing/kernels.cuh
  • include/cuco/detail/static_map/kernels.cuh
  • include/cuco/detail/static_multimap/kernels.cuh
  • include/cuco/detail/static_set/kernels.cuh
  • include/cuco/detail/trie/dynamic_bitset/kernels.cuh
  • include/cuco/detail/storage/kernels.cuh
  • include/cuco/detail/dynamic_map_kernels.cuh
  • include/cuco/detail/static_map_kernels.cuh (legacy map impl)

@robertmaynard
Copy link
Collaborator Author

I would prefer the hidden annotation solution over static

I will go forward with that approach 👍

@robertmaynard
Copy link
Collaborator Author

@sleeepyjack
After testing with the cuco examples and libcudf, we can't pop the macros as desired.
This due to how the __attribute__ ((visibility ("hidden"))) is applied during nvcc internal code generation. Consuming TUs will also generate attribute warnings and those can only be captured by not popping the pragmas.

@robertmaynard robertmaynard force-pushed the bug/mark_kernels_as_hidden branch from 404006a to c277ab7 Compare January 11, 2024 17:05
@PointKernel
Copy link
Member

/ok to test

@sleeepyjack
Copy link
Collaborator

sleeepyjack commented Jan 11, 2024

Consuming TUs will also generate attribute warnings and those can only be captured by not popping the pragmas.

So this means we globally disable this warning in user land? Not ideal. @jrhemstad any ideas?

@jrhemstad
Copy link
Collaborator

So this means we globally disable this warning in user land? Not ideal. @jrhemstad any ideas?

That's precisely what we did in every CCCL header. It's a pretty innocuous warning to silence, and the benefits far outweigh any disadvantages.

rapids-bot bot pushed a commit to rapidsai/rapids-cmake that referenced this pull request Jan 18, 2024
Patch both CCCL and CUCO to have only internal linkage.

For cuco I am working on upstreaming these changes ( NVIDIA/cuCollections#422 ). Once that is accepted and we have validated that moving cuco is stable ( e.g. changes around `cuco::experimental::static_set` ) we can drop this patch set.

For cccl the long term fix is to move to CCCL 2.3+, but due to issues ( NVIDIA/cccl#1249, maybe others ) that isn't viable for the 24.02 timeframe.
Since the CCCL changes mean C++ and CUDA sources have non compatible ABI's, we need to specify `THRUST_DISABLE_ABI_NAMESPACE` and `THRUST_IGNORE_ABI_NAMESPACE_ERROR` so that we don't change ABI in rapids-cmake consumers since they expect 2.2 behavior.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #523
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked with @sleeepyjack earlier today, this PR is ready to ship once we revert changes in the example. Examples are considered as user code thus no to need change it.

examples/static_set/device_subsets_example.cu Outdated Show resolved Hide resolved
examples/static_set/device_subsets_example.cu Outdated Show resolved Hide resolved
@robertmaynard
Copy link
Collaborator Author

Checked with @sleeepyjack earlier today, this PR is ready to ship once we revert changes in the example. Examples are considered as user code thus no to need change it.

I have integrated the changes to examples. So you can merge whenever you are ready.
Thanks!

@sleeepyjack
Copy link
Collaborator

/ok to test

@PointKernel PointKernel merged commit 75c9613 into NVIDIA:dev Jan 19, 2024
15 checks passed
PointKernel pushed a commit to PointKernel/rapids-cmake that referenced this pull request Jan 23, 2024
Patch both CCCL and CUCO to have only internal linkage.

For cuco I am working on upstreaming these changes ( NVIDIA/cuCollections#422 ). Once that is accepted and we have validated that moving cuco is stable ( e.g. changes around `cuco::experimental::static_set` ) we can drop this patch set.

For cccl the long term fix is to move to CCCL 2.3+, but due to issues ( NVIDIA/cccl#1249, maybe others ) that isn't viable for the 24.02 timeframe.
Since the CCCL changes mean C++ and CUDA sources have non compatible ABI's, we need to specify `THRUST_DISABLE_ABI_NAMESPACE` and `THRUST_IGNORE_ABI_NAMESPACE_ERROR` so that we don't change ABI in rapids-cmake consumers since they expect 2.2 behavior.

Authors:
  - Robert Maynard (https://github.com/robertmaynard)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#523
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Mar 5, 2024
This is to remove the row conversion code from libcudf. It was move from spark-rapids-jni (by #14664) to temporarily workaround the issue due to conflict of kernel names that causes invalid memory access when calling to `thrust::in(ex)clusive_scan` (NVIDIA/spark-rapids-jni#1567).

Now we have fixes for the namespace visibility issue (by marking all libcudf kenels private in rapidsai/rapids-cmake#523 and NVIDIA/cuCollections#422) and need to move back the code.

Closes #14853.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #15234
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: improvement Improvement / enhancement to an existing function
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants