Releases: ROCm/hipCUB
hipCUB 3.3.0 for ROCm 6.3.0
Added
- Support for large indices in hipcub::DeviceSegmentedReduce::* has been added, with the exception of DeviceSegmentedReduce::Arg*. Although rocPRIM's backend provides support for all reduce variants, CUB does not support large indices in DeviceSegmentedReduce::Arg*. For this reason, large index support is not available for hipcub::DeviceSegmentedReduce::Arg*.
Changed
- Changed the default value of rmake.py -a to default_gpus. This is equivalent to gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201.
- The NVIDIA backend now requires CUB, Thrust, and libcu++ 2.3.2.
Resolved issues
- Fixed an issue in rmake.py where the list storing cmake options would contain individual characters instead of a full string of options.
- Fixed an issue where config.hpp was not included in all hipCUB headers, resulting in build errors.
hipCUB 3.2.1 for ROCm 6.2.4
Added
- Added support for gfx1151.
hipCUB 3.2.0 for ROCm 6.2.2
hipCUB code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.
hipCUB 3.2.0 for ROCm 6.2.1
hipCUB code for ROCm 6.2.1 did not change. The library was rebuilt for the updated ROCm 6.2.1 stack.
hipCUB 3.2.0 for ROCm 6.2.0
Added
- Added the DeviceCopy function to have parity with CUB.
- In the rocPRIM backend, added enum WarpExchangeAlgorithm, which is used as the new optional template argument for WarpExchange.
  - The potential values for the enum are WARP_EXCHANGE_SMEM and WARP_EXCHANGE_SHUFFLE.
  - WARP_EXCHANGE_SMEM stands for the previous algorithm, while WARP_EXCHANGE_SHUFFLE performs the exchange via shuffle operations.
  - WARP_EXCHANGE_SHUFFLE does not require any pre-allocated shared memory, but the ItemsPerThread must be a divisor of WarpSize.
- Added tuple.hpp, which defines the templates hipcub::tuple, hipcub::tuple_element, hipcub::tuple_element_t and hipcub::tuple_size.
- Added new overloaded member functions to BlockRadixSort and DeviceRadixSort that expose a decomposer argument. Keys of a custom type (key_type) can be sorted via these overloads if an appropriate decomposer is passed. The decomposer has to implement operator()(const key_type&), which returns a hipcub::tuple of references pointing to members of key_type.
- On AMD GPUs (using the HIP backend), it is possible to issue hipCUB API calls inside of hipGraphs, with several exceptions:
  - CachingDeviceAllocator
  - GridBarrierLifetime
  - DeviceSegmentedRadixSort
  - DeviceRunLengthEncode

  Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of hipGraphs.
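The divisibility requirement for WARP_EXCHANGE_SHUFFLE can be checked at compile time before an algorithm is chosen. The following is a minimal sketch in plain C++ that does not use hipCUB. The enum exchange_algorithm and the helpers shuffle_exchange_allowed and pick_exchange_algorithm are hypothetical names introduced for illustration, and the policy of preferring the shuffle variant whenever it is allowed is an assumption, not a recommendation from these release notes.

```cpp
// Hypothetical stand-in for the two WarpExchangeAlgorithm values; this is
// not the hipCUB enum itself.
enum class exchange_algorithm
{
    smem,    // mirrors WARP_EXCHANGE_SMEM
    shuffle  // mirrors WARP_EXCHANGE_SHUFFLE
};

// True when items_per_thread is a divisor of warp_size, which is the
// condition stated for the shuffle-based exchange.
constexpr bool shuffle_exchange_allowed(int items_per_thread, int warp_size)
{
    return items_per_thread > 0 && warp_size > 0
           && warp_size % items_per_thread == 0;
}

// Assumed policy: use the shuffle variant when the divisibility condition
// holds, otherwise fall back to the shared-memory variant.
constexpr exchange_algorithm pick_exchange_algorithm(int items_per_thread,
                                                     int warp_size)
{
    return shuffle_exchange_allowed(items_per_thread, warp_size)
               ? exchange_algorithm::shuffle
               : exchange_algorithm::smem;
}

static_assert(shuffle_exchange_allowed(4, 64), "4 divides 64");
static_assert(!shuffle_exchange_allowed(3, 32), "3 does not divide 32");
```

In real code, the result of such a check would decide which WarpExchangeAlgorithm value is passed as the optional template argument of WarpExchange. That wiring is left out here because it requires the hipCUB headers.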
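To illustrate what a decomposer for the BlockRadixSort and DeviceRadixSort overloads might look like, here is a minimal sketch in plain C++. It makes several assumptions: std::tuple stands in for hipcub::tuple so that the code compiles without HIP, and custom_key, custom_key_decomposer and decomposed_refers_to_members are hypothetical names. Whether the real overloads expect const or non-const references is not stated above, so check the hipCUB headers before adapting this.

```cpp
#include <tuple>

// Hypothetical key type whose two members together define the sort order.
struct custom_key
{
    float weight;
    int   id;
};

// Hypothetical decomposer: exposes the members of custom_key as a tuple of
// references. std::tuple is used here as a stand-in for hipcub::tuple.
struct custom_key_decomposer
{
    std::tuple<const float&, const int&> operator()(const custom_key& key) const
    {
        // std::tie builds a tuple of references to the given members.
        return std::tie(key.weight, key.id);
    }
};

// Checks that the returned references point at the members of the key
// itself rather than at copies.
inline bool decomposed_refers_to_members(const custom_key& key)
{
    const auto parts = custom_key_decomposer{}(key);
    return &std::get<0>(parts) == &key.weight
           && &std::get<1>(parts) == &key.id;
}
```

With hipCUB, an object of such a type would be passed as the decomposer argument of the new overloads, with hipcub::tuple as the return type.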
Changed
- The NVIDIA backend now requires CUB, Thrust and libcu++ 2.2.0. If they are not found, they will be downloaded from the NVIDIA CCCL repository.
Fixed
- Fixed the derivation for the accumulator type for device scan algorithms in the rocPRIM backend being different compared to CUB. It now derives the accumulator type as the result of the binary operator.
- debug_synchronous has been deprecated in hipCUB-2.13.2, and it no longer has any effect. With this release, passing debug_synchronous to the device functions results in a deprecation warning both at runtime and at compile time.
  - The synchronization that was previously achievable by passing debug_synchronous=true can now be achieved at compile time by setting the CUB_DEBUG_SYNC (or higher debug level) or the HIPCUB_DEBUG_SYNC preprocessor definition.
  - The compile time deprecation warnings can be disabled by defining the HIPCUB_IGNORE_DEPRECATED_API preprocessor definition.
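Deriving the accumulator type as the result of the binary operator can be illustrated in plain C++17. This is a sketch of the general idea, not the hipCUB or CUB implementation: accumulator_t and widening_sum are hypothetical names, and applying the operator to two values of the input type is an assumption, since the notes do not say which operand types are used.

```cpp
#include <functional>
#include <type_traits>

// Hypothetical alias: the accumulator type is whatever the binary operator
// returns when applied to two values of the input type.
template <class BinaryOp, class Input>
using accumulator_t =
    std::decay_t<std::invoke_result_t<BinaryOp, Input, Input>>;

// Hypothetical operator that returns a wider type than its operands.
struct widening_sum
{
    long long operator()(int a, int b) const
    {
        return static_cast<long long>(a) + b;
    }
};

// The accumulator follows the operator's return type, not the input type.
static_assert(std::is_same_v<accumulator_t<widening_sum, int>, long long>);

// std::plus<> on two shorts yields int because of integral promotion.
static_assert(std::is_same_v<accumulator_t<std::plus<>, short>, int>);
```

With such a rule, a scan over int input using an operator like widening_sum would accumulate in long long instead of truncating intermediate results to int.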
hipCUB 3.1.0 for ROCm 6.1.2
hipCUB code for ROCm 6.1.2 did not change. The library was rebuilt for the updated ROCm 6.1.2 stack.
hipCUB 3.1.0 for ROCm 6.1.1
hipCUB code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.
hipCUB 3.1.0 for ROCm 6.1.0
Changed
- CUB backend references CUB and Thrust version 2.1.0.
- Updated HIPCUB_HOST_WARP_THREADS macro definition to match host_warp_size changes from rocPRIM 3.0.
- Implemented __int128_t and __uint128_t support for radix_sort.
Fixed
- Fixed build issues with rmake.py on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.
Added
- Added interface DeviceMemcpy::Batched for batched memcpy from rocPRIM and CUB.
hipCUB 3.0.0 for ROCm 6.0.2
hipCUB code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.
hipCUB 3.0.0 for ROCm 6.0.0
Changed
- Removed DOWNLOAD_ROCPRIM. Forcing rocPRIM to download can be done with DEPENDENCIES_FORCE_DOWNLOAD.