Full documentation for hipCUB is available at https://rocm.docs.amd.com/projects/hipCUB/en/latest/.
- Added regression tests to `rtest.py`. These tests recreate scenarios that have caused hardware problems in past emulation environments. Use `python rtest.py [--emulation|-e|--test|-t]=regression` to run these tests.
- Added extended tests to `rtest.py`. These tests are extra tests that did not fit the criteria of smoke and regression tests. These tests will take much longer to run relative to smoke and regression tests. Use `python rtest.py [--emulation|-e|--test|-t]=extended` to run these tests.
- Added `ForEach`, `ForEachN`, `ForEachCopy`, `ForEachCopyN` and `Bulk` functions to have parity with CUB.
- Added the `hipcub::CubVector` type for CUB parity.
- Added `--emulation` option for `rtest.py`
  - Unit tests can be run with `[--emulation|-e|--test|-t]=<test_name>`
- Added `DeviceSelect::FlaggedIf` and its inplace overload.
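
For readers unfamiliar with the CUB operation that `DeviceSelect::FlaggedIf` mirrors, the following host-side sketch illustrates the selection semantics as we understand them: an item is kept when a predicate applied to its corresponding flag returns true. The name `flagged_if_reference` is hypothetical, and this is not the hipCUB API, which operates on device memory and takes temporary-storage arguments.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference for the selection semantics (assumption: an item is
// copied to the output when the predicate applied to its flag is true).
// `flagged_if_reference` is a hypothetical name, not part of hipCUB.
template <typename T, typename FlagT, typename SelectOp>
std::vector<T> flagged_if_reference(const std::vector<T>&     items,
                                    const std::vector<FlagT>& flags,
                                    SelectOp                  select_op)
{
    std::vector<T>    selected;
    const std::size_t n = items.size() < flags.size() ? items.size() : flags.size();
    for (std::size_t i = 0; i < n; ++i)
    {
        if (select_op(flags[i]))
        {
            selected.push_back(items[i]); // relative order of kept items is preserved
        }
    }
    return selected;
}
```

Calling it with items `{10, 20, 30, 40, 50}`, flags `{1, 4, 3, 8, 5}`, and an "is even" predicate keeps the items whose flags are 4 and 8.
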
- Changed the subset of tests that are run for smoke tests so that the smoke test completes faster and never exceeds 2 GB of VRAM usage. Use `python rtest.py [--emulation|-e|--test|-t]=smoke` to run these tests.
- The `rtest.py` options have changed. `rtest.py` must now be run with either `--test|-t` or `--emulation|-e`, but not both options.
- The NVIDIA backend now requires CUB, Thrust and libcu++ 2.5.0. If they are not found, they will be downloaded from the NVIDIA CCCL repository.
- Changed the C++ version from 14 to 17. C++14 will be deprecated in the next major release.
- Support for large indices in `hipcub::DeviceSegmentedReduce::*` has been added, with the exception of `DeviceSegmentedReduce::Arg*`. Although rocPRIM's backend provides support for all reduce variants, CUB does not support large indices in `DeviceSegmentedReduce::Arg*`. For this reason, large index support is not available for `hipcub::DeviceSegmentedReduce::Arg*`.
- Changed the default value of `rmake.py -a` to `default_gpus`. This is equivalent to `gfx906:xnack-,gfx1030,gfx1100,gfx1101,gfx1102,gfx1151,gfx1200,gfx1201`.
- The NVIDIA backend now requires CUB, Thrust, and libcu++ 2.3.2.
- Fixed an issue in `rmake.py` where the list storing cmake options would contain individual characters instead of a full string of options.
- Fixed an issue where `config.hpp` was not included in all hipCUB headers, resulting in build errors.
- Added the `DeviceCopy` function to have parity with CUB.
- In the rocPRIM backend, added `enum WarpExchangeAlgorithm`, which is used as the new optional template argument for `WarpExchange`.
  - The potential values for the enum are `WARP_EXCHANGE_SMEM` and `WARP_EXCHANGE_SHUFFLE`.
  - `WARP_EXCHANGE_SMEM` stands for the previous algorithm, while `WARP_EXCHANGE_SHUFFLE` performs the exchange via shuffle operations.
  - `WARP_EXCHANGE_SHUFFLE` does not require any pre-allocated shared memory, but the `ItemsPerThread` must be a divisor of `WarpSize`.
- Added `tuple.hpp`, which defines templates `hipcub::tuple`, `hipcub::tuple_element`, `hipcub::tuple_element_t` and `hipcub::tuple_size`.
- Added new overloaded member functions to `BlockRadixSort` and `DeviceRadixSort` that expose a `decomposer` argument. Keys of a custom type (`key_type`) can be sorted via these overloads if an appropriate decomposer is passed. The decomposer has to implement `operator()(const key_type&)`, which returns a `hipcub::tuple` of references pointing to members of `key_type`.
- On AMD GPUs (using the HIP backend), it is possible to issue hipCUB API calls inside of hipGraphs, with several exceptions:
  - CachingDeviceAllocator
  - GridBarrierLifetime
  - DeviceSegmentedRadixSort
  - DeviceRunLengthEncode

  Currently, these classes rely on one or more synchronous calls to function correctly. Because of this, they cannot be used inside of hipGraphs.
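
The decomposer requirement for the `BlockRadixSort` and `DeviceRadixSort` overloads above can be sketched on the host as follows. This is a minimal illustration under stated assumptions: `custom_key` and `custom_key_decomposer` are hypothetical names, and `std::tuple` stands in for `hipcub::tuple` so that the snippet builds without hipCUB. Code meant for the device would presumably also need the usual HIP function qualifiers, which are omitted here.

```cpp
#include <tuple>

// Hypothetical custom key type, used only for illustration.
struct custom_key
{
    float     score;
    long long id;
};

// Decomposer sketch: the call operator returns a tuple of references that
// point to members of the key. std::tuple is a host-side stand-in for
// hipcub::tuple (assumption made so the example compiles without hipCUB).
struct custom_key_decomposer
{
    std::tuple<const float&, const long long&> operator()(const custom_key& key) const
    {
        return std::tie(key.score, key.id);
    }
};
```

Because the tuple holds references rather than copies, the sort can read the members of each key in place.
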
- The NVIDIA backend now requires CUB, Thrust and libcu++ 2.2.0. If it is not found it will be downloaded from the NVIDIA CCCL repository.
- Fixed the accumulator type derivation for device scan algorithms in the rocPRIM backend, which differed from CUB. The accumulator type is now derived as the result type of the binary operator.
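
To illustrate what deriving the accumulator type from the binary operator means, here is a host-side sketch. It is not hipCUB's implementation: `inclusive_scan_reference` and `widening_plus` are hypothetical names, and using `std::invoke_result_t` is an assumption chosen to show the idea in standard C++17.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical binary operator whose result type (long long) is wider than
// the int inputs it will be applied to.
struct widening_plus
{
    long long operator()(long long a, long long b) const { return a + b; }
};

// Inclusive scan whose accumulator type is the result type of the binary
// operator, rather than the input type.
template <typename InputT, typename BinaryOp>
auto inclusive_scan_reference(const std::vector<InputT>& input, BinaryOp op)
{
    using accum_t = std::invoke_result_t<BinaryOp, InputT, InputT>;
    std::vector<accum_t> output;
    output.reserve(input.size());
    if (input.empty())
    {
        return output;
    }
    accum_t acc = input[0];
    output.push_back(acc);
    for (std::size_t i = 1; i < input.size(); ++i)
    {
        acc = op(acc, input[i]); // partial results stay in accum_t
        output.push_back(acc);
    }
    return output;
}
```

With `int` inputs and `widening_plus`, the partial sums are held in `long long`, so totals beyond the `int` range are not truncated.
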
- `debug_synchronous` has been deprecated in hipCUB-2.13.2, and it no longer has any effect. With this release, passing `debug_synchronous` to the device functions results in a deprecation warning both at runtime and at compile time.
  - The synchronization that was previously achievable by passing `debug_synchronous=true` can now be achieved at compile time by setting the `CUB_DEBUG_SYNC` (or higher debug level) or the `HIPCUB_DEBUG_SYNC` preprocessor definition.
  - The compile time deprecation warnings can be disabled by defining the `HIPCUB_IGNORE_DEPRECATED_API` preprocessor definition.
- CUB backend references CUB and Thrust version 2.1.0
- Updated the `HIPCUB_HOST_WARP_THREADS` macro definition to match `host_warp_size` changes from rocPRIM 3.0
- Implemented `__int128_t` and `__uint128_t` support for `radix_sort`
- Build issues with `rmake.py` on Windows when using Visual Studio 2017 15.8 or later (due to a breaking fix with extended aligned storage)
- Interface `DeviceMemcpy::Batched` for batched memcpy from rocPRIM and CUB
- Removed `DOWNLOAD_ROCPRIM`
  - You can force rocPRIM to download using `DEPENDENCIES_FORCE_DOWNLOAD`
- CUB backend references CUB and Thrust version 2.0.1.
- Fixed `DeviceSegmentedReduce::ArgMin` and `DeviceSegmentedReduce::ArgMax` by returning the segment-relative index instead of the absolute one
- Fixed `DeviceSegmentedReduce::ArgMin` for inputs where the segment minimum is smaller than the value returned for empty segments; an equivalent fix is applied to `DeviceSegmentedReduce::ArgMax`
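
The difference between a segment-relative and an absolute index can be shown with a small host-side reference. This is only a sketch: `segmented_argmin_reference` is a hypothetical name, the offsets layout (one begin offset per segment plus a final end offset) is an assumption, and empty segments are deliberately out of scope.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// For each segment [offsets[s], offsets[s + 1]) returns the pair
// (index of the minimum relative to the segment start, minimum value).
// Assumption: every segment is non-empty and lies within `values`.
std::vector<std::pair<int, float>>
segmented_argmin_reference(const std::vector<float>& values, const std::vector<int>& offsets)
{
    std::vector<std::pair<int, float>> result;
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s)
    {
        const int begin = offsets[s];
        const int end   = offsets[s + 1];
        assert(begin < end && "empty segments are not handled in this sketch");
        assert(static_cast<std::size_t>(end) <= values.size());

        int best = begin; // absolute index of the current minimum
        for (int i = begin + 1; i < end; ++i)
        {
            if (values[i] < values[best])
            {
                best = i;
            }
        }
        // Report the index relative to the start of the segment.
        result.emplace_back(best - begin, values[best]);
    }
    return result;
}
```

For values `{5, 3, 9, 7, 8, 1, 4}` and offsets `{0, 3, 7}`, the minimum of the second segment sits at absolute index 5 but is reported as index 2 within its segment.
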
- `debug_synchronous` no longer works on the CUDA platform; use `CUB_DEBUG_SYNC` instead
- `DeviceReduce::Sum` doesn't compile on the CUDA platform for mixed extended-floating-point or floating-point InputT and OutputT types
- `DeviceHistogram::HistogramEven` fails on the CUDA platform for `[LevelT, SampleIteratorT] = [int, int]`
- `DeviceHistogram::MultiHistogramEven` fails on the CUDA platform for `[LevelT, SampleIteratorT] = [int, int/unsigned short/float/double]` and `[LevelT, SampleIteratorT] = [float, double]`
- Benchmarks for `BlockShuffle`, `BlockLoad`, and `BlockStore`
- The CUB backend references CUB and Thrust version 1.17.2
- Improved benchmark coverage for:
  - `BlockScan` by adding `ExclusiveScan`
  - `BlockRadixSort` by adding `SortBlockedToStriped`
  - `WarpScan` by adding `Broadcast`
- Removed references to, and workarounds for, the deprecated hcc
- `BlockRadixRankMatch` is currently broken for the rocPRIM backend
- `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken for the CUB backend
- CMake functionality that improves build parallelism of the test suite by splitting compilation units by function or by parameters
- New overload for `BlockAdjacentDifference::SubtractLeftPartialTile`, which takes a predecessor item
- Improved build parallelism of the test suite by splitting up large compilation units for `DeviceRadixSort`, `DeviceSegmentedRadixSort`, and `DeviceSegmentedSort`
- The CUB backend references CUB and Thrust version 1.17.1
- `BlockRadixRankMatch` is currently broken for the rocPRIM backend
- `BlockRadixRankMatch` with a warp size that does not exactly divide the block size is broken for the CUB backend
- `UniqueByKey` device algorithm
- `SubtractLeft`, `SubtractLeftPartialTile`, `SubtractRight`, and `SubtractRightPartialTile` overloads in `BlockAdjacentDifference`
  - The old overloads (`FlagHeads`, `FlagTails`, `FlagHeadsAndTails`) are deprecated
- `DeviceAdjacentDifference` algorithm
- Extended benchmark suite of `DeviceHistogram`, `DeviceScan`, `DevicePartition`, `DeviceReduce`, `DeviceSegmentedReduce`, `DeviceSegmentedRadixSort`, `DeviceRadixSort`, `DeviceSpmv`, `DeviceMergeSort`, and `DeviceSegmentedSort`
- Obsolete type traits defined in `util_type.hpp`; use the standard library equivalents instead
- CUB backend references CUB and Thrust version 1.16.0
- `DeviceRadixSort` `num_items` parameter type is now templated instead of being an int
  - If an integral type with a maximum size of 4 bytes is passed (an int), the former logic applies; otherwise, the algorithm uses a larger indexing type that makes it possible to sort input data over $2^{32}$ elements
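
One way to picture the size-based choice of indexing type is a small compile-time selection. The sketch below is purely illustrative: `sort_index_t` and `uses_wide_index` are hypothetical names, and the concrete 32-bit and 64-bit types are assumptions rather than the types hipCUB uses internally.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical trait: keep a 32-bit index when the num_items type is at most
// 4 bytes wide, otherwise switch to a 64-bit index. The concrete index types
// are assumptions made for illustration.
template <typename NumItemsT>
using sort_index_t
    = std::conditional_t<(sizeof(NumItemsT) <= 4), std::uint32_t, std::uint64_t>;

// Helper that reports which path a given num_items argument would select.
template <typename NumItemsT>
constexpr bool uses_wide_index(NumItemsT /*num_items*/)
{
    return sizeof(sort_index_t<NumItemsT>) == 8;
}
```

Under these assumptions, passing an `int` count keeps the narrow path, while passing a `long long` count selects the wide index needed for inputs beyond $2^{32}$ elements.
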
- Packages for tests and benchmark executables on all supported operating systems using CPack
- Device segmented sort
- `WarpMergeSort`, `WarpMask`, and thread sort from CUB 1.15.0 are supported in hipCUB
- Device three-way partition
- `device_scan` and `device_segmented_scan`: `inclusive_scan` now uses the `input-type` as `accumulator-type`; `exclusive_scan` uses `initial-value-type`.
  - This changes the behavior of:
    - Small-size input types with large-size output types (e.g., short input, int output)
    - Low-res input with high-res output (e.g., float input, double output)
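
The effect of accumulating in the input type can be reproduced with a host-side sketch. The function `inclusive_sum_with_accumulator` is hypothetical and not part of hipCUB. An unsigned 8-bit input type is used in place of `short` so that the wrap-around is well defined in standard C++.

```cpp
#include <cstdint>
#include <vector>

// Inclusive sum in which the running total is held in AccumT and each partial
// result is converted to OutputT afterwards. Choosing AccumT equal to the
// input type models the behavior described above.
template <typename AccumT, typename OutputT, typename InputT>
std::vector<OutputT> inclusive_sum_with_accumulator(const std::vector<InputT>& input)
{
    std::vector<OutputT> output;
    output.reserve(input.size());
    AccumT acc{};
    for (const InputT& x : input)
    {
        // The sum is narrowed back to AccumT, so a narrow unsigned
        // accumulator wraps around even if OutputT could hold the value.
        acc = static_cast<AccumT>(acc + x);
        output.push_back(static_cast<OutputT>(acc));
    }
    return output;
}
```

For the 8-bit inputs `{200, 100, 10}`, accumulating in `std::uint8_t` yields `{200, 44, 54}` even though the output type is `int`, whereas accumulating in `int` yields `{200, 300, 310}`.
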
- Block merge sort no longer supports non-power of two block sizes
- Grid unit test hangs on HIP for Windows
- Added missing includes to `hipcub.hpp`
- Bfloat16 support to test cases (`device_reduce` and `device_radix_sort`)
- Device merge sort
- Block merge sort
- API update to CUB 1.14.0
- The `SetupNVCC.cmake` automatic target selector selects all of the capabilities for all available cards with the NVIDIA backend
- Initial HIP on Windows support
- Packaging changed to a development package (named `hipcub-dev` for `.deb` packages and `hipcub-devel` for `.rpm` packages). Because hipCUB is a header-only library, there is no runtime package. To aid in the transition, the development package sets the `provides` field to `hipcub`, so existing packages that are dependent on hipCUB can continue to work. This `provides` feature is introduced as a deprecated feature because it will be removed in a future ROCm release.
- gfx1030 support added
- AddressSanitizer build option
- `BlockRadixRank` unit test failure
- `DiscardOutputIterator` to backend header
- Support for `TexObjInputIterator` and `TexRefInputIterator`
- Support for `DevicePartition`
- The minimum CMake version required is now 3.10.2
- The CUB backend has been updated to 1.11.0
- Benchmark build
- NVCC build
- Support for `DiscardOutputIterator`
- No changes
- No changes
- No changes
- No changes
- No changes
- No changes
- Improved tests with fixed and random seeds for test data
- Switched to hip-clang as default compiler
- CMake searches for rocPRIM locally first and, if not found, downloads it from GitHub
- HCC build
- The following unit test failures (due to issues in ROCclr runtime) have been observed:
  - `BlockDiscontinuity`
  - `BlockExchange`
  - `BlockHistogram`
  - `BlockRadixSort`
  - `BlockReduce`
  - `BlockScan`