Releases: NVIDIA/cuDecomp
v0.4.2
What's Changed
This patch release fixes several build-related issues, including outdated CMake include search paths that prevented NVSHMEM 3.x support and incorrect naming of the single precision C2C benchmark executable. Other changes include small corrections to command line argument handling in the benchmark program and functionality updates to the Taylor Green example.
PRs included in this release
- Update CMake NVSHMEM include search paths for NVSHMEM 3.x. (#34)
- Fix integer conversion of skip_threshold in benchmark program. (#35)
- Fix scaling overflow for large grids in R2C benchmark. Correct compilation defines for single precision C2C benchmark. (#37)
- Taylor Green example updates. (#36)
Full Changelog: v0.4.1...v0.4.2
v0.4.1
What's Changed
This patch release fixes a bug, introduced in v0.4.0, in the handling of processor dimensions (pdims) during autotuning when a fixed process grid is supplied.
PRs included in this release
- Fix transposed pdims during autotuning. (#29)
- Make CMake library include directory handling more robust. (#30)
Full Changelog: v0.4.0...v0.4.1
v0.4.0
What's Changed
This release includes a new CMake build process, new and improved autotuning configuration options, and compilation fixes for newer NVHPC releases with CUTENSOR 2.0. This release also includes initial opt-in support for NCCL User Buffer registration.
Breaking changes
- #21 changed the attribute transpose_use_inplace_buffers in cudecompGridDescAutotuneOptions_t to an array of boolean values from a single value. This will require updates to C++ code using this autotuning option.
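
As a rough illustration of the migration this breaking change calls for, the sketch below uses a hypothetical stand-in struct rather than the real cuDecomp header. The names `AutotuneOptionsSketch`, `kNumTransposeOps`, `make_options`, and `count_inplace` are invented for this example, and the array length of four (one flag per transpose operation) is an assumption; check the cuDecomp headers for the actual definition.

```cpp
// Assumption: one in-place flag per transpose operation, four in total.
constexpr int kNumTransposeOps = 4;

// Hypothetical stand-in for the relevant member of
// cudecompGridDescAutotuneOptions_t; not the real definition.
struct AutotuneOptionsSketch {
  bool transpose_use_inplace_buffers[kNumTransposeOps];
};

// Code written against the old single-value attribute looked like:
//   options.transpose_use_inplace_buffers = true;
// With the array form, each operation is set individually.
AutotuneOptionsSketch make_options(bool inplace_for_all) {
  AutotuneOptionsSketch options{};  // value-initialize: all flags false
  for (int i = 0; i < kNumTransposeOps; ++i) {
    options.transpose_use_inplace_buffers[i] = inplace_for_all;
  }
  return options;
}

// Helper for this sketch only: counts operations flagged as in-place.
int count_inplace(const AutotuneOptionsSketch& options) {
  int count = 0;
  for (int i = 0; i < kNumTransposeOps; ++i) {
    if (options.transpose_use_inplace_buffers[i]) {
      ++count;
    }
  }
  return count;
}
```

Code that previously assigned a single boolean can reproduce the old behavior by filling every entry with the same value, as `make_options` does, and then override individual entries where per-operation control is wanted.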
PRs included in this release
- Allow to skip certain transpose operations during autotuning. (#16)
- Remove unneeded 4 GPU restriction on Fortran autotune example. (#17)
- Add CMake build (#15)
- Enable autotuner to skip slow configurations via new skip_threshold option (#18)
- Lowering CMake build optimization level for host code (#19)
- Add support for CUTENSOR 2.0. (#20)
- Enable per operation setting for in-place usage when autotuning. (#21)
- Enable applying weights to individual transpose operation timings during autotuning (#22)
- Add support for NCCL user buffer registration (#23)
- Add MPI_Barrier call in NCCL initialization code. (#24)
- Make CMake detection of NVHPC compilers more robust (#26)
- Move NVSHMEM kernels into separate file to limit application of -rdc=true. (#28)
Full Changelog: v0.3.1...v0.4.0
v0.3.1
v0.3.0
v0.2.0
This release includes some minor bug fixes and quality of life improvements.
Changes:
- Renaming of optional arguments in Fortran interface. (#2)
Bugfixes: