Releases: NVIDIA/cuDecomp

v0.4.2

30 Oct 17:52
7703aa0

What's Changed

This patch release fixes several build-related issues: CMake include search paths are updated for NVSHMEM 3.x support, and the improperly named single-precision C2C benchmark executable is corrected. Other changes include small corrections to command line argument handling in the benchmark program and functionality updates to the Taylor Green example.

PRs included in this release

  • Update CMake NVSHMEM include search paths for NVSHMEM 3.x. (#34)
  • Fix integer conversion of skip_threshold in benchmark program. (#35)
  • Fix scaling overflow for large grids in R2C benchmark. Correct compilation defines for single precision C2C benchmark. (#37)
  • Taylor Green example updates. (#36)

Full Changelog: v0.4.1...v0.4.2

v0.4.1

20 Apr 23:44

What's Changed

This patch release fixes a bug, introduced in v0.4.0, in the handling of processor grid dimensions (pdims) during autotuning when a fixed process grid is supplied.

PRs included in this release

  • Fix transposed pdims during autotuning. (#29)
  • Make CMake library include directory handling more robust. (#30)

Full Changelog: v0.4.0...v0.4.1

v0.4.0

14 Mar 19:30
b8ffecc

What's Changed

This release includes a new CMake build process, new and improved autotuning configuration options, and compilation fixes for newer NVHPC releases with CUTENSOR 2.0. This release also includes initial opt-in support for NCCL User Buffer registration.

Breaking changes

  • #21 changed the attribute transpose_use_inplace_buffers in cudecompGridDescAutotuneOptions_t from a single boolean value to an array of boolean values. C++ code that sets this autotuning option must be updated.
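
As a rough illustration of what this migration can look like, the sketch below uses hypothetical stand-in structs (MockAutotuneOptionsOld, MockAutotuneOptionsNew) rather than the real cudecompGridDescAutotuneOptions_t from cudecomp.h. The array length of 4 (one entry per transpose operation) and the helper name migrate are assumptions made for the example; check the header for the actual definition.

```cpp
#include <cassert>
#include <cstddef>

// ASSUMPTION: one flag per transpose operation (X->Y, Y->Z, Z->Y, Y->X).
// The real array length is defined in cudecomp.h and may differ.
constexpr std::size_t kNumTransposeOps = 4;

// Hypothetical stand-in for the option before #21: a single flag.
struct MockAutotuneOptionsOld {
  bool transpose_use_inplace_buffers;
};

// Hypothetical stand-in for the option after #21: one flag per operation.
struct MockAutotuneOptionsNew {
  bool transpose_use_inplace_buffers[kNumTransposeOps];
};

// Hypothetical migration helper: copy the old single flag into every
// per-operation slot so existing behavior is preserved.
inline MockAutotuneOptionsNew migrate(const MockAutotuneOptionsOld& old_opts) {
  MockAutotuneOptionsNew new_opts{};
  for (std::size_t i = 0; i < kNumTransposeOps; ++i) {
    new_opts.transpose_use_inplace_buffers[i] =
        old_opts.transpose_use_inplace_buffers;
  }
  return new_opts;
}
```

Code that previously assigned one boolean would instead fill each array entry, after which individual operations can be given different settings (for example, setting only one entry to false).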

PRs included in this release

  • Allow to skip certain transpose operations during autotuning. (#16)
  • Remove unneeded 4 GPU restriction on Fortran autotune example. (#17)
  • Add CMake build (#15)
  • Enable autotuner to skip slow configurations via new skip_threshold option (#18)
  • Lowering CMake build optimization level for host code (#19)
  • Add support for CUTENSOR 2.0. (#20)
  • Enable per operation setting for in-place usage when autotuning. (#21)
  • Enable applying weights to individual transpose operation timings during autotuning (#22)
  • Add support for NCCL user buffer registration (#23)
  • Add MPI_Barrier call in NCCL initialization code. (#24)
  • Make CMake detection of NVHPC compilers more robust (#26)
  • Move NVSHMEM kernels into separate file to limit application of -rdc=true. (#28)

Full Changelog: v0.3.1...v0.4.0

v0.3.1

18 May 16:17
Pre-release

This patch release fixes the handling of large message sizes with the NVSHMEM backend.

Bugfixes:

  • Fixed handling of large message sizes in NVSHMEM backend. (#13)

v0.3.0

24 Apr 17:01
Compare
Choose a tag to compare
v0.3.0 Pre-release
Pre-release

This release includes bug fixes in the handling of user-provided MPI communicators and processor grid configurations yielding empty pencils.

Bugfixes:

  • Fixed handling of user-provided MPI communicators. (#7)
  • Fixed handling of processor grid configurations yielding empty pencils. (#11, #12)

v0.2.0

07 Sep 19:22
Compare
Choose a tag to compare
v0.2.0 Pre-release
Pre-release

This release includes some minor bug fixes and quality-of-life improvements.

Changes:

  • Renaming of optional arguments in Fortran interface. (#2)

Bugfixes:

  • Fixed indexing bug in cudecompGetShiftedRank in Fortran interface. (#1)
  • Fixed bug with NCCL resource reclamation when using multiple grid descriptors. (#4)