
v0.4.0

@romerojosh released this 14 Mar 19:30 · commit b8ffecc

What's Changed

This release includes a new CMake build process, new and improved autotuning configuration options, and compilation fixes for newer NVHPC releases with CUTENSOR 2.0. This release also includes initial opt-in support for NCCL User Buffer registration.

Breaking changes

  • #21 changed the attribute transpose_use_inplace_buffers in cudecompGridDescAutotuneOptions_t from a single boolean value to an array of boolean values. C++ code that sets this autotuning option must be updated.
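
To illustrate the kind of update this breaking change calls for, here is a minimal, self-contained sketch. It does not include the real cuDecomp header: AutotuneOptionsSketch is a hypothetical stand-in that mirrors only the one field discussed above, set_all_inplace and count_inplace are hypothetical helpers, and the array length of 4 (one entry per transpose operation) is an assumption that should be checked against cudecomp.h.

```cpp
#include <cstddef>

// Hypothetical stand-in for cudecompGridDescAutotuneOptions_t; NOT the real type.
// Assumption: one flag per transpose operation, four operations in total.
constexpr std::size_t kNumTransposeOps = 4;

struct AutotuneOptionsSketch {
  // Per the release note, this field was a single bool before #21
  // and is an array of bools afterwards.
  bool transpose_use_inplace_buffers[kNumTransposeOps];
};

// Hypothetical helper: reproduces the old "one value for every operation"
// behavior by writing the same flag into each array entry.
void set_all_inplace(AutotuneOptionsSketch& opts, bool value) {
  for (std::size_t i = 0; i < kNumTransposeOps; ++i) {
    opts.transpose_use_inplace_buffers[i] = value;
  }
}

// Hypothetical helper: counts how many operations are flagged for in-place use.
int count_inplace(const AutotuneOptionsSketch& opts) {
  int n = 0;
  for (std::size_t i = 0; i < kNumTransposeOps; ++i) {
    if (opts.transpose_use_inplace_buffers[i]) {
      ++n;
    }
  }
  return n;
}
```

Under these assumptions, code that previously assigned a single value (for example `opts.transpose_use_inplace_buffers = true;`) would instead call something like `set_all_inplace(opts, true)`, or assign individual entries such as `opts.transpose_use_inplace_buffers[1] = false;` to set the flag for one operation only.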

PRs included in this release

  • Allow skipping certain transpose operations during autotuning. (#16)
  • Remove unneeded 4 GPU restriction on Fortran autotune example. (#17)
  • Add CMake build (#15)
  • Enable autotuner to skip slow configurations via new skip_threshold option (#18)
  • Lowering CMake build optimization level for host code (#19)
  • Add support for CUTENSOR 2.0. (#20)
  • Enable per operation setting for in-place usage when autotuning. (#21)
  • Enable applying weights to individual transpose operation timings during autotuning (#22)
  • Add support for NCCL user buffer registration (#23)
  • Add MPI_Barrier call in NCCL initialization code. (#24)
  • Make CMake detection of NVHPC compilers more robust (#26)
  • Move NVSHMEM kernels into separate file to limit application of -rdc=true. (#28)

Full Changelog: v0.3.1...v0.4.0