v0.4.0
What's Changed
This release includes a new CMake build process, new and improved autotuning configuration options, and compilation fixes for newer NVHPC releases with CUTENSOR 2.0. This release also includes initial opt-in support for NCCL User Buffer registration.
Breaking changes
- #21 changed the attribute transpose_use_inplace_buffers in cudecompGridDescAutotuneOptions_t from a single boolean value to an array of boolean values. This will require updates to C++ code using this autotuning option.
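As a rough illustration of the kind of update this implies, the sketch below shows a single-value assignment becoming per-entry assignments. It does not include the real cuDecomp header: AutotuneOptionsSketch is a hypothetical stand-in for the relevant field of cudecompGridDescAutotuneOptions_t, the array length of 4 (one entry per transpose operation) is an assumption, and set_inplace_for_all_ops is an illustrative helper, not part of the library.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical stand-in for the affected field of
// cudecompGridDescAutotuneOptions_t. Real code would include the cuDecomp
// header instead. The length of 4 is an assumption for this sketch.
struct AutotuneOptionsSketch {
  bool transpose_use_inplace_buffers[4];
};

// Before this release the field was a single value, so code could write:
//   options.transpose_use_inplace_buffers = true;
// With an array, that assignment no longer compiles. This illustrative helper
// reproduces the old "one setting for everything" behavior by filling every
// entry with the same value.
inline void set_inplace_for_all_ops(AutotuneOptionsSketch& options, bool use_inplace) {
  std::fill(std::begin(options.transpose_use_inplace_buffers),
            std::end(options.transpose_use_inplace_buffers), use_inplace);
}
```

Code that wants different behavior per operation can assign individual entries instead, for example options.transpose_use_inplace_buffers[1] = false; which index corresponds to which transpose operation should be checked against the cuDecomp documentation.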
PRs included in this release
- Allow to skip certain transpose operations during autotuning. (#16)
- Remove unneeded 4 GPU restriction on Fortran autotune example. (#17)
- Add CMake build (#15)
- Enable autotuner to skip slow configurations via new skip_threshold option (#18)
- Lowering CMake build optimization level for host code (#19)
- Add support for CUTENSOR 2.0. (#20)
- Enable per operation setting for in-place usage when autotuning. (#21)
- Enable applying weights to individual transpose operation timings during autotuning (#22)
- Add support for NCCL user buffer registration (#23)
- Add MPI_Barrier call in NCCL initialization code. (#24)
- Make CMake detection of NVHPC compilers more robust (#26)
- Move NVSHMEM kernels into separate file to limit application of -rdc=true. (#28)
Full Changelog: v0.3.1...v0.4.0