v9.0.0 release

@xiaoyeli released this 08 May 20:25

v9.0.0 Release Notes

An example program is EXAMPLE/pddrive3d.c, which calls the driver routine SRC/double/pdgssvx3d.c (or pdgssvx3d_csc_batch.c).

Please cite this ACM TOMS paper when you use these new features.

OpenMP performance hit:
On many systems, the default OMP_NUM_THREADS is set to the total number of CPU cores on a node; for example, it is 128 on Perlmutter at NERSC. This is too high, because most of the algorithms are not efficient in pure threading mode. We recommend that users experiment with a mixed MPI and OpenMP mode, starting with a smaller thread count, by setting:
export OMP_NUM_THREADS=1, or 2, or 3, ....
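For instance, on a 128-core Perlmutter node one might run 8 MPI ranks with 16 OpenMP threads each rather than a single rank with 128 threads. The launcher flags, process-grid shape, and matrix file below are only an illustrative sketch:

export OMP_NUM_THREADS=16
srun -n 8 -c 16 --cpu-bind=cores ./pddrive3d -r 2 -c 2 -d 2 g20.rua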

The new features include the following:

  1. LU factorization: diagonal factorization, panel factorization, & Schur-complement update
    can all be offloaded to the GPU.
    Environment variables:

    • export SUPERLU_ACC_OFFLOAD=1 (default setting: enable GPU)
      - export GPU3DVERSION=1 (default setting; use code in CplusplusFactor/ for all offload)
      - export GPU3DVERSION=0 (only Schur-complement updates are offloaded)
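
    For example, a full-offload run might look like the following sketch (the launcher flags, grid shape, and matrix file are illustrative only):

    export SUPERLU_ACC_OFFLOAD=1
    export GPU3DVERSION=1
    srun -n 4 --gpus-per-task=1 ./pddrive3d -r 2 -c 1 -d 2 g20.rua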
  2. Triangular solve: new 3D communication-avoiding code
    Environment variable:
    export SUPERLU_ACC_SOLVE=0 (default setting; only on CPU)
    export SUPERLU_ACC_SOLVE=1 (offload to GPU)

    ** NOTE: when using multiple NVIDIA GPUs per 2D grid for the GPU triangular solve, we use NVSHMEM for fast
    inter-GPU communication, so NVSHMEM must be configured properly.
    For example, on Perlmutter at NERSC, the following setup is needed:
    module load nvshmem/2.11.0
    export NVSHMEM_HOME=/global/common/software/nersc9/nvshmem/2.11.0
    export NVSHMEM_USE_GDRCOPY=1
    export NVSHMEM_MPI_SUPPORT=1
    export MPI_HOME=${MPICH_DIR}
    export NVSHMEM_LIBFABRIC_SUPPORT=1
    export LIBFABRIC_HOME=/opt/cray/libfabric/1.15.2.0
    export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH
    export NVSHMEM_DISABLE_CUDA_VMM=1
    export FI_CXI_OPTIMIZED_MRS=false
    export NVSHMEM_BOOTSTRAP_TWO_STAGE=1
    export NVSHMEM_BOOTSTRAP=MPI
    export NVSHMEM_REMOTE_TRANSPORT=libfabric
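
    With NVSHMEM configured as above, a GPU triangular-solve run is enabled at run time the same way as the other offload settings (the launcher flags below are illustrative only):

    export SUPERLU_ACC_SOLVE=1
    srun -n 4 --gpus-per-task=1 ./pddrive3d -r 2 -c 2 -d 1 g20.rua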
    
  3. Batched interface to solve many independent systems at the same time
    Driver routine: p[d,s,z]gssvx3d_csc_batch.c
    Example program: p[d,s,z]drive3d.c [ -b batchCount ]
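
    For example, a sketch of solving 32 independent systems in one call with the double-precision example (the -b flag is from the usage line above; launcher flags and matrix file are illustrative):

    srun -n 1 --gpus-per-task=1 ./pddrive3d -b 32 g20.rua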

  4. Julia interface
    https://github.com/JuliaSparse/SuperLUDIST.jl

Dependencies: the following shows what needs to be defined in the CMake build script (a combined example configure command is shown after the list).

  1. Highly recommended:
  • BLAS:
    -DTPL_ENABLE_INTERNAL_BLASLIB=OFF
    -DTPL_BLAS_LIBRARIES="path to your BLAS library file"
  • ParMETIS:
    -DTPL_ENABLE_PARMETISLIB=ON
    -DTPL_PARMETIS_INCLUDE_DIRS="path to metis and parmetis header files"
    -DTPL_PARMETIS_LIBRARIES="path to metis and parmetis library files"
  2. If you use the GPU triangular solve, the following are needed:
  • LAPACK:
    -DTPL_ENABLE_LAPACKLIB=ON
    -DTPL_LAPACK_LIBRARIES="path to lapack library file"
  • NVSHMEM (needed when using multiple GPUs):
    -DTPL_ENABLE_NVSHMEM=ON
    -DTPL_NVSHMEM_LIBRARIES="path to nvshmem files"
  3. If you use the batched interface, MAGMA is needed:
    -DTPL_ENABLE_MAGMALIB=ON
    -DTPL_MAGMA_INCLUDE_DIRS="path to magma header files"
    -DTPL_MAGMA_LIBRARIES="path to magma library file"
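
As mentioned above, a combined configure command might look like the following sketch; every path is a placeholder for your system, and the exact NVSHMEM library names may differ by installation:

  cmake .. \
    -DTPL_ENABLE_INTERNAL_BLASLIB=OFF \
    -DTPL_BLAS_LIBRARIES="/path/to/blas/libblas.so" \
    -DTPL_ENABLE_PARMETISLIB=ON \
    -DTPL_PARMETIS_INCLUDE_DIRS="/path/to/parmetis/include" \
    -DTPL_PARMETIS_LIBRARIES="/path/to/parmetis/libparmetis.so;/path/to/metis/libmetis.so" \
    -DTPL_ENABLE_LAPACKLIB=ON \
    -DTPL_LAPACK_LIBRARIES="/path/to/lapack/liblapack.so" \
    -DTPL_ENABLE_NVSHMEM=ON \
    -DTPL_NVSHMEM_LIBRARIES="${NVSHMEM_HOME}/lib/libnvshmem_host.so;${NVSHMEM_HOME}/lib/libnvshmem_device.a" \
    -DTPL_ENABLE_MAGMALIB=ON \
    -DTPL_MAGMA_INCLUDE_DIRS="/path/to/magma/include" \
    -DTPL_MAGMA_LIBRARIES="/path/to/magma/lib/libmagma.so"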

What's Changed

  • Add create large array for broadcast by @SidShi in #157

New Contributors

Full Changelog: v8.2.1...v9.0.0