v9.0.0 release

@xiaoyeli released this 08 May 20:25

v9.0.0 Release Notes

An example program is EXAMPLE/pddrive3d.c, which calls the driver routine SRC/double/pdgssvx3d.c (or pdgssvx3d_csc_batch.c).

Please cite this ACM TOMS paper when you use these new features.

OpenMP performance hit:
On many systems, the default OMP_NUM_THREADS is set to the total number of CPU cores on a node; for example, it is 128 on Perlmutter at NERSC. This is too high, because most of the algorithms are not efficient in pure threading mode. We recommend that users experiment with a mixed MPI and OpenMP mode, starting with a smaller thread count, by setting:
export OMP_NUM_THREADS=1, or 2, or 3, ....
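For instance, on a 128-core Perlmutter node one might run 8 MPI ranks with 16 OpenMP threads each rather than a single rank with 128 threads. The launcher flags, process-grid shape, and matrix file below are only an illustrative sketch:

export OMP_NUM_THREADS=16
srun -n 8 -c 16 --cpu-bind=cores ./pddrive3d -r 2 -c 2 -d 2 g20.rua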

The new features include the following:

  1. LU factorization: diagonal factorization, panel factorization, & Schur-complement update
    can all be offloaded to the GPU.
    Environment variables:

    • export SUPERLU_ACC_OFFLOAD=1 (default setting: enable GPU)
      - export GPU3DVERSION=1 (default setting; use code in CplusplusFactor/ for all offload)
      - export GPU3DVERSION=0 (only Schur-complement updates are offloaded)
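
    For example, a full-offload run might look like the following sketch (the launcher flags, grid shape, and matrix file are illustrative only):

    export SUPERLU_ACC_OFFLOAD=1
    export GPU3DVERSION=1
    srun -n 4 --gpus-per-task=1 ./pddrive3d -r 2 -c 1 -d 2 g20.rua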
  2. Triangular solve: new 3D communication-avoiding code
    Environment variable:
    export SUPERLU_ACC_SOLVE=0 (default setting; only on CPU)
    export SUPERLU_ACC_SOLVE=1 (offload to GPU)

    ** NOTE: when using multiple NVIDIA GPUs per 2D grid for the GPU triangular solve, we use NVSHMEM for fast
    inter-GPU communication, so NVSHMEM must be configured properly.
    For example, on Perlmutter at NERSC, the following setup is needed:
    module load nvshmem/2.11.0
    export NVSHMEM_HOME=/global/common/software/nersc9/nvshmem/2.11.0
    export NVSHMEM_USE_GDRCOPY=1
    export NVSHMEM_MPI_SUPPORT=1
    export MPI_HOME=${MPICH_DIR}
    export NVSHMEM_LIBFABRIC_SUPPORT=1
    export LIBFABRIC_HOME=/opt/cray/libfabric/1.15.2.0
    export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH
    export NVSHMEM_DISABLE_CUDA_VMM=1
    export FI_CXI_OPTIMIZED_MRS=false
    export NVSHMEM_BOOTSTRAP_TWO_STAGE=1
    export NVSHMEM_BOOTSTRAP=MPI
    export NVSHMEM_REMOTE_TRANSPORT=libfabric
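
    With NVSHMEM configured as above, a GPU triangular-solve run is enabled at run time the same way as the other offload settings (the launcher flags below are illustrative only):

    export SUPERLU_ACC_SOLVE=1
    srun -n 4 --gpus-per-task=1 ./pddrive3d -r 2 -c 2 -d 1 g20.rua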
    
  3. Batched interface to solve many independent systems at the same time
    Driver routine: p[d,s,z]gssvx3d_csc_batch.c
    Example program: p[d,s,z]drive3d.c [ -b batchCount ]
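
    For example, a sketch of solving 32 independent systems in one call with the double-precision example (the -b flag is from the usage line above; launcher flags and matrix file are illustrative):

    srun -n 1 --gpus-per-task=1 ./pddrive3d -b 32 g20.rua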

  4. Julia interface
    https://github.com/JuliaSparse/SuperLUDIST.jl

Dependencies: the following shows what needs to be defined in the CMake build script (a combined example configure command is shown after the list).

  1. Highly recommended:
  • BLAS:
    -DTPL_ENABLE_INTERNAL_BLASLIB=OFF
    -DTPL_BLAS_LIBRARIES="path to your BLAS library file"
  • ParMETIS:
    -DTPL_ENABLE_PARMETISLIB=ON
    -DTPL_PARMETIS_INCLUDE_DIRS="path to metis and parmetis header files"
    -DTPL_PARMETIS_LIBRARIES="path to metis and parmetis library files"
  2. If you use the GPU triangular solve, the following are needed:
  • LAPACK:
    -DTPL_ENABLE_LAPACKLIB=ON
    -DTPL_LAPACK_LIBRARIES="path to lapack library file"
  • NVSHMEM (needed when using multiple GPUs):
    -DTPL_ENABLE_NVSHMEM=ON
    -DTPL_NVSHMEM_LIBRARIES="path to nvshmem files"
  3. If you use the batched interface, MAGMA is needed:
    -DTPL_ENABLE_MAGMALIB=ON
    -DTPL_MAGMA_INCLUDE_DIRS="path to magma header files"
    -DTPL_MAGMA_LIBRARIES="path to magma library file"
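
As mentioned above, a combined configure command might look like the following sketch; every path is a placeholder for your system, and the exact NVSHMEM library names may differ by installation:

  cmake .. \
    -DTPL_ENABLE_INTERNAL_BLASLIB=OFF \
    -DTPL_BLAS_LIBRARIES="/path/to/blas/libblas.so" \
    -DTPL_ENABLE_PARMETISLIB=ON \
    -DTPL_PARMETIS_INCLUDE_DIRS="/path/to/parmetis/include" \
    -DTPL_PARMETIS_LIBRARIES="/path/to/parmetis/libparmetis.so;/path/to/metis/libmetis.so" \
    -DTPL_ENABLE_LAPACKLIB=ON \
    -DTPL_LAPACK_LIBRARIES="/path/to/lapack/liblapack.so" \
    -DTPL_ENABLE_NVSHMEM=ON \
    -DTPL_NVSHMEM_LIBRARIES="${NVSHMEM_HOME}/lib/libnvshmem_host.so;${NVSHMEM_HOME}/lib/libnvshmem_device.a" \
    -DTPL_ENABLE_MAGMALIB=ON \
    -DTPL_MAGMA_INCLUDE_DIRS="/path/to/magma/include" \
    -DTPL_MAGMA_LIBRARIES="/path/to/magma/lib/libmagma.so"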

What's Changed

  • Add create large array for broadcast by @SidShi in #157

New Contributors

Full Changelog: v8.2.1...v9.0.0