Hi,
I have Parallel Studio 2017 Update 7 and have successfully compiled ELPA 2017.11.001 and then QE v6.3 via the configure-xxx-hsw.sh script.
It works when I run QE v6.3 on a single node of the cluster, e.g. srun -p ABC -N 1 -n 176 pw.x < my.in > my.out.
However, once I run across 2 nodes, e.g. srun -p ABC -N 2 -n 352 pw.x < my.in > my.out, it fails with the strange error "Error in routine cdiaghg problems computing cholesky".
If I compile ELPA and QE with the configure-xxx-hsw-omp.sh script, a single node also works. However, on 2 nodes it produces the message "PMPI_Group_incl: Invalid rank, error stack:" in slurm-xxx.out.
Could you please have a look at QE v6.3?
Moreover, a conventional build without xconfigure runs fine across multiple nodes, i.e. ./configure CC=icc CXX=icpc F77=ifort F90=ifort MPIF90=mpiifort --enable-shared --enable-parallel --disable-openmp --with-scalapack=intel CFLAGS="-O3 -I -xCORE-AVX2" CXXFLAGS="-O3 -I -xCORE-AVX2" FCFLAGS="-O3 -I -xCORE-AVX2" F90FLAGS="-O3 -I -xCORE-AVX2" FFLAGS="-O3 -I -xCORE-AVX2"
Thanks,
Rolly
Thank you for the report! At first glance, this looks like a problem that only occurs when ELPA is incorporated. I may step back from ELPA as the default in Xconfigure, or find a version that works again.
Hi Hans,
I have done some further tests and found that -D__NON_BLOCKING_SCATTER in QE's make.inc causes the problem.
I compiled ELPA as instructed, then removed this flag from QE's make.inc. v6.3 now runs, but I have to use pw.x -nk 2 to get a parallel speedup. Otherwise, 2 nodes run slower than 1 node on the AUSURF112 benchmark.
Not sure if -nk 2 can fix the problem?
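For reference, the workaround above can be sketched roughly as follows. This is only an illustration: the helper name strip_nonblocking_scatter is made up here, and it assumes the flag appears literally as -D__NON_BLOCKING_SCATTER in make.inc. The rebuild and srun lines are left as comments because they depend on the local setup.

```shell
# Hypothetical helper (not part of QE or xconfigure): remove the
# -D__NON_BLOCKING_SCATTER flag from a QE make.inc file.
strip_nonblocking_scatter() {
    # $1: path to make.inc; the original is kept as "$1.bak"
    sed -i.bak 's/[[:space:]]*-D__NON_BLOCKING_SCATTER//g' "$1"
}

# Assumed usage, run from the QE source directory:
#   strip_nonblocking_scatter make.inc
#   make clean && make pw
#   srun -p ABC -N 2 -n 352 pw.x -nk 2 < my.in > my.out
```

The backup file makes it easy to diff against the make.inc that the configure-xxx-hsw.sh script generated.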
Thanks,
Rolly