DALES on Fugaku
dev branch
git clone https://github.com/dalesteam/dales
cd dales
git checkout dev
git submodule init
git submodule update
. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load [email protected]%fj/mmdtg52
spack load fftw%fj
spack load [email protected]%gcc/wyds2me # load cmake to avoid the fj cmake loaded by spack
# for some reason this is needed for netcdf-c to find hdf5 libraries
export LDFLAGS="-lhdf5_hl -lhdf5"
export SYST=FX-Fujitsu
mkdir build && cd build
cmake .. -DUSE_FFTW=True
make -j 8
For single precision, substitute the cmake command:
cmake .. -DUSE_FFTW=True -DFIELD_PRECISION=32 -DPOIS_PRECISION=32
#!/bin/sh
#PJM -L "node=1"
#PJM -L "rscgrp=small"
#PJM -L "elapse=6:00:00"
#PJM --mpi max-proc-per-node=48
#PJM -x PJM_LLIO_GFSCACHE=/vol0004:/vol0005
#PJM -g hp240116
#PJM -s
# other PJM flags
# tuning the compute node file system caching behavior
# --llio localtmp-size=500Mi
# --llio cn-cache-size=1Gi # default = 128Mb
# --llio sio-read-cache=on
# do not create empty stdout/stderr files
export PLE_MPI_STD_EMPTYFILE=off
#load spack modules
. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load [email protected]%fj/mmdtg52
spack load fftw%fj
DALES=/path/to/dales
NAMOPTIONS=namoptions.001
# make sure every compute node has the executable in file-system cache
# especially recommended for large jobs
llio_transfer ${DALES}
mpiexec -n 48 ${DALES} ${NAMOPTIONS} # 48 ranks = node=1 * max-proc-per-node=48
7.6.2021, by Fredrik Jansson
Branch to4.4_fredrik contains a few fixes needed on Fugaku, in particular the compiler settings for cmake.
The amount of RAM per core is rather small, ~600 MB.
NetCDF4 seems to require a lot of memory; there is a namelist option to switch to NetCDF3: lclassic = .true.
It's probably better to turn off netCDF synchronization to reduce the amount of disk IO: don't set lsync = .true.
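A minimal namelist sketch of these two settings, assuming they live in &NAMNETCDFSTATS as in recent DALES versions (values illustrative):

```fortran
&NAMNETCDFSTATS
  lnetcdf  = .true.
  lclassic = .true.   ! write NetCDF3 classic files, lower memory use
  lsync    = .false.  ! do not sync after every write, less disk IO
/
```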
New spack, new compiler environment. Volumes must now be explicitly requested to be mounted by the compute nodes; we request /vol0004 because Spack is located there. Compiler: frtpx (FRT) 4.7.0 20211110.
Add this to the job script header to request mounting /vol0004 :
#PJM -x PJM_LLIO_GFSCACHE=/vol0004
Before compiling DALES, and in the run script before launching DALES:
. /vol0004/apps/oss/spack-v0.17.0/share/spack/setup-env.sh
spack load netcdf-fortran%fj
spack load fftw%fj
With this update there is a working system-wide spack again. It uses the new tcsds-1.2.33 toolchain. The issues with MPI functions not accepting (..) arguments should now be solved, and also the MPI problems from mixing different tcsds versions.
. /vol0004/apps/oss/spack-v0.16.2/share/spack/setup-env.sh
spack load netcdf-fortran%fj /bubmb4i
spack load fftw%fj
On Fugaku the spack package system is used to manage modules and libraries. We need it for fftw, netcdf and perhaps HYPRE. Currently the system-wide spack installation points to an older language environment tcsds-1.2.29
, which doesn't work well with DALES due to MPI problems.
(/vol0004/apps/oss/spack/etc/spack/packages.yaml refers to lang/tcsds-1.2.29).
As a work-around, we can set up a private spack environment. See Fugaku manual. Follow the steps, with these modifications:
- In step 3.2, replace tcsds-1.2.29 by tcsds-1.2.31 in $(HOME)/.spack/linux/compilers.yaml
- In step 3.3, don't link to the public instance
- In step 3.4, replace tcsds-1.2.29 by tcsds-1.2.31 in $(HOME)/.spack/linux/packages.yaml
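The two config edits can be scripted with sed; a sketch (the helper name is mine, paths as in the Fugaku manual steps):

```shell
# hypothetical helper: point a spack config file at the newer toolchain
swap_toolchain() {
    # replace every occurrence of the old tcsds version string
    sed -i 's/tcsds-1\.2\.29/tcsds-1.2.31/g' "$1"
}

# apply to both spack config files:
# swap_toolchain "$HOME/.spack/linux/compilers.yaml"
# swap_toolchain "$HOME/.spack/linux/packages.yaml"
```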
Submit an interactive job and install the spack modules
pjsub --interact -L "node=1" -L "rscgrp=int" -L "elapse=4:00:00" --sparam wait-time=900 --mpi max-proc-per-node=48
. ~/spack/share/spack/setup-env.sh
spack install fftw openmp=True
spack install netcdf-fortran
There are two Fujitsu compilers, named mpifrtpx (cross compiler, usable on the login node) and mpifrt (usable on the compute nodes). Currently the DALES CMakeLists specifies the cross compiler, so these steps work on the login node.
. ~/spack/share/spack/setup-env.sh
spack load netcdf-fortran%fj
spack load fftw%fj
# workaround for library errors in git after loading spack
export LD_LIBRARY_PATH=/lib64:$LD_LIBRARY_PATH
export SYST=FX-Fujitsu
export LDFLAGS="-lhdf5_hl -lhdf5"
mkdir build
cd build
cmake ../dales -DUSE_FFTW=True
make -j 4 2>&1 | tee compilation-log.txt
There is a compilation script for a local spack environment in the dales-tester repository: https://github.com/fjansson/dales-tester/blob/master/compile-fugaku-localspack.sh
For better file system performance, one should access files through the layered file system LLIO. In practice, for any path that starts with /vol0004/... use /vol0004_cache/... instead. This only works on the compute nodes. DALES writes output files in the current directory, so the job script should cd to /vol0004_cache/... before launching DALES.
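The path rewriting can be done with sed, mirroring what the sample run script below does; a sketch (the function name is mine):

```shell
# map a /volNNNN path to its LLIO-cached equivalent (/volNNNN_cache)
to_cache_path() {
    printf '%s\n' "$1" | sed 's|/vol\([0-9][0-9]*\)|/vol\1_cache|'
}

# e.g. in a job script, switch to the cached view of the current directory:
# cd "$(to_cache_path "$(pwd -P)")"
```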
Sample run script, where NX, NY, the number of nodes, and the git commit tag of the DALES binary can be specified on submission. If restart files exist in the job directory, they are used for a restart.
#!/bin/sh
#PJM -L "node=2"
#PJM -L "rscunit=rscunit_ft01"
#PJM -L "rscgrp=small"
#PJM -L "elapse=72:00:00"
#PJM --mpi max-proc-per-node=48
#PJM --llio cn-cache-size=1Gi # default = 128Mb
#PJM --llio sio-read-cache=on
#PJM -s
# submit as
# pjsub -x "TAG=c8cf1,NX=6,NY=8" -L "node=1" dales-fftw-fugaku.job
# defaults:
#NX=8 NY=12 TAG=78364 2 nodes
# do not create empty stdout/stderr files
export PLE_MPI_STD_EMPTYFILE=off
# load local spack environment
. ~/spack/share/spack/setup-env.sh
spack load netcdf-fortran%fj
spack load fftw%fj
if [ -z "$TAG" ]
then
TAG=78364
fi
if [ -z "$NX" ]
then
NX=8
fi
if [ -z "$NY" ]
then
NY=12
fi
SYST=FX-Fujitsu
NAMOPTIONS=namoptions.001
NTOT=$(($NX*NY))
DALES=/vol0004_cache/your-home-directory/dales-tester/build-$TAG-$SYST/src/dales4
# note use full path here, starting with /vol0004_cache/
llio_transfer ${DALES}
# distribute the binary to the compute nodes
# not required but might help performance when using many nodes
# the spack shared libraries are still accessed through /vol0004/...
WORK=`pwd -P | sed 's/vol[0-9]*/&_cache/'`
# get current directory, resolving symlinks. E.g. /vol0004/hp120279/u00892/runs/
# use sed to edit volXXXX to volXXXX_cache
cd $WORK
# make symlinks for RRTMG
ln -s ../../rrtmg_lw.nc ./
ln -s ../../rrtmg_sw.nc ./
ln -s ../../backrad.inp.001.nc ./
# edit nprocx, nprocy in namelist
sed -i -r "s/nprocx.*=.*/nprocx = $NX/;s/nprocy.*=.*/nprocy = $NY/" $NAMOPTIONS
# do a restart if files for that exist
if [ -f "initdlatestmx000y000.001" ]
then
# edit lwarmstart to true
sed -i -r "s/lwarmstart.*=.*/lwarmstart = .true./" $NAMOPTIONS
# edit startfile
sed -i -r "s/startfile.*=.*/startfile = \"initdlatestmx000y000.001\"/" $NAMOPTIONS
fi
echo SYST $SYST
echo DALES $DALES
echo WORK $WORK
echo NTOT $NTOT
echo NX,NY $NX,$NY
mpiexec -n $NTOT $DALES $NAMOPTIONS
Fugaku contains some post-processing nodes with x86 CPUs and more memory than the compute nodes. https://www.fugaku.r-ccs.riken.jp/doc_root/en/user_guides/pps-slurm-1.1/ These nodes use the slurm queue system.
merge_grids.py from https://github.com/CloudResolvingClimateModeling/dalesview can be used on the post-processing nodes.
Sample job script.
#!/bin/bash
#SBATCH -p ppmq # Specify a queue
#SBATCH -N 1 # Specify the number of allocated nodes
#SBATCH -J merge # Specify job name
# merge script for the post-processing queue
. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load py-pip%gcc
# Setup, on login node:
# spack load py-pip%gcc
# pip install --user netCDF4
for d in Run_76 Run_77 Run_81 ; do
pushd runs/$d
/usr/bin/time --format='(%e s %M kB)' ~/dalesview/merge_grids.py -j 1 --cross &
/usr/bin/time --format='(%e s %M kB)' ~/dalesview/merge_grids.py -j 1 --fielddump &
wait
popd
done
It is easiest here to bypass spack entirely, since it doesn't provide matching versions of conda for the various systems you'll encounter. On login nodes and post-processing nodes we need the x86_64 version; on regular compute nodes, the aarch64 version. One way to have everything is to install both:
# On login node
mkdir ~/miniconda3
cd miniconda3/
wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.9.2-Linux-x86_64.sh
bash Miniconda3-py37_4.9.2-Linux-x86_64.sh -u
source ~/.bashrc
conda create -n cloudmetenv python=3.7
conda activate cloudmetenv
conda install numpy scipy matplotlib netcdf4 pandas pytables pywavelets scikit-image scikit-learn seaborn spyder tqdm
# For compute node
mkdir ~/miniconda3-aarch64
cd miniconda3-aarch64/
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-aarch64.sh
To install on a compute node, run an interactive job:
pjsub --interact -L "node=1" -L "rscgrp=int" -L "elapse=4:00:00" --sparam wait-time=900 --mpi max-proc-per-node=48
bash Miniconda3-latest-Linux-aarch64.sh -u
source ~/.bashrc
conda create -n cloudmetenv python=3.7
conda activate cloudmetenv
conda install numpy scipy matplotlib netcdf4 pandas pytables pywavelets scikit-image scikit-learn seaborn tqdm
It is possible that conda install will report (phantom) merge conflicts for some of these packages. While I (Martin) haven't figured out exactly what causes this, I've managed to get around it by installing the offending packages from conda-forge, i.e. using:
conda install -c conda-forge netcdf4
Once the environment is set up, turn off automatic activation of the base conda environment, as it will fail on a login node, which is not aarch64:
conda config --set auto_activate_base false
The most recently installed version is the default that conda commands point to. In this case, if you're on an x86_64 node and would like to run conda, you have to manually source it from the right location:
source ~/miniconda3/bin/activate
From here on, conda
should work normally.
The graphical forwarding you get with ssh -Y is too slow to be practical, which rules out a graphical IDE such as Spyder. Instead, you can run interactive Python in Jupyter notebooks, which can be forwarded to your local computer. To do so, activate your conda environment and run:
login2$ source <path_where_conda_is_installed>/miniconda3/bin/activate # Activate the correct conda, depending on if you are on login/compute node
(base) login2$ conda activate cloudmetenv # activate the right environment
(base) login2$ conda install -c anaconda jupyter # installs jupyter and dependencies
The last line is obviously only necessary the first time you run this. Now fire up a notebook (with port argument indicating which port you'll connect to it with from your local machine):
(cloudmetenv) login2$ jupyter notebook --no-browser --port=8080
And, to get to this, run, on your local machine in a new terminal:
(base) MacBook-Pro-van-Martin:~ martinjanssens$ ssh -N -L localhost:8080:localhost:8080 <fugaku_user>@login2.fugaku.r-ccs.riken.jp
Here, make sure you connect to the same login node that you ran the notebook server in, and that the second port in the above command is the same as that you've opened the server on. Finally, you can access the notebook (make sure you have jupyter installed locally) by navigating to
localhost:8080
in a browser on the local machine. You may have to enter an access token to do so, which you'll have seen printed to your fugaku terminal after you fired the notebook up.
Merging with cdo is much faster than with the Python script. From version 2.0.4 (released 14 Feb 2022) onward, cdo has improved support for non-geographic grids like the ones we have in DALES. Improvements: no longer need to specify NX or list the tiles in the right order, and horizontal coordinate variables (xm, xt, ym, yt) are preserved. Previously these variables were lost.
(todo: how to get version 2.0.4? wait for new spack or install from source?)
Set up a local spack as above.
. ~/spack/share/spack/setup-env.sh
spack install cdo
Installation tested on compute node. On the compute node, installation took 3.5h - make the interactive job long enough for this to finish. (Jan 2022: cdo installation of spack on login node does not work, see below for installing from source)
. /vol0004/apps/oss/spack/share/spack/setup-env.sh
spack load /eh42puo # [email protected]%[email protected] arch=linux-rhel8-cascadelake
spack load /mutkzkd # [email protected]%[email protected] arch=linux-rhel8-skylake_avx512
mkdir src
cd src
wget https://code.mpimet.mpg.de/attachments/download/26761/cdo-2.0.4.tar.gz
tar -xzf cdo-2.0.4.tar.gz
cd cdo-2.0.4
./configure --prefix=${HOME}/OSS/cdo-x86 --with-netcdf=yes CC="gcc" CXX="g++" F77=gfortran
make
# make check
# FAIL: tsformat.test 8 - chaining set 1 with netCDF4
make install
The one test failure is probably due to HDF5 library lacking thread support.
. ~/spack/share/spack/setup-env.sh
spack load cdo
cdo
CDO requires that the time dimension has units in the form seconds since 2020-01-02T00:00:00. DALES by default only gives the unit s, and then CDO outputs nonsense time values.
A simple fix is to add xyear = 2020 in the &DOMAIN namelist; if both xyear and xday are present, a proper time unit is written in the netCDF. (There seems to be an off-by-one bug in the date output: day-of-year starts at 1 for the first day of the year.)
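A minimal namelist sketch of this fix (values illustrative):

```fortran
&DOMAIN
  xyear = 2020   ! start year
  xday  = 1.     ! day of year; together with xyear this yields proper time units
/
```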
NX=`ls surf_xy.x*y000.001.nc | wc -l` # find number of tiles in X direction.
cdo -f nc4 -z zip_6 -r -O collgrid,$NX `ls fielddump.*.001.nc | sort -t . -k 3` 3d.nc
cdo -f nc4 -z zip_6 -r -O collgrid,$NX `ls cape.*.nc | sort -t y -k 2` merged-cape.nc
# can specify a single variable to merge:
cdo -f nc4 -z zip_6 -r -O collgrid,$NX,thlxy `ls crossxy.0001.*.nc | sort -t y -k 3` merged-crossxy-thl.nc
The files should be ordered so that consecutive tiles are adjacent in X; sort takes care of this.
cdo doesn't like (output) files where the variables use different grids, e.g. mixing velocity and temperature. A work-around is to output these variables to different files.
-z zip_6 controls compression.
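The effect of the sort can be checked on a synthetic file list; with the fielddump.&lt;x&gt;.&lt;y&gt;.&lt;expnr&gt;.nc naming, sorting on field 3 groups the tiles by y, with x varying fastest:

```shell
# tiles of a 2x2 decomposition, deliberately listed out of order
sorted=$(printf '%s\n' \
    fielddump.001.000.001.nc \
    fielddump.000.001.001.nc \
    fielddump.001.001.001.nc \
    fielddump.000.000.001.nc \
    | sort -t . -k 3)
echo "$sorted"
```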
- copy the files to the node's local storage before merging. Use /tmp/ ?
- how many merge jobs can be run at once?
- avoid saving fields we don't need. CAPE contains a lot, we mainly want LWP, TWP, RWP, cloud-top-height.
There are two profilers, FIPP and FAPP; FAPP is more advanced. See the profiling section in the Fugaku manual and the Profiler manual (PDF).
# run the program to profile with fipp:
fipp -m 128000 -C -d profiling_data -Icall,cpupa,mpi mpiexec -n $NTOT $DALES $NAMOPTIONS
# -m sets the amount of memory to reserve for the measurements
# analyse
fipppx -A -pall -d profiling_data/ > fipp-output.txt
One can isolate a region of the program for measurement by inserting calls to start/stop functions. If that is not done, the whole program is profiled. No special compiler flags are needed for profiling. I have not managed to get MPI measurements from FIPP.
Add the flag -Koptmsg=2 to the compiler. The output is quite verbose.
Add the flag -Kfast,parallel to the compiler.
Reduce the number of processes per node, e.g. #PJM --mpi max-proc-per-node=8
in the job script.
Set the number of threads per process in the job script:
export PARALLEL=6 # Specify the number of thread
export OMP_NUM_THREADS=${PARALLEL} # Specify the number of thread
This seems to work, but is not faster than flat MPI.
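Putting the pieces together, a hybrid MPI/OpenMP job might combine these settings as follows (a job-script fragment sketch; node and rank counts illustrative):

```shell
#PJM -L "node=2"
#PJM --mpi max-proc-per-node=8
export PARALLEL=6                    # threads per MPI rank
export OMP_NUM_THREADS=${PARALLEL}
mpiexec -n 16 $DALES $NAMOPTIONS     # 2 nodes * 8 ranks, 6 threads each
```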
GDB seems to work. On an interactive node, the --gdbx option can be used with a gdb commands file as follows:
mpirun -n 1 --gdbx gdb_cmds ../../../build_debug/src/dales4.3 ../namoptions.002
where the file gdb_cmds contains:
run
bt
quit