Simple script to monitor memory usage on a node. By default it collects the total memory used on the node. It also accepts an option to collect memory per process (by parsing the output of ps), keeping only processes whose memory usage is above a given threshold. All memory values are in KB. For a more detailed description, run:
bash -h
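The per-process mode described above can be pictured with the following minimal sketch. It is not the supplied script: the function names `filter_by_rss` and `procs_above` are hypothetical, and it assumes that "memory per process" means the resident set size (RSS) reported by ps in KB.

```shell
#!/bin/bash
# Illustrative sketch only; function names are hypothetical and are not
# taken from the supplied script.

# Read "RSS PID COMMAND" lines on stdin and keep those whose RSS (in KB)
# is strictly greater than the threshold passed as the first argument.
filter_by_rss() {
    local threshold_kb=$1
    awk -v t="$threshold_kb" '$1 + 0 > t + 0 { print }'
}

# List every process on the node whose RSS exceeds the given threshold (KB).
# The empty "=" headers suppress the ps header line.
procs_above() {
    ps -e -o rss= -o pid= -o comm= | filter_by_rss "$1"
}
```

For example, `procs_above 1000000` would list processes using more than roughly 1 GB of resident memory.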
If you are using multiple nodes, you need to launch the script on each of them. With Slurm this means using srun to launch it in the background before launching your parallel application.
You can find an example script below for ARCHER2.
#!/bin/bash
#SBATCH --job-name=MY_JOB_NAME
#SBATCH --time=00:10:00
#SBATCH --exclusive
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --partition=standard
#SBATCH --qos=short
# Need overlap, oversubscribe and mem options to have two sruns running at the same time
srun --overlap --oversubscribe --mem=1GB --ntasks=$SLURM_NNODES --ntasks-per-node=1 ./ &
# Make sure it has started before running the main code
sleep 30
srun --overlap --oversubscribe --mem=220GB --unbuffered --distribution=block:block --hint=nomultithread my_mpi_code
This will create a separate log file for each node with the Slurm job ID appended, e.g. checkmem-nid004263-4103755.out. As supplied, the monitoring script reports the free memory (in KB) per node every 60 seconds; edit it to change, for example, the frequency.
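To make the logging behaviour concrete, here is a hedged sketch of how such a periodic monitor could be written. It is not the supplied script: the function names and parameters are hypothetical, the file name pattern simply mirrors the example above, and reading the MemFree field of /proc/meminfo is an assumption about what "free memory" means.

```shell
#!/bin/bash
# Illustrative sketch only; names and parameters are hypothetical.

# Build a per-node log name of the form checkmem-<hostname>-<jobid>.out.
log_name() {
    echo "checkmem-$(hostname)-${SLURM_JOB_ID:-nojob}.out"
}

# Print the free memory in KB from a meminfo-style file
# (defaults to /proc/meminfo on Linux).
free_kb() {
    awk '$1 == "MemFree:" { print $2 }' "${1:-/proc/meminfo}"
}

# Append "<epoch seconds> <free KB>" to the log every <interval> seconds.
# A count of 0 means run until the job step is killed.
monitor() {
    local interval=${1:-60} count=${2:-0} i=0
    while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
        echo "$(date +%s) $(free_kb)" >> "$(log_name)"
        i=$((i + 1))
        sleep "$interval"
    done
}
```

Calling `monitor 10` instead of `monitor 60` would sample every 10 seconds, which is the kind of change meant by editing the frequency.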
Both scripts need to be executable. After downloading them, run:
chmod +x *.sh