
Running EXARL on Summit


Computing Environment Instructions for Summit

Method 0: Use the build script

bash setup/build_exarl.sh

Method 1: Useful for a quick start. Here you clone a pre-built conda environment.

Step 1: Source the script to load modules and set environment variables

source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
and clone the EXARL repository
git clone --recursive https://github.com/exalearn/EXARL.git

Step 2: Clone the conda environment from the project directory (you only have to clone it once; if you have already cloned it, go to Step 3)

conda create --name exarl_summit --clone /ccs/proj/ast153/ExaLearn/.conda/envs

Step 3: If you have already cloned the environment, activate it

conda activate exarl_summit

Now you should have the basic dependencies installed to run EXARL. If you need additional dependencies, install them in your local copy of the environment.
Please do not modify the shared environment unless you know exactly what you are doing!
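
For example, a minimal sketch of adding a package to your local clone ("seaborn" is just a hypothetical placeholder for whatever package you actually need):

# make sure your local clone, not the shared project environment, is active
conda activate exarl_summit
conda install -c conda-forge seaborn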

Method 2: Manually create a conda environment

Module loading

module purge
module load spectrum-mpi/10.4.0.3-20210112
module load cuda/11.0.3
module load open-ce/1.4.0-py37-0
module load git-lfs

and clone the EXARL repository
git clone --recursive https://github.com/exalearn/EXARL.git

Clone the IBM Open-CE environment

conda create --name exarl_summit --clone open-ce-1.4.0-py37-0
conda activate exarl_summit 

Install additional dependencies using conda

conda install numba
conda install -c conda-forge matplotlib
pip install plotille
pip install gym==0.17.1
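
As a quick sanity check that the installed dependencies are usable, the following minimal sketch imports gym and instantiates the CartPole environment used in the examples below:

python -c "import gym; env = gym.make('CartPole-v0'); print(env.observation_space, env.action_space)"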

Add EXARL to python path

export PYTHONPATH=<EXARL top level dir>:$PYTHONPATH
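
For example, assuming the repository was cloned into your home directory (adjust the path to wherever you actually ran git clone):

export PYTHONPATH=$HOME/EXARL:$PYTHONPATH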

Building mpi4py (optional; only needed if the conda install of mpi4py does not work. Recommended: run module load spectrum-mpi/10.4.0.3-20210112 first)

wget https://github.com/mpi4py/mpi4py/archive/3.0.3.tar.gz
tar -xvzf 3.0.3.tar.gz
cd mpi4py-3.0.3
python setup.py build --mpicc `which mpicc`
python setup.py install --user
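
To confirm that the freshly built mpi4py picked up Spectrum MPI, a quick check (Get_library_version is part of the standard mpi4py API):

python -c "from mpi4py import MPI; print(MPI.Get_library_version())"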

Running a job

  • Interactive job

    1. Get an interactive allocation (the "gpumps" option can be used when multiple processes need to
      access a particular GPU; however, it turns on CUDA MPS, which usually increases kernel launch
      times, so use it only if necessary):
      bsub -Is -alloc_flags "gpumps" -P ast153 -nnodes 1 -W 2:00 $SHELL

    2. Set up the environment, for example by loading the necessary modules and activating the conda environment:
      source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
      conda activate exarl_summit
      export OMP_NUM_THREADS=1
      Do not set CUDA_VISIBLE_DEVICES yourself; use the job script below with multiple resource sets
      instead (a quick check of the resulting GPU assignment is sketched after the batch example).

    3. Run the job after getting an allocation
      jsrun --nrs 6 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --latency_priority CPU-CPU --launch_distribution packed --bind packed:7 python <...>

  • Batch job
    Example script to run the CartPole environment (this particular example shows the CPU version)
    on multiple nodes. Pay attention to the arguments passed to the driver; you may not need all of
    them, or you may need to change their values:

#!/bin/bash
#BSUB -P AST153
#BSUB -W 1:30
#BSUB -nnodes 32
#BSUB -J CartPole
#BSUB -o Cartpole.%J
#BSUB -e CartPole.%J

source /ccs/proj/ast153/ExaLearn/set_summit_env.sh
conda activate exarl_summit
export OMP_NUM_THREADS=7

for t in 1 2 4 8 16 32
do
export nres=$((t*6))
echo "---------------------------------------"
echo "Running ExaRL with CartPole on $t nodes"
echo "---------------------------------------"
jsrun --nrs $nres --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 --latency_priority GPU-CPU --launch_distribution packed --bind packed:7 python exarl/driver/__main__.py --output_dir /gpfs/alpine/ast153/scratch/$USER/ --env CartPole-v0 --n_episodes 500 --n_steps 60 --learner_type async --agent DQN-v0 --model_type LSTM
done
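
Before committing to a long run, it can be useful to verify how jsrun lays out the resource sets. The following is a minimal sketch (not part of EXARL, and assuming jsrun exports CUDA_VISIBLE_DEVICES per resource set, as it does by default on Summit) that prints, for each task, its host name and the GPUs visible to it; with the settings above you should see six tasks per node, each with a single GPU:

jsrun --nrs 6 --tasks_per_rs 1 --cpu_per_rs 7 --gpu_per_rs 1 --rs_per_host 6 python -c "import os, socket; print(socket.gethostname(), os.environ.get('CUDA_VISIBLE_DEVICES'))"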

Advanced topics

Check the impact of some important jsrun arguments at https://jobstepviewer.olcf.ornl.gov.
If the GPU is used, then after your job completes there will be some messages
from TensorFlow (assuming device-placement logging is enabled when TF is configured), like the
following (note the GPU:0; if the CPU were used, it would read CPU:0):

...
Placeholder: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder_1: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder_2: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder_3: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder_4: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
Placeholder_5: (Placeholder): /job:localhost/replica:0/task:0/device:GPU:0
...
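
If no such messages appear, device-placement logging may simply be off. A minimal sketch for turning it on and triggering a single GPU-placed op (tf.debugging.set_log_device_placement is the standard TensorFlow 2.x switch; adjust if the open-ce TF build differs):

python -c "import tensorflow as tf; tf.debugging.set_log_device_placement(True); print(tf.constant([1.0]) + tf.constant([2.0]))"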

Summit user-guide: https://docs.olcf.ornl.gov/systems/summit_user_guide.html
Summit job step viewer: https://jsrunvisualizer.olcf.ornl.gov/

About the default configuration

(from Tom Papatheodore, OLCF Support)
The CUDA runtime will label GPUs within a resource set starting with index 0 regardless of the actual GPU ID on the system. For example, if you define 2 resource sets, each with 3 GPUs, then the CUDA runtime will label them as 0-2 for the first resource set and also 0-2 for the second resource set, even though they are actually GPUs 0-2 (first resource set) and GPUs 3-5 (second resource set). See the example below where RT_GPU_id prints the CUDA runtime labeling of GPUs and GPU_id prints the global GPU ID on the node.
[tpapathe@batch1: ~/repositories/Hello_jsrun]$ jsrun -n2 -c21 -g3 -a1 -bpacked:21 ./hello_jsrun | sort
---------- MPI Ranks: 2, OpenMP Threads: 1, GPUs per Resource Set: 3 ----------
MPI Rank 000, OMP_thread 00 on HWThread 000 of Node f02n05 - RT_GPU_id 0 1 2 : GPU_id 0 1 2
MPI Rank 001, OMP_thread 00 on HWThread 088 of Node f02n05 - RT_GPU_id 0 1 2 : GPU_id 3 4 5
So the one MPI rank in the first resource set (rank 0) has access to global GPU IDs 0-2 and also shows GPU IDs 0-2 from the CUDA runtime, but the one MPI rank in the second resource set (rank 1) has access to global GPU IDs 3-5 although the CUDA runtime labels them as 0-2. However, the MPI ranks do actually have access to different GPUs.

MPI performance

As per the Summit user guide, Summit's official MPI distribution, Spectrum MPI, is
tuned to minimize latency rather than maximize bandwidth. If the code makes heavy use of
all-to-all-type collectives, then maximizing bandwidth would be preferable, and the following
environment variables should be set (from: https://docs.olcf.ornl.gov/systems/summit_user_guide.html):
$ export PAMI_ENABLE_STRIPING=1
$ export PAMI_IBV_ADAPTER_AFFINITY=1
$ export PAMI_IBV_DEVICE_NAME="mlx5_0:1,mlx5_3:1"
$ export PAMI_IBV_DEVICE_NAME_1="mlx5_3:1,mlx5_0:1"
In the current version of the code, these variables did not make any difference.

Known issues

  1. The Python logging module can cause heavy slowdowns due to I/O overhead.
    In that case, check whether unbuffered I/O is used (or flush the output buffer); see the
    sketch after this list.

  2. [Needs confirmation] Normally we use GPUs for training, but when exclusively using
    CPU(s) for training, you may not be able to assign all possible processor threads of a node
    (e.g., an IBM AC922 node on Summit has 42 cores over two sockets, and each core can have
    up to 4 active threads) to MPI, as the underlying Keras platform may use the multiprocessing
    module to spawn extra threads.
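
For issue 1, a minimal sketch of forcing unbuffered Python output (either form works; PYTHONUNBUFFERED also covers any child processes that inherit the environment):

# set before the jsrun line in the batch script above
export PYTHONUNBUFFERED=1
# or, equivalently, invoke the driver with "python -u" instead of "python"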
