Running EXARL on NERSC

Computing Environment Instructions

ExaRL is configured for the Cori GPU cluster at NERSC. Its dependencies are provided in a Docker image, which is executed using Shifter. The image is uploaded to the NERSC private registry under the name registry.nersc.gov/apg/exarl-ngc:0.1. (The 0.1 is a meaningless version number and can be ignored.)
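If the image is not already available on the system, it can be pulled with shifterimg (a sketch; pulling from the private registry may require additional NERSC permissions):

shifterimg pull registry.nersc.gov/apg/exarl-ngc:0.1
shifterimg images | grep exarl   # confirm the image is now listed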

The image mentioned above does not contain the ExaRL code itself. However, ExaRL can be built against the dependencies provided in that image by running the following commands on Cori (either a login node or a GPU node):

# Start an interactive shell inside the image containing the ExaRL dependencies
shifter --image=registry.nersc.gov/apg/exarl-ngc:0.1 /bin/bash
# Build ExaRL in-place against the Python environment provided by the image
cd /path/to/ExaRL
pip install -e .
exit

The above steps load the Shifter image containing the ExaRL dependencies, build ExaRL against the contents of that image, and then exit the image. Note that, unlike Docker, which replaces the entire file system with the contents of the image, Shifter ensures that all NERSC file systems ($HOME, $SCRATCH, CFS, etc.) remain visible even when the image is loaded.
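For example, one can confirm that the NERSC file systems remain visible by listing a directory from inside the image:

shifter --image=registry.nersc.gov/apg/exarl-ngc:0.1 ls $SCRATCH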

In the above example, ExaRL will be installed into /path/to/ExaRL. One must ensure that PYTHONPATH includes that directory, or else execution of the code will fail because the exarl Python module cannot be found.
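For example, using the placeholder path from above, one can set PYTHONPATH and verify that the module is importable from inside the image:

export PYTHONPATH=/path/to/ExaRL:$PYTHONPATH
shifter --image=registry.nersc.gov/apg/exarl-ngc:0.1 python -c "import exarl"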

Example scripts for running the ExaRL code on Cori GPU are located here on the NERSC Community File System (CFS):

/global/cfs/cdirs/m3363/exarl-profiling/template-job-scripts
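These templates can be copied to a writable location and adapted, for example:

cp -r /global/cfs/cdirs/m3363/exarl-profiling/template-job-scripts $SCRATCH/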

To run the code with 8 MPI processes and 1 GPU per process (8 GPUs total) on Cori GPU:

#!/bin/bash
#SBATCH -C gpu
#SBATCH -n 8
#SBATCH --gpus-per-task=1
#SBATCH -c 10
#SBATCH -t 120
#SBATCH -J ExaRL-ExaLearnBlockCoPolymerTDLG-v3-8g-cgpu
#SBATCH -o ExaRL-ExaLearnBlockCoPolymerTDLG-v3-8g-cgpu.%j.out
#SBATCH -L cfs
#SBATCH --image=registry.nersc.gov/apg/exarl-ngc:0.1
#SBATCH --gpu-bind=map_gpu:0,1,2,3,4,5,6,7

export SLURM_CPU_BIND="cores"

# Write output to a job-specific directory on $SCRATCH (a writable NERSC file system)
output_dir="${SCRATCH}/exarl-output-${SLURM_JOB_ID}"
mkdir -p "${output_dir}"
cd /global/cfs/cdirs/m3363/exarl-profiling/ExaRL.git

srun \
shifter \
python \
  driver/driver_example.py \
  --output_dir ${output_dir} \
  --env ExaLearnBlockCoPolymerTDLG-v3 \
  --n_episodes 500 \
  --n_steps 60 \
  --learner_type async \
  --agent DQN-v0 \
  --model_type LSTM

Note that the argument of --output_dir must be a location on a NERSC file system, not a location inside the Shifter image; otherwise the code will fail because the contents of Shifter images are mounted read-only.
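Assuming the script above is saved as run-exarl.sh (the file name here is a placeholder), it can be submitted and monitored with the standard Slurm commands:

module load cgpu      # make the Cori GPU partition available
sbatch run-exarl.sh   # submit the job
squeue -u $USER       # check its status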

Create an environment using conda

  • These instructions are for Cori GPU. Load the Cori GPU module first:
module load cgpu

Create a conda environment (e.g., exarl):

module load tensorflow/gpu-2.1.0-py37
conda create -n exarl python=3.7   # create the environment if it does not already exist
conda activate exarl
# install whatever EXARL needs (run from the ExaRL source directory)
pip install -e . --user
...

Get an interactive node, then run EXARL
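A typical interactive allocation on Cori GPU might look like the following (the node count, GPU count, and time limit here are illustrative; an -A account flag may also be required):

module load cgpu
salloc -C gpu -N 1 -G 1 -c 10 -t 60

Then, inside the allocation: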

module load tensorflow/gpu-2.1.0-py37
conda activate exarl
srun -n 2 -c 2 python exarl/driver/ --env ExaBoosterDiscrete-v0 --n_episodes 100 --n_steps 100 --learner_procs 1