# Running EXARL on NERSC

## Computing Environment Instructions
ExaRL is configured for the Cori GPU cluster at NERSC. Its dependencies are provided in a Docker image, which is executed using Shifter. The image is uploaded to the NERSC private registry under the name `registry.nersc.gov/apg/exarl-ngc:0.1`. (The `0.1` is a meaningless version number and can be ignored.)
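If the image is not already available on Cori, it can typically be pulled onto the system with `shifterimg` (a sketch, not part of the original instructions; access to the private registry may require prior authentication):

```bash
# Pull the dependency image from the NERSC private registry.
shifterimg pull registry.nersc.gov/apg/exarl-ngc:0.1

# Confirm the image is available on the system.
shifterimg images | grep exarl-ngc
```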
The image mentioned above does not contain the ExaRL code itself. However, one can build ExaRL against the dependencies provided in that image by running the following commands on Cori (either a login node or a GPU node):
```bash
shifter --image=registry.nersc.gov/apg/exarl-ngc:0.1 /bin/bash
cd /path/to/ExaRL
pip install -e .
exit
```
The above steps will load the Shifter image containing the ExaRL dependencies, build ExaRL against the contents of that image, and then exit the image. Note that, unlike Docker, which overwrites the entire file system with the contents of the image, Shifter ensures that all NERSC file systems (`$HOME`, `$SCRATCH`, CFS, etc.) remain visible even when the image is loaded.
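As a quick check (a sketch, not from the original page), you can list a NERSC directory from inside the image to confirm the file systems are still mounted:

```bash
# $SCRATCH is a NERSC file system; it remains visible inside Shifter.
shifter --image=registry.nersc.gov/apg/exarl-ngc:0.1 ls $SCRATCH
```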
In the above example, ExaRL will be installed into `/path/to/ExaRL`. One must ensure that `PYTHONPATH` includes that directory, or else execution of the code will fail because Python will be unable to find the `exarl` module.
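For example (a minimal sketch; `/path/to/ExaRL` is the placeholder checkout location used above), the directory can be added to `PYTHONPATH` in your shell or job script:

```bash
# Make the exarl module importable; substitute your actual checkout path.
export PYTHONPATH=/path/to/ExaRL:${PYTHONPATH}
```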
Example scripts for running the ExaRL code on Cori GPU are located here on the NERSC Community File System (CFS):

```
/global/cfs/cdirs/m3363/exarl-profiling/template-job-scripts
```
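To start from one of the templates, you might first copy them into your own space for editing (a sketch; the destination directory is arbitrary):

```bash
# Copy the template job scripts to $SCRATCH for editing.
cp -r /global/cfs/cdirs/m3363/exarl-profiling/template-job-scripts $SCRATCH/
```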
To run the code with 8 MPI processes and 1 GPU per process (8 GPUs total) on Cori GPU:
```bash
#!/bin/bash
#SBATCH -C gpu
#SBATCH -n 8
#SBATCH --gpus-per-task=1
#SBATCH -c 10
#SBATCH -t 120
#SBATCH -J ExaRL-ExaLearnBlockCoPolymerTDLG-v3-8g-cgpu
#SBATCH -o ExaRL-ExaLearnBlockCoPolymerTDLG-v3-8g-cgpu.%j.out
#SBATCH -L cfs
#SBATCH --image=registry.nersc.gov/apg/exarl-ngc:0.1
#SBATCH --gpu-bind=map_gpu:0,1,2,3,4,5,6,7

export SLURM_CPU_BIND="cores"

# Write output to a job-specific directory on the scratch file system.
output_dir="${SCRATCH}/exarl-output-${SLURM_JOB_ID}"
mkdir -p "${output_dir}"

cd /global/cfs/cdirs/m3363/exarl-profiling/ExaRL.git

srun \
    shifter \
    python \
    driver/driver_example.py \
    --output_dir ${output_dir} \
    --env ExaLearnBlockCoPolymerTDLG-v3 \
    --n_episodes 500 \
    --n_steps 60 \
    --learner_type async \
    --agent DQN-v0 \
    --model_type LSTM
```
Note that the argument of `--output_dir` must be a NERSC file system location, not a location inside the Shifter image; otherwise the code will fail because the contents of Shifter images are mounted read-only.
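Once the script is saved (here hypothetically as `run-exarl-cgpu.sh`), it can be submitted with the standard Slurm workflow; loading the `cgpu` module is what makes the Cori GPU queue reachable:

```bash
module load cgpu            # enables submission to the Cori GPU partition
sbatch run-exarl-cgpu.sh    # hypothetical name for the script above
squeue -u $USER             # monitor the job
```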
## Alternative: Conda environment on Cori GPU

These instructions are for using Cori GPU with a conda environment instead of the Shifter image. First, load the Cori GPU module:
```bash
module load cgpu
```
Create a conda environment (e.g. `exarl`).
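The creation command itself is not shown on this page; a minimal sketch, assuming Python 3.7 to match the `py37` TensorFlow module used below, is:

```bash
# Hypothetical: create the environment once, before the steps below.
conda create --name exarl python=3.7
```

Then activate the environment and install what EXARL needs: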
```bash
module load tensorflow/gpu-2.1.0-py37
conda activate exarl
# install whatever EXARL needs
pip install -e . --user
# ...
```
Get an interactive node, then run EXARL.
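The allocation command is not given here; a minimal sketch for requesting an interactive Cori GPU node (the GPU count, CPU count, time limit, and account are assumptions; `m3363` is the project that owns the CFS directory above) is:

```bash
# Hypothetical interactive allocation; adjust resources and account as needed.
module load cgpu
salloc -C gpu -N 1 -G 2 -c 4 -t 60 -A m3363
```

Once on the node: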
```bash
module load tensorflow/gpu-2.1.0-py37
conda activate exarl
srun -n 2 -c 2 python exarl/driver/ --env ExaBoosterDiscrete-v0 --n_episodes 100 --n_steps 100 --learner_procs 1
```