Skip to content

Latest commit

 

History

History
69 lines (59 loc) · 3.11 KB

README.md

File metadata and controls

69 lines (59 loc) · 3.11 KB

Template for Creating Multi-node Parallel Batch Docker Images

Template scripts to setup Docker Images compatible with running on MNP Batch

License Summary

This sample code is made available under a modified MIT license. See the LICENSE file.

Horovod Tensorflow Deployment

To build a Tensorflow reference docker image compatible with running tightly coupled multi-node parallel batch jobs on AWS Batch. Build platform requires installation of nvidia-docker2.

git clone https://github.com/aws-samples/aws-mnpbatch-template.git
cd aws-mnpbatch-template
docker build -t nvidia/mnp-batch-tensorflow .

Custom Application Deployment

The Dockerfile can mostly be reused for your application, It installs the following stack:

Ubuntu 18.04 nvidia/cuda base docker image
APT packages for dependencies
SSH SETUP
S3 OPTIMIZATION
CUDA-AWARE OpenMPI 4.0.0
TENSORFLOW/HOROVOD INSTALL
IMAGENET DATASET
SUPERVISOR DOCKER CONTAINER STARTUP

Thus if you want apply your own customizations and application, you just need to modify the MPI, Tensorflow layers. Also custom build scripts are located in conf/.

Finally replace the section in supervised-scripts/mpi-run.sh to support the MPI startup of your custom application. The script logic will prepare the mpi machine file. If your node contains GPUs then the slots= with be the number of GPUs per node.

if [ -x "$(command -v nvidia-smi)" ] ; then
      NUM_GPUS=$(ls -l /dev/nvidia[0-9] | wc -l)
      availablecores=$NUM_GPUS
  else
      availablecores=$(nproc)
  fi

If not then it will based on the vCPUs/Cores where applicable and passed as ${HOST_FILE_PATH}-deduped. Any extra MPI parameters at job runtime will be passed into the $EXTRA_MPI_PARAMS.

wait_for_nodes () {
	.
	.
	.
  aws s3 cp $S3_INPUT $SCRATCH_DIR
  #tar -xvf $SCRATCH_DIR/*.tar.gz -C $SCRATCH_DIR

  cd $SCRATCH_DIR
  export INTERFACE=eth0
  export MODEL_HOME=/root/deep-learning-models/models/resnet/tensorflow
  /opt/openmpi/bin/mpirun --allow-run-as-root -np $MPI_GPUS --machinefile ${HOST_FILE_PATH}-deduped -mca plm_rsh_no_tree_spawn 1 \
                          -bind-to socket -map-by slot \
                          -x HOROVOD_HIERARCHICAL_ALLREDUCE=1 -x HOROVOD_FUSION_THRESHOLD=16777216 \
                          -x NCCL_MIN_NRINGS=4 -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib \
                          -x NCCL_SOCKET_IFNAME=$INTERFACE -mca btl_tcp_if_include $INTERFACE \
                          $EXTRA_MPI_PARAMS -x TF_CPP_MIN_LOG_LEVEL=0 \
                          python3 -W ignore $MODEL_HOME/train_imagenet_resnet_hvd.py \
                          --data_dir $JOB_DIR --num_epochs 90 -b $BATCH_SIZE \
                          --lr_decay_mode poly --warmup_epochs 10 --clear_log
  sleep 2

  #tar -czvf $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz $SCRATCH_DIR/*
  #aws s3 cp $JOB_DIR/batch_output_$AWS_BATCH_JOB_ID.tar.gz $S3_OUTPUT
 }

Once built you can commit this docker image to your AWS Elastic Container Registry (ECR) using these instructions.