Commit fbf2040

Merge branch 'main' of https://github.com/OSU-Nowlab/MPI4DL into GEMS_SPATIAL

Signed-off-by: Radha Guhane <[email protected]>
RadhaGulhane13 committed Nov 5, 2023
2 parents 7d0d186 + c671670 commit fbf2040
Showing 14 changed files with 230 additions and 536 deletions.
20 changes: 20 additions & 0 deletions .github/workflows/MPI4DL-github-repo-stats.yml
@@ -0,0 +1,20 @@
name: MPI4DL-github-repo-stats.yml

on:
  schedule:
    # Run this once per day, towards the end of the day for keeping the most
    # recent data point most meaningful (hours are interpreted in UTC).
    - cron: "0 23 * * *"
  workflow_dispatch: # Allow for running this manually.

jobs:
  j1:
    name: MPI4DL-github-repo-stats.yml
    runs-on: ubuntu-latest
    steps:
      - name: run-ghrs
        # Use latest release.
        uses: jgehrcke/github-repo-stats@RELEASE
        with:
          ghtoken: ${{ secrets.ghrs_github_api_token }}

2 changes: 2 additions & 0 deletions README.md
@@ -1,3 +1,5 @@
MPI4DL is a [HiDL](https://hidl.cse.ohio-state.edu/) project. We encourage you to visit the [HiDL website](https://hidl.cse.ohio-state.edu/) for additional information, the latest performance numbers, and similar projects on high-performance machine and deep learning. For the latest announcements on HiDL projects, [register for the HiDL mailing list](https://hidl.cse.ohio-state.edu/mailinglists/).

# MPI4DL v0.5

The size of image-based DL models regularly grows beyond the memory available on a single processor (we call such models **out-of-core**), and they require advanced parallelism schemes to fit within device memory. Further, the massive image sizes required in specialized applications such as medical and satellite imaging can themselves place significant device memory pressure, and require parallelism schemes to process efficiently during training. Finally, the simplest parallelism scheme, [layer parallelism](#layer-parallelism), is highly inefficient. Several approaches have been proposed to address some of the limitations of layer parallelism; however, most studies are performed on low-resolution images that exhibit different characteristics. Compared to low-resolution images, high-resolution images (e.g. digital pathology, satellite imaging) result in higher activation memory and larger tensors, which in turn lead to a larger communication overhead.
67 changes: 67 additions & 0 deletions benchmarks/gems_master_model/README.md
@@ -0,0 +1,67 @@
# GEMS: <u>G</u>PU-<u>E</u>nabled <u>M</u>emory-Aware Model-Parallelism <u>S</u>ystem for Distributed DNN Training
Model Parallelism is necessary for training out-of-core models; however, it can lead to underutilization of resources. To address this limitation, Pipeline Parallelism is employed, where the batch size is set to greater than 1. However, when dealing with very high-resolution images, certain state-of-the-art models can only work with a unit batch size. GEMS is a memory-efficient design for model parallelism that enables training models with any batch size while utilizing the same resources. For more details, please refer to the original paper: [GEMS: <u>G</u>PU-<u>E</u>nabled <u>M</u>emory-Aware Model-Parallelism <u>S</u>ystem for Distributed DNN Training](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9355254).
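
To make the replication idea concrete, below is a minimal Python sketch of how GEMS-style replication reaches an effective batch size of η × BS on the same GPUs used for model parallelism. It is not the MPI4DL API; the helper name and the reversed-placement heuristic are illustrative assumptions only.

```python
# Illustrative sketch only (not the MPI4DL implementation or its API).
# GEMS trains several model replicas over the same set of GPUs; placing the
# second replica's partitions in reverse order lets the replicas' compute and
# memory peaks interleave instead of colliding on the same device.

def partition_to_gpu(num_partitions, replica):
    """Replica 0 maps partition i -> GPU i; odd replicas map partition i -> GPU (n-1-i)."""
    gpus = list(range(num_partitions))
    return gpus if replica % 2 == 0 else gpus[::-1]

split_size = 4   # --split-size: number of model partitions
times = 2        # --times: replication factor (eta)
batch_size = 1   # --batch-size: per-replica batch size

effective_batch_size = times * batch_size  # EBS = eta x BS
for replica in range(times):
    print(f"replica {replica}: partition -> GPU map {partition_to_gpu(split_size, replica)}")
print(f"effective batch size: {effective_batch_size}")
```

With the AmoebaNet example below (split size 4, times 2, batch size 1), this sketch reproduces the EBS = 2 quoted there.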

## Run GEMS-MASTER:

#### Generic command:
```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python ${gems_model_script} --split-size ${split_size} --image-size ${image_size} --batch-size ${batch_size} --times ${times}
```
#### Examples

- Example to run the AmoebaNet MASTER model for a 1024 × 1024 image size with a model split size of 4 (i.e., # of partitions for MP), a model replication factor of η = 2, and a batch size of 1 for each model replica (i.e., effective batch size (EBS) = η × BS = 2).

```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/gems_master_model/benchmark_amoebanet_gems_master.py --split-size 4 --image-size 1024 --batch-size 1 --times 2
```
- Similarly, we can run the benchmark for the ResNet MASTER model.
Below is an example to run the ResNet MASTER model for a 2048 × 2048 image size with a model split size of 4 (i.e., # of partitions for MP), a model replication factor of η = 4, and a batch size of 1 for each model replica (i.e., effective batch size (EBS) = η × BS = 4).
```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${HOSTFILE} MV2_USE_GDRCOPY=0 MV2_ENABLE_AFFINITY=0 MV2_USE_CUDA=1 LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/gems_master_model/benchmark_resnet_gems_master.py --split-size 4 --image-size 2048 --batch-size 1 --times 4 &>> $OUTFILE 2>&1

```

Below are the available configuration options:

<pre>
usage: benchmark_amoebanet_sp.py [-h] [-v] [--batch-size BATCH_SIZE] [--parts PARTS] [--split-size SPLIT_SIZE] [--num-spatial-parts NUM_SPATIAL_PARTS]
                                 [--spatial-size SPATIAL_SIZE] [--times TIMES] [--image-size IMAGE_SIZE] [--num-epochs NUM_EPOCHS] [--num-layers NUM_LAYERS]
                                 [--num-filters NUM_FILTERS] [--balance BALANCE] [--halo-D2] [--fused-layers FUSED_LAYERS] [--local-DP LOCAL_DP] [--slice-method SLICE_METHOD]
                                 [--app APP] [--datapath DATAPATH]

SP-MP-DP Configuration Script

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Prints performance numbers or logs (default: False)
  --batch-size BATCH_SIZE
                        input batch size (default: 32)
  --parts PARTS         Number of parts for MP (default: 1)
  --split-size SPLIT_SIZE
                        Number of processes for MP (default: 2)
  --num-spatial-parts NUM_SPATIAL_PARTS
                        Number of partitions in spatial parallelism (default: 4)
  --spatial-size SPATIAL_SIZE
                        Number of splits for spatial parallelism (default: 1)
  --times TIMES         Number of times to repeat MASTER 1: 2 replications, 2: 4 replications (default: 1)
  --image-size IMAGE_SIZE
                        Image size for synthetic benchmark (default: 32)
  --num-epochs NUM_EPOCHS
                        Number of epochs (default: 1)
  --num-layers NUM_LAYERS
                        Number of layers in amoebanet (default: 18)
  --num-filters NUM_FILTERS
                        Number of filters in amoebanet (default: 416)
  --balance BALANCE     Length of the list should equal the number of partitions and the sum should equal num-layers (default: None)
  --halo-D2             Enable design2 (do halo exchange on few convs) for spatial conv. (default: False)
  --fused-layers FUSED_LAYERS
                        When the D2 design is enabled for halo exchange, number of blocks to fuse in the ResNet model (default: 1)
  --local-DP LOCAL_DP   LBANN integration of SP with MP. MP can apply data parallelism. 1: only one GPU for a given split, 2: two GPUs for a given split (uses DP)
                        (default: 1)
  --slice-method SLICE_METHOD
                        Slice method (square, vertical, and horizontal) in spatial parallelism (default: square)
  --app APP             Application type (1.medical, 2.cifar, and synthetic) in spatial parallelism (default: 3)
  --datapath DATAPATH   Local dataset path (default: ./train)
</pre>

*Note: "--times" is a GEMS-specific parameter, and certain parameters such as "--num-spatial-parts", "--slice-method", and "--halo-D2" are not required by GEMS.*
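
The `--balance` option described above is easiest to see with a small example. The sketch below is illustrative only: it assumes the list is passed as a comma-separated string, and `parse_balance` is not a helper from the benchmark scripts.

```python
# Hedged sketch: validate a "--balance" value against "--split-size" and
# "--num-layers" as described above. The comma-separated format and this
# helper are assumptions for illustration, not the repository's actual parser.
def parse_balance(balance, split_size, num_layers):
    parts = [int(x) for x in balance.split(",")]
    if len(parts) != split_size:
        raise ValueError("balance must list one layer count per model partition")
    if sum(parts) != num_layers:
        raise ValueError("balance entries must sum to --num-layers")
    return parts

# e.g. an 18-layer AmoebaNet split across 4 partitions:
print(parse_balance("5,5,4,4", split_size=4, num_layers=18))  # -> [5, 5, 4, 4]
```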
benchmarks/gems_master_model/benchmark_amoebanet_gems_master.py
@@ -38,7 +38,7 @@ def __getattr__(self, attr):
return getattr(self.stream, attr)


def init_processes(backend="tcp"):
def init_processes(backend="mpi"):
"""Initialize the distributed environment."""
dist.init_process_group(backend)
size = dist.get_world_size()
@@ -57,19 +57,20 @@ def init_processes(backend="tcp"):
# 2: Cifar
# 3: synthetic
APP = args.app
times = args.times
image_size = int(args.image_size)
num_layers = args.num_layers
num_filters = args.num_filters
balance = args.balance
mp_size = args.split_size
datapath = args.datapath
num_workers = args.num_workers
num_classes = args.num_classes

##################### AmoebaNet GEMS model specific parameters #####################

image_size_seq = 512
ENABLE_ASYNC = True
times = 2

###############################################################################
mpi_comm = gems_comm.MPIComm(split_size=mp_size, ENABLE_MASTER=True)
@@ -192,7 +193,7 @@ def init_processes(backend="tcp"):
trainset,
batch_size=times * batch_size,
shuffle=True,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = len(my_dataloader.dataset)
@@ -212,10 +213,10 @@ def init_processes(backend="tcp"):
trainset,
batch_size=times * batch_size,
shuffle=False,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = 50000
size_dataset = len(my_dataloader.dataset)
else:
my_dataset = torchvision.datasets.FakeData(
size=10 * batch_size,
@@ -229,7 +230,7 @@ def init_processes(backend="tcp"):
my_dataset,
batch_size=batch_size * times,
shuffle=False,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = 10 * batch_size
benchmarks/gems_master_model/benchmark_resnet_gems_master.py
@@ -38,7 +38,7 @@ def __getattr__(self, attr):
return getattr(self.stream, attr)


def init_processes(backend="tcp"):
def init_processes(backend="mpi"):
"""Initialize the distributed environment."""
dist.init_process_group(backend)
size = dist.get_world_size()
@@ -58,20 +58,21 @@ def init_processes(backend="tcp"):
# 2: Cifar
# 3: synthetic
APP = args.app
times = args.times
image_size = int(args.image_size)
num_layers = args.num_layers
num_filters = args.num_filters
balance = args.balance
mp_size = args.split_size
datapath = args.datapath
num_workers = args.num_workers
num_classes = args.num_classes

################## ResNet model specific parameters/functions ##################

image_size_seq = 32
ENABLE_ASYNC = True
resnet_n = 12
times = 2


def get_depth(version, n):
@@ -208,7 +209,7 @@ def get_depth(version, n):
trainset,
batch_size=times * batch_size,
shuffle=True,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = len(my_dataloader.dataset)
@@ -220,7 +221,7 @@ def get_depth(version, n):
trainset,
batch_size=times * batch_size,
shuffle=False,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = len(my_dataloader.dataset)
@@ -237,7 +238,7 @@ def get_depth(version, n):
my_dataset,
batch_size=batch_size * times,
shuffle=False,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = 10 * batch_size
7 changes: 4 additions & 3 deletions benchmarks/layer_parallelism/benchmark_amoebanet_lp.py
@@ -70,6 +70,7 @@ def __getattr__(self, attr):
mp_size = args.split_size
times = args.times
datapath = args.datapath
num_workers = args.num_workers
# APP
# 1: Medical
# 2: Cifar
@@ -186,7 +187,7 @@ def __getattr__(self, attr):
trainset,
batch_size=times * batch_size,
shuffle=True,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = len(my_dataloader.dataset)
@@ -198,7 +199,7 @@ def __getattr__(self, attr):
trainset,
batch_size=times * batch_size,
shuffle=False,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = 50000
@@ -215,7 +216,7 @@ def __getattr__(self, attr):
my_dataset,
batch_size=batch_size * times,
shuffle=False,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = 10 * batch_size
7 changes: 4 additions & 3 deletions benchmarks/layer_parallelism/benchmark_resnet_lp.py
@@ -67,6 +67,7 @@ def __getattr__(self, attr):
mp_size = args.split_size
times = args.times
datapath = args.datapath
num_workers = args.num_workers
# APP
# 1: Medical
# 2: Cifar
@@ -197,7 +198,7 @@ def get_depth(version, n):
trainset,
batch_size=times * batch_size,
shuffle=True,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = len(my_dataloader.dataset)
@@ -209,7 +210,7 @@ def get_depth(version, n):
trainset,
batch_size=times * batch_size,
shuffle=False,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = 50000
@@ -226,7 +227,7 @@ def get_depth(version, n):
my_dataset,
batch_size=batch_size * times,
shuffle=False,
num_workers=0,
num_workers=num_workers,
pin_memory=True,
)
size_dataset = 10 * batch_size
18 changes: 15 additions & 3 deletions benchmarks/spatial_parallelism/README.md
@@ -14,13 +14,13 @@ $MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile {$HOSTFILE} MV2_USE_CU

- With 5 GPUs [split size: 2, num_spatial_parts: 4, spatial_size: 1]

Example to run the AmoebaNet model with a model split size of 2 (i.e., # of partitions for MP), 4 spatial partitions (# of image partitions), and a spatial size of 1 (i.e., the number of model partitions that will use spatial parallelism). In this configuration, we split the model into two parts, where the first part uses spatial parallelism.

```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np 5 --hostfile {$HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py --image-size 512 --num-spatial-parts 4 --slice-method "vertical" --split-size 2 --spatial-size 1
```
- With 9 GPUs [split size: 3, num_spatial_parts: 4, spatial_size: 2]
In this configuration, we split the model into three parts, where the first two parts use spatial parallelism.

```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np 9 --hostfile {$HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py --image-size 512 --num-spatial-parts 4 --slice-method "vertical" --split-size 3 --spatial-size 2
@@ -30,7 +30,19 @@ $MV2_HOME/bin/mpirun_rsh --export-all -np 9 --hostfile {$HOSTFILE} MV2_USE_CUDA=
Find the example to run ResNet with halo-D2 enabled to reduce communication operations. To learn more about halo-D2, refer to [Hy-Fi: Hybrid Five-Dimensional Parallel DNN Training on High-Performance GPU Clusters](https://dl.acm.org/doi/abs/10.1007/978-3-031-07312-0_6)
```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np 5 --hostfile {$HOSTFILE} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so benchmarks/spatial_parallelism/benchmark_resnet_sp.py --halo-D2 --num-spatial-parts 4 --image-size 1024 --batch-size 2 --slice-method "square"
```

## Run spatial + data parallelism:
Currently, SP + DP is supported for AmoebaNet.

- Enable data parallelism using the <i>"local-DP"</i> argument.
- Example to run the AmoebaNet model with 2 data partitions, a model split size of 2 (i.e., # of partitions for MP), 4 spatial partitions (# of image partitions), and a spatial size of 1 (i.e., the number of model partitions that will use spatial parallelism). In this configuration, we have 2 data partitions, and for each one the model is split into two parts, where the first part uses spatial parallelism. The sketch after the command below shows how these settings determine the required number of GPUs.


```bash
$MV2_HOME/bin/mpirun_rsh --export-all -np $np --hostfile ${hostfile} MV2_USE_CUDA=1 MV2_HYBRID_BINDING_POLICY=spread MV2_CPU_BINDING_POLICY=hybrid MV2_USE_GDRCOPY=0 PYTHONNOUSERSITE=true LD_PRELOAD=$MV2_HOME/lib/libmpi.so python benchmarks/spatial_parallelism/benchmark_amoebanet_sp.py --local-DP 2 --image-size ${image_size} --batch-size ${batch_size} --slice-method ${partition}
```
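
As a rough guide to sizing `-np`, the sketch below back-computes the GPU counts quoted in the examples above. The formula is inferred from those examples (5 and 9 GPUs) and should be treated as an assumption rather than a documented rule.

```python
# Hedged sketch: estimate the number of GPUs (MPI ranks) for SP, optionally with DP.
def required_gpus(split_size, num_spatial_parts, spatial_size, local_dp=1):
    # Spatial partitions cover the first `spatial_size` model splits; each remaining
    # model split gets one GPU; SP + DP replicates the whole layout `local_dp` times.
    per_replica = num_spatial_parts * spatial_size + (split_size - spatial_size)
    return local_dp * per_replica

print(required_gpus(split_size=2, num_spatial_parts=4, spatial_size=1))               # 5 GPUs
print(required_gpus(split_size=3, num_spatial_parts=4, spatial_size=2))               # 9 GPUs
print(required_gpus(split_size=2, num_spatial_parts=4, spatial_size=1, local_dp=2))   # 10 GPUs (SP + DP)
```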


Below are the available configuration options:

