Zeus2: Implementing Structure for Heterogeneous GPUs for DNN Training Energy Savings

Zeus is a framework for (1) measuring GPU energy consumption and (2) optimizing energy and time for DNN training. A summary can be found here, and the research paper for Zeus can be found here.

Zeus assumes that every GPU used during DNN training is of the same type. This extension accounts for heterogeneous GPUs. First, each GPU is profiled while training on each dataset across a range of power limits and batch sizes. These profiles are then used to generate an optimal allocation of both the global batch size and the power limits across the heterogeneous GPUs. All later training runs on the same GPUs can reuse this allocation.

Launching GPUs on AWS

We use AWS to simulate heterogeneous GPUs:

  1. Launch 2 different EC2 instances. We used g5 and g4dn instances, which correspond to Nvidia A10 and Nvidia T4 GPUs.
  2. Select a Deep Learning OSS Nvidia Driver AMI. We selected Deep Learning OSS Nvidia Driver AMI GPU PyTorch 2.1.0 (Ubuntu 20.04).

Setting up Zeus

Run a Docker container to set up Zeus. Some datasets also require mounting a volume in the docker run command.

docker run -it \
    --gpus all                  `# Mount all GPUs` \
    --cap-add SYS_ADMIN         `# Needed to change the power limit of the GPU` \
    --ipc host                  `# PyTorch DataLoader workers need enough shm` \
    mlenergy/zeus:latest \
    bash

Inside the docker container, clone this repository. Replace any Files of Interest inside zeus with the files from zeus_extension.

git clone https://github.com/AditiR-42/zeus_extension.git
  • If using Cifar100, setup is complete.
  • If using Imagenet, download the Imagenet dataset and add it as a volume to the docker container.

Generating Profiling

To generate profiling traces for each GPU, run the following command for the Cifar100 dataset:

python zeus/examples/ZeusDataLoader/cifar100/run_profiling.py \
    --profile_folder NAME \
    --epochs 1 \
    --batch_sizes 32 64 128 256 512 1024 \
    --power_limits 70 65 60

or the following command for the Imagenet dataset:

python zeus/examples/imagenet/run_profiling.py \
    --profile_folder NAME \
    --epochs 1 \
    --batch_sizes 32 64 128 256 512 \
    --power_limits 70 65 60

The profile_folder should be a unique string, epochs can be set to 1, batch_sizes depend on the dataset, and power_limits depend on the GPU type. If needed, warmup_step and profiling_steps can also be edited via command-line arguments. For more information on setting power limits and batch sizes, see Determining Constants.
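As a rough illustration of how much work a profiling run involves, the sketch below enumerates the configurations implied by the Cifar100 command above. It assumes the profiling script measures every combination of batch size and power limit; `profiling_grid` is a hypothetical helper, not part of the repository.

```python
from itertools import product


def profiling_grid(batch_sizes, power_limits):
    """Return every (batch_size, power_limit) pair a full sweep would cover.

    Assumption: the profiler visits the full cross product of the two lists.
    """
    return list(product(batch_sizes, power_limits))


# Values taken from the Cifar100 command above (power limits in watts).
grid = profiling_grid([32, 64, 128, 256, 512, 1024], [70, 65, 60])
print(f"{len(grid)} configurations to profile")
for batch_size, power_limit in grid[:3]:
    print(f"batch_size={batch_size}, power_limit={power_limit} W")
```

Under that assumption, adding one more power limit adds one configuration per batch size, which is why a single epoch per configuration is usually enough for profiling.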

The example trace files generated (for Cifar100 and Imagenet on A10 and T4 GPUs) can be viewed in the trace_aws folder.

Running Algorithm

Determine which of the two GPUs is stronger by comparing the peak FLOPS figures in their datasheets. In our case, A10 is stronger than T4. Then run the following command, ensuring that gpu1 and trace1 correspond to the stronger of the two GPUs.

python zeus_heterogeneous_algorithm.py --gpu1 NAME --gpu2 NAME --trace1 PATH --trace2 PATH

The resulting output will show the optimal power limit and global batch size allocation for each GPU using the brute force, heuristic, and baseline methods.
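To make the idea of a brute force search concrete, here is a minimal sketch. It is not the code in zeus_heterogeneous_algorithm.py: the trace layout (a dictionary from (batch size, power limit) to (seconds, joules) per step), the objective (lowest total energy, ties broken by step time), and every number in the example are assumptions chosen for illustration.

```python
from itertools import product


def brute_force_allocation(trace1, trace2, global_batch_size):
    """Try every pairing of profiled configurations and keep the cheapest.

    trace1 / trace2: hypothetical dicts mapping (batch_size, power_limit)
    to (seconds_per_step, joules_per_step) for each GPU.
    Returns (energy, step_time, config1, config2), or None if no pairing
    adds up to the global batch size.
    """
    best = None
    for (bs1, pl1), (bs2, pl2) in product(trace1, trace2):
        if bs1 + bs2 != global_batch_size:
            continue  # the two shares must add up to the global batch size
        time1, energy1 = trace1[(bs1, pl1)]
        time2, energy2 = trace2[(bs2, pl2)]
        step_time = max(time1, time2)  # the slower GPU sets the step time
        energy = energy1 + energy2     # simplification: idle energy ignored
        candidate = (energy, step_time, (bs1, pl1), (bs2, pl2))
        if best is None or candidate < best:
            best = candidate
    return best


# Placeholder traces, NOT measured values.
trace_strong = {(128, 300): (0.14, 36.0), (128, 200): (0.18, 30.0),
                (64, 300): (0.10, 25.0)}
trace_weak = {(64, 70): (0.30, 19.0), (64, 60): (0.34, 17.0),
              (128, 70): (0.55, 35.0)}
best = brute_force_allocation(trace_strong, trace_weak, 192)
print(best)
```

The search cost grows with the product of the two trace sizes, which is presumably why a heuristic method is reported alongside the brute force result.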

Training Model

To train the model, run the following command for the Cifar100 dataset:

python examples/ZeusDataLoader/cifar100/train.py \
    --epochs INT \
    --power_limit INT \
    --gpu_index INT \
    --gpu_split INT

or the following command for the Imagenet dataset:

python examples/imagenet/train_single.py \
    --epochs INT \
    --power_limit INT \
    --gpu_index INT \
    --gpu_split INT \
    --data /imagenet

epochs is user-defined (or can be left at its default). power_limit should be the optimal power limit reported by the algorithm in the previous step. gpu_split is determined by calculating how much of the global batch size is allocated to the stronger GPU. gpu_index should be 0 for the GPU with the smaller workload and 1 for the GPU with the larger workload. For example, with a 40-60 workload split, gpu_split is 40 on both GPUs, gpu_index is 0 on the GPU taking the 40% share, and gpu_index is 1 on the GPU taking the 60% share.
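The arithmetic behind a percentage split can be sketched as follows. This assumes, based on the 40-60 example above, that gpu_split is the percentage handled by gpu_index 0 and that gpu_index 1 takes the remainder; `split_batch` is a hypothetical helper and may not match how the train files compute the shares.

```python
def split_batch(global_batch_size, gpu_split, gpu_index):
    """Return the number of samples per step for one GPU.

    Assumption: gpu_split is the percentage given to gpu_index 0, and
    gpu_index 1 receives whatever is left, so the shares always add up.
    """
    if not 0 < gpu_split < 100:
        raise ValueError("gpu_split must be between 1 and 99")
    if gpu_index not in (0, 1):
        raise ValueError("gpu_index must be 0 or 1")
    first_share = global_batch_size * gpu_split // 100
    return first_share if gpu_index == 0 else global_batch_size - first_share


# A 40-60 split of a global batch size of 1024.
small = split_batch(1024, 40, 0)
large = split_batch(1024, 40, 1)
print(small, large)
```

Giving the remainder to the second GPU, rather than rounding both shares independently, keeps the two shares summing exactly to the global batch size.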

The train files will automatically shard the model across the two GPUs according to the gpu_split. The final output of the train files will be the Time (s) taken and Energy (J) consumed during training. These results can then be compared across the baseline and heuristic methods.

Appendix

Determining Constants

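Power limits depend on the GPU. To determine which power limits to use for profiling, run nvidia-smi -pl 0: the request is rejected, and the error message reports the minimum and maximum power limits the GPU accepts. Running nvidia-smi -q -d POWER lists the same range without attempting a change. For A10, we used power limits of [300, 250, 200, 150, 100]. For T4, we used [70, 65, 60].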
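Once the range is known, a list of evenly spaced power limits can be generated rather than typed by hand. The sketch below is one way to do that; `candidate_power_limits` is a hypothetical helper, and the ranges and step sizes in the example are assumptions picked so that the output reproduces the lists above, not values confirmed for these GPUs.

```python
def candidate_power_limits(min_watts, max_watts, step_watts):
    """List power limits from max down to min (inclusive) in fixed steps."""
    if step_watts <= 0:
        raise ValueError("step_watts must be positive")
    if min_watts > max_watts:
        raise ValueError("min_watts must not exceed max_watts")
    return list(range(max_watts, min_watts - 1, -step_watts))


# Assumed ranges: 100-300 W in 50 W steps, and 60-70 W in 5 W steps.
wide_range = candidate_power_limits(100, 300, 50)
narrow_range = candidate_power_limits(60, 70, 5)
print(wide_range)
print(narrow_range)
```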

Batch size depends on the model and dataset. To determine which batch sizes to use, experiment with the training file to see how large a batch size is possible for that particular model. Batch sizes should generally increase by powers of 2. For Cifar100, we used batch sizes of [32, 64, 128, 256, 512, 1024]. For Imagenet, we used batch sizes of [32, 64, 128, 256, 512].
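The doubling pattern can be written as a short helper. This is only a sketch: `candidate_batch_sizes` is a hypothetical name, and the largest value still has to be found by experiment as described above.

```python
def candidate_batch_sizes(smallest, largest):
    """Return batch sizes that double from smallest up to largest (inclusive)."""
    if smallest <= 0:
        raise ValueError("smallest must be positive")
    sizes = []
    size = smallest
    while size <= largest:
        sizes.append(size)
        size *= 2
    return sizes


cifar100_sizes = candidate_batch_sizes(32, 1024)
imagenet_sizes = candidate_batch_sizes(32, 512)
print(cifar100_sizes)
print(imagenet_sizes)
```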

Files of Interest