This benchmark compares the performance of Colossal-AI and DeepSpeed in terms of their zero redundancy optimizer (ZeRO) and offloading implementations. The script is adapted from the Hugging Face example.
First, you need to install the following libraries.
```bash
# assuming CUDA 11.3 is used
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
pip install accelerate==0.10.0 datasets==1.18.4 transformers==4.21.0 deepspeed==0.6.5 tqdm
```
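Optionally, you can confirm the pinned versions were installed with a quick check (not part of the benchmark; it assumes each package exposes a `__version__` attribute, as these libraries do):

```python
# optional sanity check: print the installed versions of the pinned libraries
import colossalai
import datasets
import deepspeed
import torch
import transformers

for pkg in (torch, colossalai, transformers, datasets, deepspeed):
    print(pkg.__name__, pkg.__version__)
```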
To run the benchmark with different acceleration libraries, simply execute the corresponding bash script on a single node. We recommend running the `run_opt_clm.sh` script with one GPU first so that all the necessary files are downloaded from Hugging Face.
```bash
# run with DeepSpeed ZeRO3 + offloading
bash ./run_opt_clm.sh

# run with the current version of the Colossal-AI ZeRO module
bash ./run_opt_clm_colossalai.sh

# run with the newer (experimental) version of the Colossal-AI ZeRO module
bash ./run_opt_clm_colossalai_new.sh
```
Each script has 4 arguments:

- `BS`: the batch size per GPU.
- `MEMCAP`: whether to limit the GPU memory usage. For example, if `MEMCAP=40`, the program will only use 40 GB of memory even if the GPU has 80 GB. If `MEMCAP=0`, there is no limit on the available GPU memory. The default value is 0 (see the sketch after this list).
- `MODEL`: the variant of the OPT model. The default is `13B`.
- `GPUNUM`: the number of GPUs to use. The default is 8.
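As a rough illustration of what a `MEMCAP`-style limit does (this helper is hypothetical and not the benchmark scripts' actual implementation), a per-process GPU memory cap can be enforced in PyTorch like this:

```python
import torch

def cap_gpu_memory(memcap_gb: int, device: int = 0) -> None:
    """Illustrative helper: restrict this process to `memcap_gb` GB of GPU memory.

    A value of 0 means no limit, mirroring the MEMCAP default above.
    """
    if memcap_gb <= 0:
        return
    total_gb = torch.cuda.get_device_properties(device).total_memory / 2**30
    torch.cuda.set_per_process_memory_fraction(min(memcap_gb / total_gb, 1.0), device)
```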
If you encounter OOM errors with Colossal-AI, tune the parameters in `colossalai_zero.py`. More specifically, you can try decreasing `warmup_non_model_data_ratio` and `gpu_margin_mem_ratio`.
```python
# try to decrease warmup_non_model_data_ratio and gpu_margin_mem_ratio
# if you encounter an OOM error
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy=_policy,
                              reuse_fp16_shard=True,
                              warmup_non_model_data_ratio=0.7),
            optimizer_config=dict(gpu_margin_mem_ratio=0.8, initial_scale=2**8))
```
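For example, the lower-memory setting benchmarked in the results table below (the `auto` policy with `warmup_non_model_data_ratio=0.4` and `gpu_margin_mem_ratio=0.5`) would keep everything else the same and only lower these two ratios:

```python
# lower-memory variant corresponding to the results table below
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy=_policy,
                              reuse_fp16_shard=True,
                              warmup_non_model_data_ratio=0.4),
            optimizer_config=dict(gpu_margin_mem_ratio=0.5, initial_scale=2**8))
```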
We ran our code on the following hardware platform:

| Hardware | Specification |
| --- | --- |
| #GPUs per Node | 8 |
| GPU | A100 (80 GB) |
| CPU Memory per Node | 1900 GB |
| #vCPU | 110 |
| RDMA | Yes |
The following are the results for Colossal-AI vs. DeepSpeed:
| #Node | #GPUs | Model | System | Policy | Batch Size Per GPU | Global Batch Size | Step Time (s) | Max Allocated (GB) | Max Reserved (GB) | Throughput (samples per second) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 24 | 24 | 51.74 | 32.38 | 77.31 | 0.463 |
| | | | | | 32 | 32 | 64.88 | 41.88 | 72.65 | 0.493 |
| | | | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 24 | 24 | 41.50 | 71.05 | 77.06 | 0.578 |
| | | | | | 32 | 32 | OOM | | | |
| | | | | auto (warmup_non_model_data_ratio=0.4, gpu_margin_mem_ratio=0.5) | 32 | 32 | 51.50 | 72.73 | 77.21 | 0.621 |
| | | | | cpu | 32 | 32 | 91.68 | 45.33 | 76.38 | 0.349 |
| | 8 | OPT-30B | DeepSpeed | ZeRO3 | 16 | 128 | 73.95 | 32.19 | 76.38 | 1.73 |
| | | | | | 32 | 256 | 99.86 | 59.89 | 76.59 | 2.56 |
| | | | Colossal-AI | auto | 16 | 128 | 37.48 | 63.61 | 76.22 | 3.41 |
| | | | | | 32 | 256 | OOM | | | |
| | | | | cpu | 32 | 256 | 84.78 | 65.21 | 75.63 | 3.02 |
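The throughput column approximately equals the global batch size divided by the step time, which you can verify directly from the table (a sanity check on the reported numbers, not part of the benchmark scripts):

```python
# sanity check: throughput ~= global batch size / step time (s)
rows = [
    (24, 51.74, 0.463),   # OPT-13B, DeepSpeed ZeRO3
    (128, 37.48, 3.41),   # OPT-30B, Colossal-AI auto
    (256, 84.78, 3.02),   # OPT-30B, Colossal-AI cpu
]
for global_bs, step_time, reported in rows:
    print(f"computed {global_bs / step_time:.3f} samples/s vs reported {reported}")
```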