OPT-Benchmark

This benchmark compares the performance of Colossal-AI and DeepSpeed in terms of their zero redundancy optimizer (ZeRO) and offloading implementations. The script is adapted from the Hugging Face causal language modeling example.

Run Benchmarking

First, you need to install the following libraries.

# assuming CUDA 11.3
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
pip install accelerate==0.10.0 datasets==1.18.4 transformers==4.21.0 deepspeed==0.6.5 tqdm
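
To quickly verify the environment, you can check that all the libraries import cleanly (an optional sanity check, not part of the original benchmark):

# optional: verify that all libraries are installed and importable
python -c "import torch, colossalai, deepspeed, transformers, datasets, accelerate; print(torch.__version__, transformers.__version__, deepspeed.__version__)"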

To run the benchmark with different acceleration libraries, execute the corresponding bash script below on a single node. We recommend running run_opt_clm.sh with one GPU first so that all the necessary files are downloaded from Hugging Face.

# run with deepspeed zero 3 + offloading
bash ./run_opt_clm.sh

# run with the current version of colossal-ai zero module
bash ./run_opt_clm_colossalai.sh

# run with the newer (experimental) version of colossal-ai zero module
bash ./run_opt_clm_colossalai_new.sh

Each script takes 4 arguments (see the example invocation after this list).

  • BS: batch size per GPU
  • MEMCAP: whether to limit GPU memory usage. For example, if MEMCAP=40, the program will only use 40 GB of memory even if the GPU has 80 GB. If MEMCAP=0, there is no limit on the available GPU memory. The default value is 0.
  • MODEL: the OPT model variant; the default is 13B.
  • GPUNUM: the number of GPUs to use; the default is 8.
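
How these arguments are passed depends on the scripts themselves; assuming they are read from the environment (check the variable definitions at the top of each script), an invocation could look like the following sketch:

# hypothetical example, assuming BS, MEMCAP, MODEL and GPUNUM are read from the environment;
# otherwise, edit the corresponding variables at the top of the script
export BS=16 MEMCAP=0 MODEL="13b" GPUNUM=8
bash ./run_opt_clm_colossalai.sh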

If you encounter an OOM error with Colossal-AI, tune the parameters in colossalai_zero.py. More specifically, try decreasing warmup_non_model_data_ratio and gpu_margin_mem_ratio.

# try to decrease warmup_non_model_data_ratio and gpu_margin_mem_ratio
# if you encounter an OOM error
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy=_policy,  # tensor placement policy, e.g. "auto" or "cpu"
                              reuse_fp16_shard=True,
                              warmup_non_model_data_ratio=0.7),
            optimizer_config=dict(gpu_margin_mem_ratio=0.8, initial_scale=2**8))

Test Results

We ran our code on the following hardware platform:

  • #GPUs per node: 8
  • GPU: A100 (80 GB)
  • CPU memory per node: 1900 GB
  • #vCPUs: 110
  • RDMA: yes

The following are the results for Colossal-AI vs. DeepSpeed:

| #Nodes | #GPUs | Model | System | Policy | Batch Size per GPU | Global Batch Size | Step Time (s) | Max Allocated (GB) | Max Reserved (GB) | Throughput (samples/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 24 | 24 | 51.74 | 32.38 | 77.31 | 0.463 |
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 32 | 32 | 64.88 | 41.88 | 72.65 | 0.493 |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 24 | 24 | 41.50 | 71.05 | 77.06 | 0.578 |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 32 | 32 | OOM | - | - | - |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.4, gpu_margin_mem_ratio=0.5) | 32 | 32 | 51.50 | 72.73 | 77.21 | 0.621 |
| 1 | 1 | OPT-13B | Colossal-AI | cpu | 32 | 32 | 91.68 | 45.33 | 76.38 | 0.349 |
| 1 | 8 | OPT-30B | DeepSpeed | ZeRO3 | 16 | 128 | 73.95 | 32.19 | 76.38 | 1.73 |
| 1 | 8 | OPT-30B | DeepSpeed | ZeRO3 | 32 | 256 | 99.86 | 59.89 | 76.59 | 2.56 |
| 1 | 8 | OPT-30B | Colossal-AI | auto | 16 | 128 | 37.48 | 63.61 | 76.22 | 3.41 |
| 1 | 8 | OPT-30B | Colossal-AI | auto | 32 | 256 | OOM | - | - | - |
| 1 | 8 | OPT-30B | Colossal-AI | cpu | 32 | 256 | 84.78 | 65.21 | 75.63 | 3.02 |
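
As a quick sanity check on the table, throughput is roughly the global batch size divided by the step time; for example, for the OPT-13B DeepSpeed run with batch size 24:

# throughput ≈ global batch size / step time (in seconds)
python -c "print(24 / 51.74)"  # ≈ 0.464 samples/s, matching the reported 0.463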
