OPT-Benchmark

This benchmark compares the performance of Colossal-AI and DeepSpeed in terms of their zero redundancy optimizer (ZeRO) and offloading implementations. The script is adapted from the Hugging Face causal language modeling example.

Run Benchmarking

First, you need to install the following libraries.

# assuming CUDA 11.3
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3 -c pytorch
pip install colossalai==0.1.10+torch1.11cu11.3 -f https://release.colossalai.org
pip install accelerate==0.10.0 datasets==1.18.4 transformers==4.21.0 deepspeed==0.6.5 tqdm
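
To quickly verify the environment, you can check that all the libraries import cleanly (an optional sanity check, not part of the original benchmark):

# optional: verify that all libraries are installed and importable
python -c "import torch, colossalai, deepspeed, transformers, datasets, accelerate; print(torch.__version__, transformers.__version__, deepspeed.__version__)"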

To run the benchmark with different acceleration libraries, execute the corresponding bash script below on a single node. We recommend running run_opt_clm.sh with one GPU first so that all the necessary files are downloaded from Hugging Face.

# run with deepspeed zero 3 + offloading
bash ./run_opt_clm.sh

# run with the current version of colossal-ai zero module
bash ./run_opt_clm_colossalai.sh

# run with the newer (experimental) version of colossal-ai zero module
bash ./run_opt_clm_colossalai_new.sh

Each script takes 4 arguments (see the example invocation after this list).

  • BS: batch size per GPU
  • MEMCAP: whether to limit GPU memory usage. For example, if MEMCAP=40, the program will only use 40 GB of memory even if the GPU has 80 GB. If MEMCAP=0, there is no limit on the available GPU memory. The default value is 0.
  • MODEL: the OPT model variant; the default is 13B.
  • GPUNUM: the number of GPUs to use; the default is 8.
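
How these arguments are passed depends on the scripts themselves; assuming they are read from the environment (check the variable definitions at the top of each script), an invocation could look like the following sketch:

# hypothetical example, assuming BS, MEMCAP, MODEL and GPUNUM are read from the environment;
# otherwise, edit the corresponding variables at the top of the script
export BS=16 MEMCAP=0 MODEL="13b" GPUNUM=8
bash ./run_opt_clm_colossalai.sh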

If you encounter an OOM error with Colossal-AI, tune the parameters in colossalai_zero.py. More specifically, try decreasing warmup_non_model_data_ratio and gpu_margin_mem_ratio.

# try to decrease warmup_non_model_data_ratio and gpu_margin_mem_ratio
# if you encounter an OOM error
zero = dict(model_config=dict(shard_strategy=TensorShardStrategy(),
                              tensor_placement_policy=_policy,  # tensor placement policy, e.g. "auto" or "cpu"
                              reuse_fp16_shard=True,
                              warmup_non_model_data_ratio=0.7),
            optimizer_config=dict(gpu_margin_mem_ratio=0.8, initial_scale=2**8))

Test Results

We ran our code on the following hardware platform:

  • #GPUs per node: 8
  • GPU: A100 (80 GB)
  • CPU memory per node: 1900 GB
  • #vCPUs: 110
  • RDMA: yes

The following are the results for Colossal-AI vs. DeepSpeed:

| #Nodes | #GPUs | Model | System | Policy | Batch Size per GPU | Global Batch Size | Step Time (s) | Max Allocated (GB) | Max Reserved (GB) | Throughput (samples/s) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 24 | 24 | 51.74 | 32.38 | 77.31 | 0.463 |
| 1 | 1 | OPT-13B | DeepSpeed | ZeRO3 | 32 | 32 | 64.88 | 41.88 | 72.65 | 0.493 |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 24 | 24 | 41.50 | 71.05 | 77.06 | 0.578 |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.7, gpu_margin_mem_ratio=0.8) | 32 | 32 | OOM | - | - | - |
| 1 | 1 | OPT-13B | Colossal-AI | auto (warmup_non_model_data_ratio=0.4, gpu_margin_mem_ratio=0.5) | 32 | 32 | 51.50 | 72.73 | 77.21 | 0.621 |
| 1 | 1 | OPT-13B | Colossal-AI | cpu | 32 | 32 | 91.68 | 45.33 | 76.38 | 0.349 |
| 1 | 8 | OPT-30B | DeepSpeed | ZeRO3 | 16 | 128 | 73.95 | 32.19 | 76.38 | 1.73 |
| 1 | 8 | OPT-30B | DeepSpeed | ZeRO3 | 32 | 256 | 99.86 | 59.89 | 76.59 | 2.56 |
| 1 | 8 | OPT-30B | Colossal-AI | auto | 16 | 128 | 37.48 | 63.61 | 76.22 | 3.41 |
| 1 | 8 | OPT-30B | Colossal-AI | auto | 32 | 256 | OOM | - | - | - |
| 1 | 8 | OPT-30B | Colossal-AI | cpu | 32 | 256 | 84.78 | 65.21 | 75.63 | 3.02 |
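
As a quick sanity check on the table, throughput is roughly the global batch size divided by the step time; for example, for the OPT-13B DeepSpeed run with batch size 24:

# throughput ≈ global batch size / step time (in seconds)
python -c "print(24 / 51.74)"  # ≈ 0.464 samples/s, matching the reported 0.463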
