
[llm_perf issue] Using byte_infer_perf/llm_perf/launch.py to test chatglm, but hitting multi-process GPU contention #112

Open
danielhua23 opened this issue Oct 11, 2024 · 3 comments


danielhua23 commented Oct 11, 2024

Error description

Machine: H100-80G-HBM3

Testing with the chatglm-6b-xxx.json configuration below, the run hits OOM at tp1, bs24, inputlen1024.

[screenshot: workload JSON configuration]

[screenshot: OOM error log]

After modifying the JSON configuration as below so that the sweep starts directly from tp1, bs24, inputlen1024, that same tp1, bs24, inputlen1024 case runs normally.

[screenshot: modified workload JSON configuration]

[screenshot: successful run log]

Judging from https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/launch.py#L260, my guess is that launch.py does not, as apparently intended, wait for each configuration's subprocess to finish before launching the next one. The subprocesses then compete for the GPU, so a configuration that runs fine on its own hits OOM under contention.
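If that is what happens, the expected behavior would be to block on each child before spawning the next. The sketch below only illustrates that behavior and is not the repository's actual code; the worker script bench_one_config.py and the sweep values are hypothetical stand-ins.

```python
# Illustration only: run one benchmark subprocess per configuration and
# wait for it to exit (releasing its GPU memory) before starting the next.
# "bench_one_config.py" and the sweep values below are hypothetical.
import itertools
import subprocess

tp_sizes = [1]
batch_sizes = [8, 16, 24]
input_lens = [1024, 2048]

for tp, bs, seq_len in itertools.product(tp_sizes, batch_sizes, input_lens):
    proc = subprocess.Popen(
        ["python3", "bench_one_config.py",
         "--tp", str(tp), "--bs", str(bs), "--input_len", str(seq_len)]
    )
    # The wait() is the crucial part: without it, the next configuration
    # starts while this one still holds GPU memory, and the two workers
    # contend for the same device -- the suspected cause of the OOM above.
    returncode = proc.wait()
    if returncode != 0:
        print(f"config tp={tp} bs={bs} input_len={seq_len} "
              f"exited with code {returncode}")
```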

Reproduction steps

step1 launch container
docker run --net=host --pid=host --ipc=host --shm-size 64g --privileged -it --gpus all -v xxx:xxx --name xxxx nvcr.io/nvidia/pytorch:24.08-py3
step2 enter dir of launch.py
pip install -r requirements.txt
step3 modify workloads/chatglm2-torch-fp16-6b.json as shown above
step4 run
python3 launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b
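
As a side note (my addition, not part of the original repro steps): while launch.py is running, polling which processes hold GPU memory makes the contention visible; two benchmark PIDs resident on the same device at once would support the diagnosis above.

```python
# Diagnostic helper: poll which processes currently hold GPU memory.
# Seeing two benchmark PIDs at the same time would confirm that the
# per-configuration subprocesses overlap on the device.
import subprocess
import time

for _ in range(30):  # poll for about a minute
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out or "<no compute processes>")
    time.sleep(2)
```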

danielhua23 (Author) commented:

@suisiyuan Hi, could you take a look at this when you have a moment?

suisiyuan (Collaborator) commented:

> @suisiyuan Hi, could you take a look at this when you have a moment?

Sure, I'll take a look on my end. It's probably a process-management issue.

danielhua23 (Author) commented:

> Sure, I'll take a look on my end. It's probably a process-management issue.

Thanks for your time.
