
[llm_perf issue] Using byte_infer_perf/llm_perf/launch.py to test chatglm, but hitting multi-process GPU contention #112

Open
danielhua23 opened this issue Oct 11, 2024 · 3 comments


danielhua23 commented Oct 11, 2024

Error description

Machine: H100-80G-HBM3

Testing with the chatglm-6b-xxx.json configuration below, the run hits OOM at tp1, bs24, inputlen1024.

[screenshot: workload JSON configuration]

[screenshot: OOM error log]

After modifying the JSON configuration as below so that the sweep starts directly from tp1, bs24, inputlen1024, that same tp1, bs24, inputlen1024 case runs normally.

[screenshot: modified workload JSON configuration]

[screenshot: successful run log]

Judging from https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/launch.py#L260, my guess is that launch.py does not, as apparently intended, wait for each configuration's subprocess to finish before launching the next one. The subprocesses then compete for the GPU, so a configuration that runs fine on its own hits OOM under contention.
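If that is what happens, the expected behavior would be to block on each child before spawning the next. The sketch below only illustrates that behavior and is not the repository's actual code; the worker script bench_one_config.py and the sweep values are hypothetical stand-ins.

```python
# Illustration only: run one benchmark subprocess per configuration and
# wait for it to exit (releasing its GPU memory) before starting the next.
# "bench_one_config.py" and the sweep values below are hypothetical.
import itertools
import subprocess

tp_sizes = [1]
batch_sizes = [8, 16, 24]
input_lens = [1024, 2048]

for tp, bs, seq_len in itertools.product(tp_sizes, batch_sizes, input_lens):
    proc = subprocess.Popen(
        ["python3", "bench_one_config.py",
         "--tp", str(tp), "--bs", str(bs), "--input_len", str(seq_len)]
    )
    # The wait() is the crucial part: without it, the next configuration
    # starts while this one still holds GPU memory, and the two workers
    # contend for the same device -- the suspected cause of the OOM above.
    returncode = proc.wait()
    if returncode != 0:
        print(f"config tp={tp} bs={bs} input_len={seq_len} "
              f"exited with code {returncode}")
```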

Reproduction steps

step1 launch container
docker run --net=host --pid=host --ipc=host --shm-size 64g --privileged -it --gpus all -v xxx:xxx --name xxxx nvcr.io/nvidia/pytorch:24.08-py3
step2 enter dir of launch.py
pip install -r requirements.txt
step3 modify workloads/chatglm2-torch-fp16-6b.json as shown above
step4 run
python3 launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b
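
As a side note (my addition, not part of the original repro steps): while launch.py is running, polling which processes hold GPU memory makes the contention visible; two benchmark PIDs resident on the same device at once would support the diagnosis above.

```python
# Diagnostic helper: poll which processes currently hold GPU memory.
# Seeing two benchmark PIDs at the same time would confirm that the
# per-configuration subprocesses overlap on the device.
import subprocess
import time

for _ in range(30):  # poll for about a minute
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out or "<no compute processes>")
    time.sleep(2)
```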

danielhua23 (Author) commented:

@suisiyuan Hi, could you take a look at this when you have a moment?

suisiyuan (Collaborator) commented:

> @suisiyuan Hi, could you take a look at this when you have a moment?

Sure, I'll take a look on my end. It's probably a process-management issue.

danielhua23 (Author) commented:

> Sure, I'll take a look on my end. It's probably a process-management issue.

Thanks for your time.
