@suisiyuan Hi, could you help take a look at this when you have time?
Sure, I'll take a look on my end; it's most likely a process-management issue.
Thanks for your time.
Bug description
Machine: h100-80g-hbm3
Testing with the chatglm-6b-xxx.json configuration below, the run hits an OOM at tp1, bs24, inputlen1024.
After changing the json configuration as follows and starting the sweep directly from tp1, bs24, inputlen1024, that same tp1, bs24, inputlen1024 configuration runs normally.
Judging from the code at https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/launch.py#L260, my guess is that the launcher fails to wait for each configuration's subprocess to finish before launching the next one, as it is apparently intended to. The subprocesses then contend for the GPU, so a configuration that can run on its own OOMs under that contention.
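The expected wait-before-launch behavior can be sketched as follows. This is a hypothetical illustration, not the actual launch.py code; `run_config` and `launch_sequentially` are made-up names standing in for the real per-configuration benchmark entry point and scheduling loop:

```python
# Hypothetical sketch, NOT the actual ByteMLPerf launch.py code: run one
# benchmark subprocess per (tp, bs, input_len) configuration and block
# until it has fully exited, so its GPU memory is released before the
# next configuration starts (avoiding the contention-induced OOM).
import multiprocessing as mp


def run_config(tp, bs, input_len):
    # Placeholder for the real per-configuration benchmark entry point.
    print(f"running tp={tp} bs={bs} input_len={input_len}")


def launch_sequentially(configs):
    exit_codes = []
    for tp, bs, input_len in configs:
        p = mp.Process(target=run_config, args=(tp, bs, input_len))
        p.start()
        p.join()  # wait for this config's process to exit completely...
        exit_codes.append(p.exitcode)
        # ...so the GPU is free before the next subprocess launches
    return exit_codes


if __name__ == "__main__":
    launch_sequentially([(1, 24, 1024), (1, 32, 1024)])
```

If the real launcher instead starts the next subprocess while the previous one is still tearing down, both can hold GPU memory simultaneously, which matches the symptom reported above.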
Steps to reproduce
Step 1: launch the container
docker run --net=host --pid=host --ipc=host --shm-size 64g --privileged -it --gpus all -v xxx:xxx --name xxxx nvcr.io/nvidia/pytorch:24.08-py3
Step 2: enter the directory containing launch.py
pip install -r requirements.txt
Step 3: edit workloads/chatglm2-torch-fp16-6b.json as shown above
Step 4: run
python3 launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b