Multi-GPU training doesn't seem to run concurrently? #18
Comments
That's expected, since that's how the current code is set up. If you want to switch to data parallel, change device_map="auto" to:
OK, thanks!
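(The snippet referred to in the answer above is not preserved in this thread. As a rough illustration only — the model name and launch command below are assumptions, not from the repo — the usual distinction with Hugging Face transformers is: device_map="auto" shards the layers across all visible GPUs, i.e. naive model parallelism where the GPUs run one after another, whereas data parallelism loads the full model onto each process's own GPU, typically keyed by LOCAL_RANK under torchrun.)

```python
import os
import torch
from transformers import AutoModelForCausalLM

MODEL_NAME = "baichuan-inc/Baichuan-7B"  # hypothetical placeholder for the model actually used

# Option A: model parallelism. accelerate splits the layers across all visible GPUs,
# so during a forward/backward pass the GPUs are busy one at a time.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Option B: data parallelism (launch with e.g. `torchrun --nproc_per_node=8 train.py`).
# Each process loads the whole model onto its own GPU, indexed by LOCAL_RANK.
# local_rank = int(os.environ.get("LOCAL_RANK", 0))
# model = AutoModelForCausalLM.from_pretrained(
#     MODEL_NAME,
#     trust_remote_code=True,
#     torch_dtype=torch.float16,
#     device_map={"": local_rank},
# )
```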
Hi, I'm training with model parallel and it hung after 29 steps: GPU utilization at 0%, CPU at 100%. Have you ever run into this?
Haven't hit that so far. Are there any error messages?
Crashes midway through training are usually caused by running out of RAM/VRAM. Try turning down parameters like batch size and gradient accumulation and see?
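(A minimal sketch of that kind of adjustment, assuming the training script uses Hugging Face TrainingArguments; the concrete values below are placeholders, not taken from this thread.)

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,    # lower this if you hit CUDA out-of-memory
    gradient_accumulation_steps=16,   # raise this to keep the effective batch size roughly constant
    gradient_checkpointing=True,      # trades extra compute for lower activation memory
    fp16=True,
    logging_steps=10,
)
```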
No errors. For now it looks like the data_collator is the problem: ChatGLM runs fine, but Baichuan with DataCollatorForLanguageModeling hangs. I'm using V100s.
Reducing the batch size didn't help; it should be a data loading issue.
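(One way to test whether the collator really is the culprit is to swap in a minimal hand-written collator and see whether training still hangs at the same step. This is a debugging sketch, not code from this repo; it assumes each dataset example carries an input_ids list and the tokenizer has a pad token.)

```python
import torch

def simple_causal_collator(features, tokenizer):
    """Pad each batch to its longest example and copy input_ids into labels,
    masking the padded positions with -100 so they are ignored by the loss."""
    max_len = max(len(f["input_ids"]) for f in features)
    pad_id = tokenizer.pad_token_id
    input_ids, labels, attention_mask = [], [], []
    for f in features:
        ids = list(f["input_ids"])
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)
        labels.append(ids + [-100] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "labels": torch.tensor(labels, dtype=torch.long),
        "attention_mask": torch.tensor(attention_mask, dtype=torch.long),
    }

# Usage with Trainer: data_collator=lambda feats: simple_causal_collator(feats, tokenizer)
# If the hang disappears, look more closely at how DataCollatorForLanguageModeling(mlm=False)
# interacts with this tokenizer and dataset.
```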
Hi, I ran into the same problem when training Baichuan: training hangs, but the GPU memory stays occupied.
Looking at GPU utilization, the GPUs hit 100% one after another. Has anyone else seen this?