Cannot run on two GPUs #9
Comments
Multi-GPU training has to go through the distributed training mechanism, so the training launch command changes accordingly. For example, for a single node with four GPUs, see the sketch below.
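A minimal sketch of what such a launch could look like, assuming `torchrun` (torch.distributed) is used; the entry-point path `src/ft_chatglm_lora/main.py` and the trailing arguments are placeholders for whatever `./src/ft_chatglm_lora/train.sh` already invokes, not the repo's exact command:

```bash
# Hypothetical single-node, four-GPU launch via torchrun (torch.distributed).
# The entry-point script and its arguments are placeholders; reuse whatever
# ./src/ft_chatglm_lora/train.sh already passes to python.
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=4 \
    src/ft_chatglm_lora/main.py  # ...followed by the original training arguments
```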
I re-ran the baseline code with the command above, but hit a new problem: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2937046) of binary: /root/dataln/anaconda3/envs/ccks/bin/python
This looks like a fairly low-level and widespread problem; see "mlp.c_proj.lora_B.default.weight has been marked as ready twice". It may well be that AdaLoRA currently cannot be trained in distributed mode at all. Some possible workarounds are discussed in "Error with Multi-GPU peft Reward Training"; disabling gradient checkpointing reportedly resolves it (see the sketch below).
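If the training script exposes HuggingFace `TrainingArguments` on the command line (an assumption about this repo, not something confirmed here), gradient checkpointing can be switched off from the same launch command, for example:

```bash
# Hypothetical launch with gradient checkpointing disabled. If train.sh
# currently passes --gradient_checkpointing (or the code calls
# model.gradient_checkpointing_enable()), dropping that flag/call works too.
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=4 \
    src/ft_chatglm_lora/main.py \
    --gradient_checkpointing False  # ...plus the original training arguments
```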
I ran ./src/ft_chatglm_lora/train.sh and changed CUDA_VISIBLE_DEVICES=0,1,2,3, but it fails with RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! How can I run the code on multiple GPUs?