
Cannot run on two GPUs #9

Closed
user2311717757 opened this issue Jun 16, 2023 · 5 comments

Comments

@user2311717757

I ran ./src/ft_chatglm_lora/train.sh and changed CUDA_VISIBLE_DEVICES=0,1,2,3, but got the error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! How can I run the code on multiple GPUs?

@michael-wzhu
Owner

Multi-GPU training has to go through the distributed training mechanism, so the launch command changes accordingly. For example, for a single node with four GPUs:
torchrun \
  --nnodes 1 \
  --nproc_per_node 4 \
  --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12490 \
  your_python_script.py arguments
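For concreteness, here is a minimal sketch of what that could look like for this repo. The entry-point path (src/ft_chatglm_lora/main.py) and the TRAIN_ARGS placeholder are assumptions: copy the argument list from the existing single-GPU ./src/ft_chatglm_lora/train.sh.

```bash
# Sketch only: entry-point path and TRAIN_ARGS are assumptions, not the repo's
# verified layout. Reuse the arguments from the original train.sh here.
TRAIN_ARGS="..."   # argument list taken verbatim from train.sh

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --nnodes 1 \
    --nproc_per_node 4 \
    --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12490 \
    src/ft_chatglm_lora/main.py $TRAIN_ARGS
```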

@user2311717757
Author

Multi-GPU training has to go through the distributed training mechanism, so the launch command changes accordingly. For example, for a single node with four GPUs:
torchrun \
  --nnodes 1 \
  --nproc_per_node 4 \
  --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12490 \
  your_python_script.py arguments

I reran the baseline code with the command above, but a new problem appeared:
Parameter at index 223 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.
return inner_training_loop(
File "/mnt1/dataln0/nianke/PromptCBLUE/./src/ft_chatglm_lora/trainer.py", line 1909, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/mnt1/dataln0/nianke/PromptCBLUE/./src/ft_chatglm_lora/trainer.py", line 2670, in training_step
loss.backward()
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
torch.autograd.backward(
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/autograd/function.py", line 274, in apply
return user_fn(self, *args)
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 157, in backward
torch.autograd.backward(outputs_with_grad, args_with_grad)
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/autograd/init.py", line 200, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2937046) of binary: /root/dataln/anaconda3/envs/ccks/bin/python
Traceback (most recent call last):
File "/root/dataln/anaconda3/envs/ccks/bin/torchrun", line 8, in
sys.exit(main())
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/dataln/anaconda3/envs/ccks/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I tried this but still could not resolve it. Do you have any suggestions?

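As a side note, the RuntimeError above suggests one debugging step itself: setting TORCH_DISTRIBUTED_DEBUG makes DDP print the name of the parameter that was marked ready twice. A hedged sketch, reusing the same hypothetical launch command as above:

```bash
# DETAIL (or INFO) makes DDP report the offending parameter names before the
# RuntimeError is raised. Entry-point path and $TRAIN_ARGS are the same
# assumptions as in the launch sketch above.
TORCH_DISTRIBUTED_DEBUG=DETAIL \
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --nnodes 1 --nproc_per_node 4 \
    --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12490 \
    src/ft_chatglm_lora/main.py $TRAIN_ARGS
```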

michael-wzhu closed this as not planned on Jun 25, 2023
michael-wzhu reopened this on Jun 25, 2023

@lingbai-kong

This looks like a fairly low-level and widespread problem; see "mlp.c_proj.lora_B.default.weight has been marked as ready twice". It may well be that adalora cannot be trained in a distributed setting at the moment. Some possible workarounds are discussed in "Error with Multi-GPU peft Reward Training"; it seems that turning off gradient checkpointing can resolve it.
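For reference, a hedged sketch of what turning off gradient checkpointing could look like at launch time. This assumes the training script parses the standard transformers TrainingArguments via HfArgumentParser, so that a --gradient_checkpointing flag exists; whether this repo's main.py does so is an assumption. If the code instead calls model.gradient_checkpointing_enable() unconditionally in Python, that call has to be removed in the source and no launch flag will override it.

```bash
# Sketch under the assumption that main.py exposes transformers'
# TrainingArguments on the command line; $TRAIN_ARGS as in the launch above.
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
    --nnodes 1 --nproc_per_node 4 \
    --rdzv_id=1 --rdzv_backend=c10d --rdzv_endpoint=localhost:12490 \
    src/ft_chatglm_lora/main.py $TRAIN_ARGS \
    --gradient_checkpointing false   # avoid the reentrant backward that DDP rejects
```

The connection to the error: reentrant activation checkpointing re-runs the forward pass during backward, so DDP's autograd hooks can fire twice for the same (LoRA) parameters, which is exactly reason 2) listed in the RuntimeError above.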
