A ProcessGroupNCCL timeout occurs in accelerate during single-machine multi-GPU training.

Training command currently used:
torchrun --nproc_per_node 8 -m hcpdiff.train_colo --cfg cfgs/train/examples/fine-tuning.yaml

Training config:
base: [cfgs/train/train_base.yaml, cfgs/train/tuning_base.yaml]

Error output:
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801455 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801455 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102175 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102176 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102177 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102178 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102179 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102180 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102181 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 102174) of binary: /home/jovyan/miniconda3/envs/sd-webui2/bin/python
Traceback (most recent call last):
File "/home/jovyan/miniconda3/envs/sd-webui2/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
hcpdiff.train_colo FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
=======================================================
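The Timeout(ms)=1800000 in the watchdog message is the default 30-minute NCCL collective timeout, which lines up with the roughly half-hour hang described below. For reference, a minimal sketch of how that timeout can be raised through accelerate's InitProcessGroupKwargs; this assumes a script that constructs its own Accelerator, and hcpdiff's trainer may wire this up differently:

# Sketch only: raise the NCCL collective timeout via accelerate.
# Assumption: the training script builds its own Accelerator;
# hcpdiff.train_colo / train_ac may configure this elsewhere.
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Default is timedelta(seconds=1800), i.e. the 1800000 ms the watchdog reports.
ipg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[ipg_kwargs])

Raising the timeout only helps if the collective eventually completes; if one rank is genuinely stuck, the underlying cause still has to be found (see the standalone check after the training log).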
Additional details:
The error is raised after the training run hangs mid-training for roughly half an hour.

Training log:
2023-04-21 17:46:45.897 | INFO | hcpdiff.train_ac:init:59 - world_size: 8
2023-04-21 17:46:45.897 | INFO | hcpdiff.train_ac:init:60 - accumulation: 1
2023-04-21 17:49:29.190 | INFO | hcpdiff.train_ac:build_data:265 - len(train_dataset): 352
2023-04-21 17:54:20.501 | INFO | hcpdiff.train_ac:train:338 - ***** Running training *****
2023-04-21 17:54:20.502 | INFO | hcpdiff.train_ac:train:339 - Num batches each epoch = 11
2023-04-21 17:54:20.504 | INFO | hcpdiff.train_ac:train:340 - Num Steps = 600
2023-04-21 17:54:20.504 | INFO | hcpdiff.train_ac:train:341 - Instantaneous batch size per device = 4
2023-04-21 17:54:20.505 | INFO | hcpdiff.train_ac:train:342 - Total train batch size (w. parallel, distributed & accumulation) = 32
2023-04-21 17:54:20.506 | INFO | hcpdiff.train_ac:train:343 - Gradient Accumulation steps = 1
2023-04-21 17:54:51.571 | INFO | hcpdiff.train_ac:train:363 - Step [20/600], LR_model: 1.28e-05, LR_word: 0.00e+00, Loss: 0.14078
2023-04-21 17:55:14.095 | INFO | hcpdiff.train_ac:train:363 - Step [40/600], LR_model: 2.56e-05, LR_word: 0.00e+00, Loss: 0.10552
2023-04-21 17:55:35.954 | INFO | hcpdiff.train_ac:train:363 - Step [60/600], LR_model: 3.20e-05, LR_word: 0.00e+00, Loss: 0.12115
2023-04-21 17:55:57.702 | INFO | hcpdiff.train_ac:train:363 - Step [80/600], LR_model: 3.20e-05, LR_word: 0.00e+00, Loss: 0.14846
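The log shows the run progressing normally through step 80 before the hang, so as a purely diagnostic suggestion (not part of the original report) it can help to check whether NCCL all-reduce across the same 8 GPUs completes outside of hcpdiff. A standalone sketch, launched with the same torchrun invocation; the file name nccl_smoke_test.py is hypothetical:

# nccl_smoke_test.py -- hypothetical standalone check, e.g.:
#   torchrun --nproc_per_node 8 nccl_smoke_test.py
# If this also hangs, the problem is likely in the NCCL / driver / interconnect
# setup rather than in the training code itself.
import os

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # torchrun supplies rank/world size via env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    world_size = dist.get_world_size()
    for _ in range(50):
        # Fresh tensor each iteration so the values stay small and checkable.
        x = torch.full((1 << 20,), float(dist.get_rank()), device="cuda")
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    if local_rank == 0:
        expected = sum(range(world_size))
        print(f"all_reduce completed, expected {expected}, got {x[0].item():.0f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()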