Bug report: [Rank 0] Watchdog caught collective operation timeout #9

Open

jiangds2018 opened this issue Apr 21, 2023 · 0 comments

jiangds2018 commented Apr 21, 2023

An accelerate ProcessGroupNCCL timeout occurs during single-node multi-GPU training.

Training command currently in use:
torchrun --nproc_per_node 8 -m hcpdiff.train_colo --cfg cfgs/train/examples/fine-tuning.yaml

Training configuration:
      base: [cfgs/train/train_base.yaml, cfgs/train/tuning_base.yaml]

      unet:
        -
          lr: 1e-6
          layers:
            - '' # fine-tuning all layers in unet

      ## fine-tuning text-encoder
      text_encoder:
        - lr: 1e-6
          layers:
            - ''

      tokenizer_pt:
        train: null

      train:
        gradient_accumulation_steps: 1
        save_step: 100

        scheduler:
          name: 'constant_with_warmup'
          num_warmup_steps: 50
          num_training_steps: 600

      model:
        pretrained_model_name_or_path: '/home/jovyan/data-vol-polefs-1/sd-webui/model/stable-diffusion-v1-5'
        tokenizer_repeats: 1
        ema_unet: 0
        ema_text_encoder: 0

      data:
        batch_size: 4
        prompt_template: 'prompt_tuning_template/object.txt'
        caption_file: null
        cache_latents: True
        tag_transforms:
          transforms:
            - _target_: hcpdiff.utils.caption_tools.TagShuffle
            - _target_: hcpdiff.utils.caption_tools.TagDropout
              p: 0.1
            - _target_: hcpdiff.utils.caption_tools.TemplateFill
              word_names: {}
        bucket:
          _target_: hcpdiff.data.bucket.RatioBucket.from_files # aspect ratio bucket
          img_root: '/home/jovyan/data-vol-polefs-1/sd-webui/Data/GOODONES/'
          target_area: {_target_: "builtins.eval", _args_: ['512*512']}
          num_bucket: 10

      data_class: null
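For context on the `target_area` entry: HCP-Diffusion configs appear to use Hydra-style `_target_` instantiation, so `{_target_: "builtins.eval", _args_: ['512*512']}` just evaluates to the integer 262144 at config-load time. A minimal sketch of that mechanism, assuming `hydra-core` is installed (the dict below is illustrative, not copied from the repo):

```python
from hydra.utils import instantiate

# Same pattern as target_area above: _target_ names a callable,
# _args_ supplies its positional arguments.
# builtins.eval("512*512") -> 262144
node = {"_target_": "builtins.eval", "_args_": ["512*512"]}
target_area = instantiate(node)
print(target_area)  # 262144
```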

Error output:

      [E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801455 milliseconds before timing out.
      [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
      [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
      terminate called after throwing an instance of 'std::runtime_error'
        what():  [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=106, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801455 milliseconds before timing out.
      WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102175 closing signal SIGTERM
      WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102176 closing signal SIGTERM
      WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102177 closing signal SIGTERM
      WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102178 closing signal SIGTERM
      WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102179 closing signal SIGTERM
      WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102180 closing signal SIGTERM
      WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 102181 closing signal SIGTERM
      ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 102174) of binary: /home/jovyan/miniconda3/envs/sd-webui2/bin/python
      Traceback (most recent call last):
        File "/home/jovyan/miniconda3/envs/sd-webui2/bin/torchrun", line 8, in <module>
          sys.exit(main())
        File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
          return f(*args, **kwargs)
        File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
          run(args)
        File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
          elastic_launch(
        File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
        File "/home/jovyan/miniconda3/envs/sd-webui2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
      =======================================================
      hcpdiff.train_colo FAILED
      -------------------------------------------------------
      Failures:
        <NO_OTHER_FAILURES>
      -------------------------------------------------------
      =======================================================
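A common first diagnostic step for this class of failure is to rerun with NCCL and torch.distributed debug output enabled, so the rank and collective that stall become visible. A hedged sketch using standard PyTorch/NCCL environment variables; they must be set before the process group is created, e.g. at the top of the training entry point or in the shell that launches torchrun:

```python
import os

# Enable verbose NCCL logging and richer torch.distributed diagnostics.
# Must take effect before torch.distributed.init_process_group() runs.
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # per-rank NCCL trace
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # log collective mismatches
```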

Additional details:
The error is raised after training hangs mid-run for roughly half an hour.
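That half-hour hang matches the default 30-minute collective timeout visible in the error (`Timeout(ms)=1800000`). If the stall is a legitimately slow step rather than a deadlock, the timeout can be raised through accelerate; a minimal sketch, assuming the trainer constructs an `Accelerator` (variable names here are illustrative):

```python
from datetime import timedelta

from accelerate import Accelerator, InitProcessGroupKwargs

# Raise the NCCL collective timeout from the 30-minute default to 2 hours.
# This only helps if a rank is slow; a permanently deadlocked rank will
# still time out, just later.
pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(hours=2))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])
```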

Training log:
2023-04-21 17:46:45.897 | INFO | hcpdiff.train_ac:__init__:59 - world_size: 8
2023-04-21 17:46:45.897 | INFO | hcpdiff.train_ac:__init__:60 - accumulation: 1
2023-04-21 17:49:29.190 | INFO | hcpdiff.train_ac:build_data:265 - len(train_dataset): 352
2023-04-21 17:54:20.501 | INFO | hcpdiff.train_ac:train:338 - ***** Running training *****
2023-04-21 17:54:20.502 | INFO | hcpdiff.train_ac:train:339 - Num batches each epoch = 11
2023-04-21 17:54:20.504 | INFO | hcpdiff.train_ac:train:340 - Num Steps = 600
2023-04-21 17:54:20.504 | INFO | hcpdiff.train_ac:train:341 - Instantaneous batch size per device = 4
2023-04-21 17:54:20.505 | INFO | hcpdiff.train_ac:train:342 - Total train batch size (w. parallel, distributed & accumulation) = 32
2023-04-21 17:54:20.506 | INFO | hcpdiff.train_ac:train:343 - Gradient Accumulation steps = 1
2023-04-21 17:54:51.571 | INFO | hcpdiff.train_ac:train:363 - Step [20/600], LR_model: 1.28e-05, LR_word: 0.00e+00, Loss: 0.14078
2023-04-21 17:55:14.095 | INFO | hcpdiff.train_ac:train:363 - Step [40/600], LR_model: 2.56e-05, LR_word: 0.00e+00, Loss: 0.10552
2023-04-21 17:55:35.954 | INFO | hcpdiff.train_ac:train:363 - Step [60/600], LR_model: 3.20e-05, LR_word: 0.00e+00, Loss: 0.12115
2023-04-21 17:55:57.702 | INFO | hcpdiff.train_ac:train:363 - Step [80/600], LR_model: 3.20e-05, LR_word: 0.00e+00, Loss: 0.14846
