
torch Distributed Data Parallel with ccl backend fails with torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu, but works with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu #53

Open
XinyuYe-Intel opened this issue Nov 9, 2023 · 0 comments

[screenshot: error traceback from the failing run]

I use the transformers Trainer to finetune an LLM with Distributed Data Parallel using the ccl backend. With torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu, training fails with the error shown in the screenshot above; with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu, the same setup works fine.

The script I used is https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/instruction/finetune_clm.py, and the command is:

mpirun --host 172.17.0.2,172.17.0.3 -n 2 -ppn 1 -genv OMP_NUM_THREADS=48 \
    python3 finetune_clm.py \
    --model_name_or_path mosaicml/mpt-7b-chat \
    --train_file alpaca_data.json \
    --bf16 False \
    --output_dir ./mpt_peft_finetuned_model \
    --num_train_epochs 1 \
    --max_steps 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --peft lora \
    --group_by_length True \
    --dataset_concatenation \
    --do_train \
    --trust_remote_code True \
    --tokenizer_name "EleutherAI/gpt-neox-20b" \
    --use_fast_tokenizer True \
    --max_eval_samples 64 \
    --no_cuda \
    --ddp_backend ccl
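In case it helps with triage, below is a minimal sketch (not the Trainer path itself) that exercises DDP with the ccl backend directly. It assumes oneccl_bindings_for_pytorch is importable and that the script is launched under Intel MPI's mpirun, which exports PMI_RANK/PMI_SIZE; the tiny Linear model is purely illustrative. If this sketch also fails on 2.1.0+cpu but passes on 2.0.1+cpu, the regression is likely in the backend rather than in the Trainer integration.

```python
# Minimal DDP-with-ccl sketch (illustrative; not the finetune_clm.py code path).
import os

import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401  # registers the "ccl" backend
from torch.nn.parallel import DistributedDataParallel as DDP

# mpirun does not set MASTER_ADDR/MASTER_PORT; provide defaults for a local test.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# Intel MPI exports PMI_RANK/PMI_SIZE; fall back to RANK/WORLD_SIZE if present.
rank = int(os.environ.get("PMI_RANK", os.environ.get("RANK", "0")))
world_size = int(os.environ.get("PMI_SIZE", os.environ.get("WORLD_SIZE", "1")))

dist.init_process_group(backend="ccl", rank=rank, world_size=world_size)

model = torch.nn.Linear(16, 16)  # stand-in for the LLM
ddp_model = DDP(model)           # CPU DDP: no device_ids

x = torch.randn(4, 16)
loss = ddp_model(x).sum()
loss.backward()                  # triggers a ccl allreduce of gradients
print(f"rank {rank}/{world_size}: backward OK")

dist.destroy_process_group()
```

Run it the same way as the finetuning script, e.g. `mpirun --host 172.17.0.2,172.17.0.3 -n 2 -ppn 1 python3 ccl_ddp_check.py` (the file name is just an example).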

Can you help investigate this issue?
