torch Distributed Data Parallel with ccl backend fails with torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu, but works with torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu
#53
Open
XinyuYe-Intel opened this issue Nov 9, 2023 · 0 comments
I use the transformers Trainer to fine-tune an LLM with Distributed Data Parallel and the ccl backend. With torch 2.1.0+cpu and oneccl-bind-pt 2.1.0+cpu, it fails as shown in the image above. With torch 2.0.1+cpu and oneccl-bind-pt 2.0.0+cpu, it works well.
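In case it helps compare the two environments, here is a minimal sketch that prints the installed package versions. It assumes the pip distribution names are `torch`, `oneccl_bind_pt`, and `transformers`. Those names and the helper `installed_versions` are my own for illustration and are not taken from the failing script.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is absent from this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed pip distribution names; adjust if your environment differs.
    names = ["torch", "oneccl_bind_pt", "transformers"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once in the working environment and once in the failing one gives two short listings that can be pasted side by side.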
The script I used is https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/examples/finetuning/instruction/finetune_clm.py, and the command is:
Can you help investigate this issue?