No such file or directory: '.local_rank0_completed_autoresume' while doing multi-node pretraining using hf_causal_lm (LLAMA2) #504
Comments
Hi, are you on a shared file system perhaps? Wondering if the nodes are trying to write to and read from the same location.
Hi, I had this same issue, and for me the problem is indeed the shared file system. I changed the [...]
Thanks in advance!
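To make the shared-file-system hypothesis above concrete, here is a minimal sketch of how such a collision could arise. It is not the library's actual implementation and not the fix from #629: the helper names `signal_file_path` and `simulate`, and the `.node<rank>` suffix, are hypothetical and used only for illustration. The sketch assumes that local rank 0 on each node creates a marker file and later deletes it. If every node resolves the same path on a shared file system, the second delete fails with "No such file or directory".

```python
import os
import tempfile
from typing import List, Optional

MARKER = ".local_rank0_completed_autoresume"  # file name taken from the error message


def signal_file_path(run_dir: str, node_rank: Optional[int] = None) -> str:
    """Return the marker path; the per-node suffix is a hypothetical workaround."""
    name = MARKER if node_rank is None else f"{MARKER}.node{node_rank}"
    return os.path.join(run_dir, name)


def simulate(shared_dir: str, per_node_names: bool) -> List[str]:
    """Simulate two nodes that each create and then delete their marker file."""
    paths = [
        signal_file_path(shared_dir, rank if per_node_names else None)
        for rank in (0, 1)
    ]
    # Local rank 0 on each node writes its marker.
    for path in paths:
        with open(path, "w") as handle:
            handle.write("done")
    # Each node later cleans up its marker; collect any failures.
    errors = []
    for rank, path in enumerate(paths):
        try:
            os.remove(path)
        except FileNotFoundError as exc:
            errors.append(f"node {rank}: {exc.strerror}: {path!r}")
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as shared:
        print("same name on both nodes:", simulate(shared, per_node_names=False))
        print("per-node names:", simulate(shared, per_node_names=True))
```

With the same name on both nodes, the second node's cleanup fails because the first node already removed the file. With node-unique names, both cleanups succeed. Pointing each node at a node-local directory would avoid the collision in the same way.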
I hit the same issue. Did you find any other code that needs to be updated?
I ran a full fine-tuning and everything seems to work, though I'm not sure there aren't any hidden issues.
Thanks for flagging this. We're working to get a patch release out which will fix this, ideally within 1-2 days. We're still discussing the right solution, though, since it is a bit tricky for us to test this: we do not have shared filesystems in our clusters.
#629 should hopefully address the above issues.
Closing based on #629. Please reopen if you have issues!
Environment
To reproduce
YAML config:
Commands
On Node 0
On Node 1
Error Snippet
Getting the following error on the node with node_rank 1
Note: Single-node training works perfectly fine with the same YAML config.