Updated torchrun instructions #2096
Conversation
@TJ-Solergibert more SLURM resources are something we plan on looking at very soon, so if you have some to offer, that would be phenomenal!
Thanks for the improvement! I think somewhere I'd still like one example of using the module itself (e.g. `python -m`), as we have users that want to call it in this fashion. (We have it documented that you can do `accelerate launch -m`, so showing an example here of what this means we're doing would be good, if we don't already have this somewhere else.)
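For illustration, a quick sketch of the two launch styles being discussed; the script and module names are placeholders, not files from this PR:

```bash
# Launch a script by path
accelerate launch train.py --batch_size 32

# Launch the same code as a module, mirroring `python -m`
accelerate launch -m my_package.train --batch_size 32
```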
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for updating the docs with regard to calling `torchrun` directly. There are a few small issues, the rest LGTM.

> I'd still like one example of using the module itself (e.g. `python -m`) as we have users that want to do so/call it in this fashion.

I just checked, `python -m torchrun` does not actually work (anymore?), I had to call `python -m torch.distributed.run`. It would probably be sufficient to mention once that the two calls are equivalent.
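For reference, the equivalence described above looks like this (script name and arguments are placeholders):

```bash
# These two invocations are equivalent entry points to the elastic launcher
torchrun --nproc_per_node 2 train.py
python -m torch.distributed.run --nproc_per_node 2 train.py

# Note: `python -m torchrun` fails, since torchrun is a console script
# rather than an importable module
```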
Co-authored-by: Benjamin Bossan <[email protected]>
@TJ-Solergibert
Thanks! I think we're almost ready to put this in now! However, since this will live in accelerate, let's please use `accelerate launch` for the example scripts :)

(And ideally let's have this under `examples/slurm` rather than `examples/Slurm`, to keep in line with the naming of files. Ideally using `submit_multigpu.sh` and `submit_multinode.sh` instead as well. Easier to tab-complete etc.)
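For illustration only, a minimal sketch of what such a `submit_multinode.sh` could look like; this is not the PR's actual script, and the SBATCH values, GPU count, and training script name are placeholders, while the `accelerate launch` flags are the standard multi-node ones:

```bash
#!/bin/bash
#SBATCH --job-name=accelerate-multinode
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one launcher per node; accelerate spawns one process per GPU
#SBATCH --gres=gpu:4

# First node in the allocation acts as the main process host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
export GPUS_PER_NODE=4

# $SLURM_NODEID is escaped so it is resolved on each node, not at submit time
srun bash -c "accelerate launch \
    --num_processes $((SLURM_NNODES * GPUS_PER_NODE)) \
    --num_machines $SLURM_NNODES \
    --machine_rank \$SLURM_NODEID \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    your_training_script.py"
```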
Co-authored-by: Zach Mueller <[email protected]>
Done! Thank you for your comments; I hope this little update will be helpful!
This looks great to me! Really appreciate you iterating with us over it; it will be a fantastic addition 🔥

A good follow-up, if you'd like, is to add it to the example zoo doc (so it's in our documentation too); however, I can do this for you :)
Apart from Zach's comments, this LGTM, thanks. I have no experience with SLURM, so I can't comment on the correctness of those instructions.
@muellerzr @BenjaminBossan Before merging the SLURM scripts I will run them to be sure that they work. I will do it in the next few days; last week we had a breakdown in my facilities and the system is now very busy. I will add them to the model zoo!
Thank you @TJ-Solergibert for updating the docs wrt `torchrun` and `slurm` support 🚀! Left a suggestion.
@TJ-Solergibert how did your tests go? Successful/good to merge? 🤗 (Once Sourab's comments have been addressed)
These weeks I have been quite busy and I did not want to merge without first checking that the scripts work correctly. I still have to finish reviewing the multi-node training case; I hope that this weekend I will be able to do the final testing. At the same time, I have identified an error when computing the evaluation with several nodes in a setup with a parallel filesystem. The error appears in a call to an
Can you describe the fix here? Then I can upstream it internally and get back to you on where it's best to put in a fix.
Co-authored-by: Zach Mueller <[email protected]>
Hi @muellerzr! I've finally figured out what wasn't working in the multinode script:

Toni
For the multinode evaluation problem: in short, the `UnixFileLock` class needs to switch from `fcntl.flock` to `fcntl.lockf`:

```diff
 class UnixFileLock(BaseFileLock):
     """
     Uses the :func:`fcntl.flock` to hard lock the lock file on unix systems.
     """

     def __init__(self, lock_file, timeout=-1, max_filename_length=None):
         max_filename_length = os.statvfs(os.path.dirname(lock_file)).f_namemax
         super().__init__(lock_file, timeout=timeout, max_filename_length=max_filename_length)

     def _acquire(self):
         open_mode = os.O_RDWR | os.O_CREAT | os.O_TRUNC
         fd = os.open(self._lock_file, open_mode)
         try:
-            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
+            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
         except OSError:
             os.close(fd)
         else:
             self._lock_file_fd = fd
         return None

     def _release(self):
         # Do not remove the lockfile:
         #
         # https://github.com/benediktschmitt/py-filelock/issues/31
         # https://stackoverflow.com/questions/17708885/flock-removing-locked-file-without-race-condition
         fd = self._lock_file_fd
         self._lock_file_fd = None
-        fcntl.flock(fd, fcntl.LOCK_UN)
+        fcntl.lockf(fd, fcntl.LOCK_UN)
         os.close(fd)
         return None
```

Applying this patch means that a single process can no longer detect its own lock. To address that, it is also necessary to modify the `_check_all_processes_locks` method:

```diff
 def _check_all_processes_locks(self):
     expected_lock_file_names = [
         os.path.join(self.data_dir, f"{self.experiment_id}-{self.num_process}-{process_id}.arrow.lock")
         for process_id in range(self.num_process)
     ]
-    for expected_lock_file_name in expected_lock_file_names:
+    for expected_lock_file_name in expected_lock_file_names[1:]:
         nofilelock = FileFreeLock(expected_lock_file_name)
         try:
             nofilelock.acquire(timeout=self.timeout)
         except Timeout:
             raise ValueError(
                 f"Expected to find locked file {expected_lock_file_name} from process {self.process_id} but it doesn't exist."
             ) from None
         else:
             nofilelock.release()
```

In this line the `FileLock` class is defined. We could create a new class for parallel filesystems and select it via an environment variable? The same for the

PS: Check point 3 of @panserbjorn's comment. I didn't find any documentation related to this behavior, but he is completely right!
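As a sketch of the environment-variable idea only: the variable name and helper functions below are hypothetical, not part of `filelock`, `datasets`, or `evaluate`; they just show how a lockf-based behavior could be selected at runtime, following the patch above:

```python
import fcntl
import os

# Hypothetical opt-in switch; the variable name is made up for illustration.
# Per the patch above, fcntl.lockf (POSIX record locks) coordinates across
# nodes on a parallel filesystem where fcntl.flock did not.
_USE_LOCKF = os.environ.get("HF_PARALLEL_FS_LOCK", "0") == "1"
_lock_fn = fcntl.lockf if _USE_LOCKF else fcntl.flock


def acquire_lock(path: str) -> int:
    """Open `path` and take an exclusive, non-blocking lock; return the fd."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC)
    try:
        _lock_fn(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        raise
    return fd


def release_lock(fd: int) -> None:
    """Release the lock and close the descriptor (the lock file is kept on disk)."""
    _lock_fn(fd, fcntl.LOCK_UN)
    os.close(fd)
```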
This reverts commit c3bef5c.
I'd probably include more detail in your open issue here, as that would be something that'd have to be handled by evaluate: huggingface/evaluate#481
@TJ-Solergibert aside from the issue, is it okay to merge this guide?
For sure! I'm working on a project distributing training across several nodes with transformers, datasets, accelerate and slurm. When I finish, it could be interesting for the zoo examples!
Hi, thank you for your excellent work on implementing Accelerate with Slurm! I encountered some problems when using the multi-node script.

I fixed this error by passing `--machine_rank` to `accelerate launch`.

I am wondering whether this error may be related to pytorch/pytorch#65992 (comment). It says that `--master_port` and `--master_addr` are only relevant when `--rdzv_backend=static`.
A new issue would be good. Please ensure that you've tested with the latest accelerate before you do, though!
Hi @wormyu, exactly! It's necessary to add the `--machine_rank` flag as you indicate. I noticed it a few days after the merge, but I'm glad to see that someone has checked it. Feel free to update it! Regarding what you mentioned about the port: torchrun recommends using the c10d backend. I think, as the error indicates, it has to do with the network configuration of your infrastructure. I recommend that you consult the support team of your machine; they can tell you which ports you can use. In my case, each user could use the port of their
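For reference, the rendezvous distinction discussed here looks roughly like this (hostname, port, job id, and script name are placeholders):

```bash
# Static rendezvous (the default): --master_addr/--master_port are honoured
torchrun --nnodes 2 --nproc_per_node 4 --node_rank 0 \
    --master_addr node001 --master_port 29500 train.py

# c10d rendezvous (recommended by torchrun): pass the endpoint instead;
# per the linked PyTorch issue, --master_addr/--master_port are only
# relevant with the static backend
torchrun --nnodes 2 --nproc_per_node 4 \
    --rdzv_backend c10d --rdzv_endpoint node001:29500 --rdzv_id my_job train.py
```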
Hi guys, thank you so much for your quick response. I just created a new issue #2246 to describe my problem in detail. I have tried
What does this PR do?
Fixes #2095
I have been working with Accelerate for a couple of weeks, and one of the first resources I used to test the environment was these scripts. Personally, I have been using torchrun and realized that the documentation was incorrect. Now I am preparing a small guide for working with Accelerate on a cluster with Slurm. If it would be beneficial for the project, I could also add it!
@muellerzr
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.