Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated torchrun instructions #2096

Merged
merged 21 commits into from
Nov 20, 2023

Conversation

TJ-Solergibert
Copy link
Contributor

What does this PR do?

Fixes #2095

I have been working with Accelerate for a couple of weeks, and one of the first resources I used to test the environment were these scripts. Personally, I have been using torchrun and realized that the documentation was incorrect. Now I am preparing a small guide to work with Accelerate on a cluster with Slurm. If it would be beneficial for the project, I could also add it!

@muellerzr

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerzr
Copy link
Collaborator

@TJ-Solergibert more SLURM resources is something we plan on looking at very soon, so if you have some to offer that would be phenomenal!

Copy link
Collaborator

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvement! I think somewhere I’d still like one example of using the module itself (e.g. python -m) as we have users that want to do so/call it in this fashion. (We have it documented you can do accelerate launch -m so showing it here as an example of “what this means we’re doing would be good if we don’t have this already somewhere else)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

Copy link
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating the docs with regard to calling torchrun directly. There are a few small issues, the rest LGTM.

I’d still like one example of using the module itself (e.g. python -m) as we have users that want to do so/call it in this fashion.

I just checked, python -m torchrun does not actually work (anymore?), I had to call python -m torch.distributed.run. It would probably be sufficient to mention once that the two calls are equivalent.

examples/README.md Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
TJ-Solergibert and others added 4 commits October 27, 2023 14:49
Co-authored-by: Benjamin Bossan <[email protected]>
Co-authored-by: Benjamin Bossan <[email protected]>
Co-authored-by: Benjamin Bossan <[email protected]>
Co-authored-by: Benjamin Bossan <[email protected]>
@TJ-Solergibert
Copy link
Contributor Author

I just checked, python -m torchrun does not actually work (anymore?), I had to call python -m torch.distributed.run. It would probably be sufficient to mention once that the two calls are equivalent.

python -m torch.distributed.launch is deprecated and will be removed in future. PyTorch suggest moving to torchrun, so I think it is more convenient to delete all references to torch.distributed.launch so as not to create confusion. I wouldn't spend too much time duplicating the documentation for both cases either, as even though the behavior of both launchers is similar, the flags are different.

@muellerzr
Copy link
Collaborator

@TJ-Solergibert launch is, but not run. Let's please include calling run via a module please :) (And I know it's not deprecated because it's what we use internally under the hood and call it as a module. We did switch from launch however)

Copy link
Collaborator

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I think we're almost ready to put this in now! However, since this will live in accelerate let's please use accelerate launch for the example scripts :)

(And ideally let's have this under examples/slurm rather than examples/Slurm, to keep in practice with the namings of files. Ideally using submit_multigpu.sh and submit_multinode.sh instead as well. Easier to tab-complete etc)

examples/Slurm/submit-multiGPU.sh Outdated Show resolved Hide resolved
examples/Slurm/submit-multinode.sh Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
examples/README.md Outdated Show resolved Hide resolved
@TJ-Solergibert
Copy link
Contributor Author

Done! Thank you for your comments, I hope this little update will be helpful!

Copy link
Collaborator

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me! Really appreciate you iterating with us over it, it will be a fantastic addition 🔥

A good follow up if you'd like is to add it to the example zoo doc (so it's in our documentation too), however I can do this for you :)

examples/README.md Outdated Show resolved Hide resolved
Copy link
Member

@BenjaminBossan BenjaminBossan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from Zach's comments, this LGTM, thanks. I have no experience with SLURM, so I can't comment on the correctness of those instructions.

@TJ-Solergibert
Copy link
Contributor Author

TJ-Solergibert commented Nov 6, 2023

@muellerzr @BenjaminBossan Before merging the SLURM scripts I will run them to be sure that they work. I will do it in the next few days, as we had last week a breakdown in my facilities and now the system is very busy. I will add them to the model zoo!

Copy link
Contributor

@pacman100 pacman100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @TJ-Solergibert for updating the docs wrt torchrun and slurm support 🚀! Left a suggestion.

examples/slurm/submit_multinode.sh Outdated Show resolved Hide resolved
@muellerzr
Copy link
Collaborator

@TJ-Solergibert how did your tests go? Successful/good to merge? 🤗 (Once Sourab's comments have been addressed)

@TJ-Solergibert
Copy link
Contributor Author

@TJ-Solergibert how did your tests go? Successful/good to merge? 🤗 (Once Sourab's comments have been addressed)

These weeks I have been quite busy and I did not want to merge the repository without first checking the correct functioning of the scripts. I have to finish reviewing the multi-node training case, I hope that this weekend I will be able to do the final testing.

At the same time, I have identified an error when computing the evaluation with several nodes in a setup with a parallel filesystem. The error appears in a call to an evaluate function but to correct it you have to modify a datasets class too. Where should I open the PR?

@muellerzr
Copy link
Collaborator

muellerzr commented Nov 15, 2023

Can you describe the fix here and I can upstream it internally to get back with you on where it’s best to put in a fix?

@TJ-Solergibert
Copy link
Contributor Author

TJ-Solergibert commented Nov 15, 2023

Hi @muellerzr ! I've finally figured out what wasn't working in the multinode script:

  • I've noticed that both the Python scripts and accelerate launch can't handle multiline command arguments. Thats why I've changed in the scripts the parser to parse_known_args so it skips the \ but now I'm not passing all the tests. I don't know if that change could be possible or not, but in the worst case it is as easy as removing multiline command arguments in the slurm scripts (We lose readability). I also clarify this issue in examples/slurm/submit_multinode.sh.
    Edit: Back to parse_args.
  • I added the main_process_port in the multinode training script and readme as if it is not specified it is passed to torch as None. Maybe this issue is fixed in current versions of accelerate and pytorch, but in our cluster we have versions 0.20.0 and 1.13 respectively, and are not able to update to newer versions (We will be able shortly with the new cluster)

Toni

@TJ-Solergibert
Copy link
Contributor Author

TJ-Solergibert commented Nov 15, 2023

For the multinode evaluation problem:
First of all, credit to @panserbjorn for this comment!

In short words, the datasets filelocks work with flock, but this way nodes can't get locks of other nodes in parallel filesystems. Fixing that issue is as simple as changing from flock to lockf.

class UnixFileLock(BaseFileLock):
    """
    Uses the :func:`fcntl.flock` to hard lock the lock file on unix systems.
    """

    def __init__(self, lock_file, timeout=-1, max_filename_length=None):
        max_filename_length = os.statvfs(os.path.dirname(lock_file)).f_namemax
        super().__init__(lock_file, timeout=timeout, max_filename_length=max_filename_length)

    def _acquire(self):
        open_mode = os.O_RDWR | os.O_CREAT | os.O_TRUNC
        fd = os.open(self._lock_file, open_mode)

        try:
-           fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
+           fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
        else:
            self._lock_file_fd = fd
        return None

    def _release(self):
        # Do not remove the lockfile:
        #
        #   https://github.com/benediktschmitt/py-filelock/issues/31
        #   https://stackoverflow.com/questions/17708885/flock-removing-locked-file-without-race-condition
        fd = self._lock_file_fd
        self._lock_file_fd = None
-       fcntl.flock(fd, fcntl.LOCK_UN)
+       fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)
        return None

Applying this patch makes that a single process couldn't detect its own lock. To address that it is necessary to modify the _check_all_processes_locks method of the EvaluationModule class in evaluate/module.py.

def _check_all_processes_locks(self):
    expected_lock_file_names = [
        os.path.join(self.data_dir, f"{self.experiment_id}-{self.num_process}-{process_id}.arrow.lock")
        for process_id in range(self.num_process)
    ]
-   for expected_lock_file_name in expected_lock_file_names:
+   for expected_lock_file_name in expected_lock_file_names[1:]:
        nofilelock = FileFreeLock(expected_lock_file_name)
        try:
            nofilelock.acquire(timeout=self.timeout)
        except Timeout:
            raise ValueError(
                f"Expected to find locked file {expected_lock_file_name} from process {self.process_id} but it doesn't exist."
            ) from None
        else:
            nofilelock.release()

In this line the FileLock class is defined. We could create a new class for parallel filesystems and select it via an environment variable? The same for the _check_all_processes_locks method.

PD: Check point 3 of @panserbjorn comment, I didn't find any documentation related to this behavior but he is completely right!

@muellerzr
Copy link
Collaborator

I'd probably include more detail in your open issue here, as that would be something that'd have to be handled by evaluate: huggingface/evaluate#481

@muellerzr
Copy link
Collaborator

@TJ-Solergibert aside from the issue is it okay to merge this guide?

@TJ-Solergibert
Copy link
Contributor Author

For sure! I'm working on a project of distributing training across several nodes with transformers, datasets, accelerate and slurm. When I finish, it could be interesting for the zoo examples!

@muellerzr muellerzr merged commit 427ef8b into huggingface:main Nov 20, 2023
24 checks passed
@wormyu
Copy link

wormyu commented Dec 12, 2023

Hi,

Thank you for your excellent work on implementing Accelerate with Slurm! I encountered some problems when using the script submit_multinode.sh. My first question is when testing in accelerate==0.21.0, running the script will give the following error:

  File "/home/hlv8980/.conda/envs/outlier/lib/python3.9/site-packages/accelerate/utils/launch.py", line 114, in prepare_multi_gpu_env
    setattr(args, "node_rank", int(args.machine_rank))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

I fixed this error by passing --machine_rank $SLURM_PROCID to accelerate launch. But other errors still exist. No matter which port I set, even randomly by $(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4)), it always reports the port already used by other process.

ConnectionError: Tried to launch distributed communication on port `11986`, but another process is utilizing it. 
Please specify a different port (such as using the `----main_process_port` flag or specifying a different `main_process_port` in 
your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.

I am wondering this error may related to pytorch/pytorch#65992 (comment). It says that —master_port and —master_addr are only relevant when —rdzv_backend=static.
Should I open a new issue for this, or would you prefer to continue the discussion here? Maybe I can open a new issue and provide the detailed error log and NCCL INFO log, and the methods I have tried to solve the problem.

@muellerzr
Copy link
Collaborator

A new issue would be good. Ensure that you’ve tested with the latest accelerate though please before you do!

@TJ-Solergibert
Copy link
Contributor Author

Hi @wormyu,

Exactly! It's necessary to add the --machine_rank flag as you indicate. I noticed a few days after the merge, but I'm glad to see that someone has checked it. Feel free to update it!

Regarding what you mentioned about the port: torchrun recommends using the c10d backend. I think, as the error indicates, it has to do with the network configuration of your infrastructure. I recommend that you consult with the support team of your machine and they can tell you which ports you can use. In my case, each user could use the port of their $UID. You can also try using torchrun instead of accelerate launch; check this example.

@wormyu
Copy link

wormyu commented Dec 13, 2023

Hi guys,

Thank you so much for your quick responce, I just create a new issue #2246 to detailly describe my problem, I have tried accelerate launch and torchrun but all fail. I think what @TJ-Solergibert said might me correct that the issue might related to the network configuration of our cluster. But still I want to make sure I launch the job correctly and let people see the NCCL log because I am not quit understand that. Thanks again for helping me !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Some notes on the examples [torchrun]
6 participants