fix signal_file_path to avoid race condition #1253

Closed · wants to merge 2 commits

Conversation

@ofivite commented Jun 5, 2024

When using Slurm with the --array option, it sometimes happens that multiple tasks start running almost simultaneously. Since the lock filename doesn't take this parallelism into account, the tasks end up creating/writing/deleting the same file, which creates a race condition. Appending SLURM_JOB_ID to the filename seems to fix it.
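For context, here is a minimal sketch of the idea behind the fix. The helper name and the base filename are illustrative rather than the actual llm-foundry code; only the use of dist.get_node_rank() and SLURM_JOB_ID comes from this discussion.

```python
import os

from composer.utils import dist


def build_signal_file_path() -> str:
    """Illustrative sketch: make the signal file name unique per Slurm job."""
    # Current behavior: the name is only unique per node rank, so two Slurm
    # array tasks sharing a filesystem can collide on the same file.
    base = f'.node_{dist.get_node_rank()}_local_rank0_completed'

    # Proposed change: also append SLURM_JOB_ID (when running under Slurm) so
    # that concurrent jobs from e.g. `sbatch --array=0-31` use distinct files.
    slurm_job_id = os.environ.get('SLURM_JOB_ID')
    if slurm_job_id is not None:
        base = f'{base}.{slurm_job_id}'
    return base
```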

@ofivite ofivite requested a review from a team as a code owner June 5, 2024 14:13
@mvpatel2000 (Collaborator) left a comment

@ofivite can you elaborate on why this is necessary? I'm not quite sure I understand how it has a race condition, since only rank 0 should be writing this file.

@mvpatel2000 mvpatel2000 requested review from irenedea and dakinggg June 5, 2024 14:22
@ofivite (Author) commented Jun 5, 2024

@ofivite can you elaborate on why this is necessary? I'm not quite sure I understand how it has a race condition, since only rank 0 should be writing this file.

Sure, it's not really ranks that write to the same file, but different Slurm jobs that happen to start running at the same time on the cluster, e.g. in our case via an array of tasks from sbatch --array=0-31.

@mvpatel2000 (Collaborator)

For a more general solution, should we instead append some random 6-digit SHA? We could then replace the Slurm ID and node rank, and it would be future-proof.

@ofivite (Author) commented Jun 5, 2024

In fact, the same situation apparently also happens in streaming when creating shared memory. I was just thinking of opening a similar fix PR there too :)

@ofivite (Author) commented Jun 5, 2024

For a more general solution, should we instead append some random 6-digit SHA? We could then replace the Slurm ID and node rank, and it would be future-proof.

Yep, I think it's better to generalise it that way!

@ofivite (Author) commented Jun 5, 2024

Although I'm not quite sure: will each Slurm task / rank have its own unique SHA in that case?

@mvpatel2000 (Collaborator)

Hm I guess this doesn't work since you need each node to have the same path...

I guess your original solution is probably best 🤔. @dakinggg any other ideas?

@dakinggg (Collaborator) commented Jun 5, 2024

@ofivite Would attaching the run name as the unique id resolve your issue? It's technically not foolproof, because nothing forces two runs to use different run names, but it's probably still an improvement. The issue with the Slurm ID is that it would only work for Slurm (although maybe this is also still an improvement).

@janEbert (Contributor) commented Jun 5, 2024

@dakinggg that would resolve the issue indeed. :)

I would still argue for including some other form of identifier in the name, so that other people do not randomly run into this issue in the future. There's still a potential race condition on systems that only initialize random seeds with the system time, but if we ignore that, the name could simply be broadcast from rank 0 to the connected processes using torch.distributed.

@janEbert (Contributor) commented Jun 5, 2024

Note that if the broadcasting version is implemented, the code would then use the same path across all nodes (dist.get_node_rank() would no longer be included in the path), which does not agree with the previous behavior.

@dakinggg (Collaborator) commented Jun 5, 2024

OK, I think the right solution is to implement a utility in Composer that generates a signal file name plus a random id that gets broadcast from rank 0. That way it will be unique per usage. Would be happy to accept a contribution for this!
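For illustration, here is a rough sketch of what such a utility could look like, assuming torch.distributed is already initialized. The function name and the exact file-name format are hypothetical; the actual utility added via mosaicml/composer#3396 may differ.

```python
import uuid

import torch.distributed as torch_dist

from composer.utils import dist


def get_signal_file_name(prefix: str = 'signal_file') -> str:
    """Hypothetical sketch: a signal file name with a random id shared by all ranks."""
    # Rank 0 draws a random id; every other rank receives it via broadcast, so
    # all ranks (and all nodes) of this run agree on the same unique suffix.
    obj_list = [uuid.uuid4().hex[:6] if dist.get_global_rank() == 0 else None]
    torch_dist.broadcast_object_list(obj_list, src=0)
    unique_id = obj_list[0]

    # Keep the node rank in the name so the per-node behavior of the current
    # code is preserved (see the earlier comment about dist.get_node_rank()).
    return f'.node_{dist.get_node_rank()}_{prefix}_{unique_id}'
```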

@dakinggg (Collaborator)

FYI, I started working on this here: mosaicml/composer#3396. I'm going to go ahead and close this PR; thanks for raising the issue!

@dakinggg (Collaborator)

Hey @janEbert , sorry for the delay. This should be fixed once mosaicml/composer#3485 and #1381 are merged.

@janEbert (Contributor)

Awesome, thanks all!
