Fix zero-1 bug for inferring local ranks #5936
Conversation
@YangFei1990 Can you provide some test cases to illustrate the problem, and how the fix addresses the issue? Sorry, but I don't quite follow your description.
Hi @alanwaketan, for the example in the description, say the user put the groups as
That makes a lot of sense. I guess it's probably hard to make a test case for it unless you did a lot of mocking. I will just approve it.
Co-authored-by: Fei <[email protected]>
Fix a bug in how the zero-1 optimizer infers local ranks. Before this PR, zero-1 computed

`self.local_rank = self.global_rank // len(self.sharding_groups)`

which can produce wrong results when the distribution strategy is more complicated. For example, with PP=4 and DP=8 on a single node, the DP groups will be [[0-7], [8-15], [16-23], [24-31]]. With the formula above, multiple ranks in the same DP group end up with the same zero-1 local rank. The fix is to use the existing sharding groups to infer the local rank, i.e. find the index of the current rank in the sharding group that contains it.
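The fix described above can be sketched roughly as follows. This is an illustrative standalone example, not the actual torch_xla zero-1 implementation; the function name and structure are hypothetical.

```python
def infer_local_rank(global_rank, sharding_groups):
    """Return the index of global_rank within the sharding group that
    contains it (the fixed behavior), rather than computing
    global_rank // len(sharding_groups) (the buggy behavior)."""
    for group in sharding_groups:
        if global_rank in group:
            return group.index(global_rank)
    raise ValueError(f"rank {global_rank} not found in any sharding group")


# Example from the PR description: PP=4, DP=8 on a single node.
# DP groups: [[0-7], [8-15], [16-23], [24-31]]
groups = [list(range(i * 8, (i + 1) * 8)) for i in range(4)]

# Buggy formula: global_rank // len(groups) gives ranks 8, 9, 10, 11
# (all in the same DP group) the identical local rank 2.
assert [r // len(groups) for r in (8, 9, 10, 11)] == [2, 2, 2, 2]

# Fixed inference: each rank gets its position within its own group.
assert [infer_local_rank(r, groups) for r in (8, 9, 10, 11)] == [0, 1, 2, 3]
```

With the fix, the local rank is simply the rank's position inside its sharding group, which stays correct no matter how the groups are laid out across pipeline and data-parallel dimensions.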