Fix zero-1 bug for inferring local ranks #5936
Conversation
@YangFei1990 Can you provide some test cases to illustrate the problem, and how the fix addresses the issue? Sorry, but I don't quite follow your description.
Hi @alanwaketan, for the example in the description, say the user put the groups as
That makes a lot of sense. I guess it's probably hard to make a test case for it unless you did a lot of mocking. I will just approve it.
Co-authored-by: Fei <[email protected]>
Fix a bug in how the zero-1 optimizer infers local ranks. Before this PR, zero-1 computed

`self.local_rank = self.global_rank // len(self.sharding_groups)`

which can produce wrong results when the distribution strategy is more complicated. For example, with PP=4 and DP=8 on a single node, the DP groups will be [[0-7], [8-15], [16-23], [24-31]]. With the formula above, multiple ranks in the same DP group end up with the same zero-1 local rank. The fix is to use the existing sharding groups to infer the local rank, i.e. find the index of the current rank in the sharding group that contains it.
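The fix described above can be sketched roughly as follows. This is an illustrative standalone example, not the actual torch_xla zero-1 implementation; the function name and structure are hypothetical.

```python
def infer_local_rank(global_rank, sharding_groups):
    """Return the index of global_rank within the sharding group that
    contains it (the fixed behavior), rather than computing
    global_rank // len(sharding_groups) (the buggy behavior)."""
    for group in sharding_groups:
        if global_rank in group:
            return group.index(global_rank)
    raise ValueError(f"rank {global_rank} not found in any sharding group")


# Example from the PR description: PP=4, DP=8 on a single node.
# DP groups: [[0-7], [8-15], [16-23], [24-31]]
groups = [list(range(i * 8, (i + 1) * 8)) for i in range(4)]

# Buggy formula: global_rank // len(groups) gives ranks 8, 9, 10, 11
# (all in the same DP group) the identical local rank 2.
assert [r // len(groups) for r in (8, 9, 10, 11)] == [2, 2, 2, 2]

# Fixed inference: each rank gets its position within its own group.
assert [infer_local_rank(r, groups) for r in (8, 9, 10, 11)] == [0, 1, 2, 3]
```

With the fix, the local rank is simply the rank's position inside its sharding group, which stays correct no matter how the groups are laid out across pipeline and data-parallel dimensions.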