[FSDPv2] Support MultiSlice #7044
Conversation
Force-pushed from 088da28 to 76c7435
LGTM, thanks Jiewen!
@@ -24,6 +24,8 @@ def _prepare_spmd_partition_spec(param):
  # TODO: should we shard on the maximal dim for param? Then we need
  # another helper for the output.
  partition_spec[0] = "fsdp"
  if extra_data_axis:
    partition_spec[0] = ("fsdp", extra_data_axis)
We usually have this reversed for DCN: `(extra_data_axis, 'fsdp')`. The axes should be in order of increasing network intensity in the mesh, and the order in the partition spec will impact the sharding.
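The reviewer's point can be sketched as a small helper. This is a hypothetical, simplified stand-in for `_prepare_spmd_partition_spec` (the real one operates on an actual parameter tensor); it only shows the axis ordering being discussed, with the lower-bandwidth axis first:

```python
# Hypothetical sketch, not the actual torch_xla helper: build a partition
# spec that shards dim 0, putting the cross-slice (DCN) axis before the
# higher-bandwidth 'fsdp' axis, as suggested in the review.
def prepare_partition_spec(num_dims, extra_data_axis=None):
    partition_spec = [None] * num_dims
    if extra_data_axis:
        # Lower network intensity first: (extra_data_axis, 'fsdp')
        partition_spec[0] = (extra_data_axis, "fsdp")
    else:
        partition_spec[0] = "fsdp"
    return tuple(partition_spec)
```

With `extra_data_axis="data"`, a 2-D parameter would get the spec `(("data", "fsdp"), None)`.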
mesh: Optional[spmd.Mesh] = None,
shard_output: Optional[Callable] = None,
auto_wrap_policy: Optional[Callable] = None,
auto_wrapper_callable: Optional[Callable] = None,
extra_data_axis: Optional[str] = None,
What do you think of calling it `replica_axis` instead of `extra_data_axis`?
I think `replica_axis` is too tied to the underlying technology, while users may only be familiar with data parallel, FSDP, and tensor parallel.
test/spmd/test_fsdp_v2.py (Outdated)
xs.mark_sharding(x, mesh, (('data', 'fsdp'), None))
output = model(x)
# Make sure the output is sharded.
annotation = '{devices=[4,1]0,2,1,3}'
This would be different from `x`'s sharding - `x` should have `[4,1]0,1,2,3` with the iota mesh. I left a comment below; I think we should reverse the order in `_prepare_spmd_partition_spec`.
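The difference between the two annotations can be checked with a plain-Python walk over a 2x2 iota mesh. This is an illustrative sketch of how the partition-spec axis order changes device enumeration, not torch_xla code:

```python
# 2x2 iota mesh as nested lists; axes ('data', 'fsdp'), device ids 0..3.
# mesh[data][fsdp] gives the device id at that mesh coordinate.
mesh = [[0, 1],
        [2, 3]]

# Sharding a dim over ('data', 'fsdp') enumerates fsdp fastest (row-major):
order_data_fsdp = [d for row in mesh for d in row]

# Sharding over ('fsdp', 'data') enumerates data fastest (column-major):
order_fsdp_data = [mesh[i][j] for j in range(2) for i in range(2)]
```

`order_data_fsdp` matches the `[4,1]0,1,2,3` annotation expected for `x`, while `order_fsdp_data` matches the `[4,1]0,2,1,3` annotation in the test, which is why the axis order in `_prepare_spmd_partition_spec` matters.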
My mistake. Thanks for pointing it out.
@JackCaoG The TPU CI doesn't seem to be running even with the label.
Yea, I think they only check the label when CI is being run. It is OK - if you have any changes and repush, it will run; otherwise we can let the run at head check it.
I'm landing it. If the master TPU CI breaks, let's deal with that later.
Summary:
This pull request adds multi-slice support for FSDPv2. The default setup uses the dcn axis as the data axis, meaning we only do data parallelism across slices. In the future, we could also support FSDP across multiple slices.
Test Plan:
PJRT_DEVICE=TPU python test/spmd/test_fsdp_v2.py