Gradient bucketing using a pre-defined bucket size cap #6417

Closed
amithrm wants to merge 10 commits from the bucket_allreduce branch

Conversation

@amithrm (Collaborator) commented Jan 30, 2024

No description provided.

@JackCaoG requested a review from alanwaketan on January 30, 2024 18:25
@alanwaketan (Collaborator):

Do you mind adding a test case?

@amithrm (Collaborator, Author) commented Mar 4, 2024

Added the test case and rebased @JackCaoG @alanwaketan

grad_bytes = grad.numel() * grad.element_size()

# Gradient is larger than bucket_cap, don't bucketize
if grad_bytes > bucket_cap:
Collaborator:

Curious why you want to specialize this case?

Collaborator (Author):

If grad_bytes (the bytes already in the tensor) is larger than the bucket cap, we send it straight away as a single tensor instead of bucketing it.

Collaborator:

Right, I understood the logic. But why? Does combining it with the bucket introduce some problems?

@jeffhataws (Collaborator) commented Mar 16, 2024:

Yeah, it looks like you can get rid of this if statement (up to the continue); the "if total > bucket_cap" check should take care of this condition when the bucket is empty.

Collaborator (Author):

The issue with combining this with the rest is that the "buffer" allocated in the underlying runtime may not have enough space to fit this large tensor. The idea is to have a large buffer that can fit all the tensors in a bucket. It can happen that total_bytes is just below the maximum allowed, and adding this tensor to the bucket would spill over that maximum. Hence it should go "alone" without bucketing.

Collaborator (Author):

I see your concerns now! Fixed the code flow.
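
For readers following along, the fixed flow discussed in this thread (flush the bucket before it would spill, and let an oversized gradient form a bucket of its own) could look roughly like the sketch below. This is an illustrative sketch, not the PR's actual diff: the helper name `_allreduce_in_buckets` and its parameters are assumptions, while `xm.all_reduce` and `xm.REDUCE_SUM` are existing torch_xla helpers.

```python
import torch_xla.core.xla_model as xm


def _allreduce_in_buckets(gradients, bucket_cap, scale=1.0, groups=None,
                          pin_layout=True):
  """Hypothetical sketch: all-reduce `gradients` in buckets of at most
  `bucket_cap` bytes; an oversized gradient ends up in a bucket of its own."""
  total = 0
  tensor_bucket = []
  for grad in gradients:
    grad_bytes = grad.numel() * grad.element_size()
    # Flush the tensors accumulated so far if adding this gradient would
    # spill over the cap (this also covers the case where the previous
    # gradient was itself larger than the cap).
    if tensor_bucket and total + grad_bytes > bucket_cap:
      xm.all_reduce(
          xm.REDUCE_SUM,
          tensor_bucket,
          scale=scale,
          groups=groups,
          pin_layout=pin_layout)
      total = 0
      tensor_bucket = []
    total += grad_bytes
    tensor_bucket.append(grad)
  # Flush whatever is left at the end.
  if tensor_bucket:
    xm.all_reduce(
        xm.REDUCE_SUM,
        tensor_bucket,
        scale=scale,
        groups=groups,
        pin_layout=pin_layout)
```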

@@ -990,14 +1042,13 @@ def reduce_gradients(optimizer, groups=None, pin_layout=True):
"""
count = xrt_world_size()
if count > 1:
gradients = _fetch_gradients(optimizer)
Collaborator:

Can we keep the original behavior? And maybe use a flag to turn this feature on?

Collaborator (Author):

OK, let me work on that.

Collaborator:

Maybe we should introduce an argument "bucket_cap_mb" that turns this on, instead of an environment variable? bucket_cap_mb=0 would turn bucketing off and be the default?

Collaborator (Author):

Done.
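
A possible shape for that interface, sketched as if it lived next to reduce_gradients in xla_model.py and reusing the `_allreduce_in_buckets` sketch from the earlier thread; the exact signature and defaults in the PR may differ:

```python
def reduce_gradients(optimizer, groups=None, pin_layout=True, bucket_cap_mb=0):
  """Hypothetical sketch: bucket_cap_mb=0 (the default) keeps the original
  per-tensor all_reduce behavior; a positive value enables bucketing."""
  count = xrt_world_size()
  if count > 1:
    gradients = _fetch_gradients(optimizer)
    if bucket_cap_mb > 0:
      _allreduce_in_buckets(
          gradients,
          bucket_cap=bucket_cap_mb * 1024 * 1024,
          scale=1.0 / count,
          groups=groups,
          pin_layout=pin_layout)
    else:
      all_reduce(
          REDUCE_SUM,
          gradients,
          scale=1.0 / count,
          groups=groups,
          pin_layout=pin_layout)
```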

# Bucketize till the total spills over
total += grad_bytes
if total > bucket_cap:
  all_reduce(
Collaborator:

Need to check "if len(tensor_bucket):" because tensor_bucket can be empty at the start, when grad_bytes > bucket_cap.
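
Concretely, the guard being suggested could look like the fragment below, placed inside the bucketing loop; variable names follow the diff excerpt above, and scale, groups, and pin_layout are assumed to come from the enclosing function.

```python
total += grad_bytes
if total > bucket_cap and len(tensor_bucket):
  # Only flush when something has been accumulated; this avoids calling
  # all_reduce on an empty list when the very first gradient already
  # exceeds bucket_cap.
  all_reduce(
      REDUCE_SUM,
      tensor_bucket,
      scale=scale,
      groups=groups,
      pin_layout=pin_layout)
  total = grad_bytes
  tensor_bucket = []
tensor_bucket.append(grad)
```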

@@ -974,6 +976,56 @@ def wait_device_ops(devices=[]):
  torch_xla._XLAC._xla_wait_device_ops(devices=devices)


def bucketed_allreduce(gradients):
Collaborator:

Maybe name it similarly to the original function all_reduce? How about all_reduce_bucketized?

Also, do you need to pass "groups" and "pin_layout" as well?
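
The suggested rename and signature might look like the sketch below, with groups and pin_layout forwarded so the bucketed path matches all_reduce; the defaults and the reuse of the `_allreduce_in_buckets` sketch from above are assumptions, not the PR's final code.

```python
def all_reduce_bucketized(gradients, bucket_cap_mb, scale=1.0, groups=None,
                          pin_layout=True):
  """Hypothetical sketch of a bucketed counterpart to all_reduce."""
  # Forward groups/pin_layout so bucketed and non-bucketed paths behave alike.
  _allreduce_in_buckets(
      gradients,
      bucket_cap=bucket_cap_mb * 1024 * 1024,
      scale=scale,
      groups=groups,
      pin_layout=pin_layout)
```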

@alanwaketan (Collaborator) left a review comment:

LGTM. Please address other comments as well.

@amithrm force-pushed the bucket_allreduce branch from 777b97f to 31dd451 on May 28, 2024 20:20
@jeffhataws (Collaborator):

@JackCaoG do you know why the build failed with "ERROR: Error initializing RemoteModule"?

@JackCaoG (Collaborator) commented May 28, 2024:

The PR is on a fork, so it can't use the remote cache, but there was a bug where it still tried to query the credentials. I think we fixed that error today; it should start building without the cache. If you rebase, the CI should start running.

@amithrm force-pushed the bucket_allreduce branch from 31dd451 to 05e2367 on May 29, 2024 02:56
@jeffhataws (Collaborator):

@JackCaoG it looks like the build is still failing for some reason after rebasing. Maybe another rebase is needed?

@JackCaoG (Collaborator):

The error still seems to be related to the fork. Let me grant both of you write access; then you can open the PR directly.

@JackCaoG (Collaborator):

OK, I gave @amithrm write access.

@jeffhataws (Collaborator):

Replaced by #7216 to avoid the build issues in CI testing.

@jeffhataws closed this on Jun 7, 2024.