Adding megablox gmm standalone #6940
Conversation
cc @tgale96 for review.
GroupMetadata = Any  # TODO(enriqueps): Clean this up and use a namedtuple

def _make_group_metadata(
I wonder if you could trace the metadata function we have in the library with the GMM to avoid duplicating this tricky bit of code? If not this is fine, just curious.
Tracing doesn't seem to be an option AFAIK - though it would be great if we found a way to call the JAX implementation of this method and make this whole implementation leaner. I suggest we do it as a follow-up PR.
cc @alanwaketan
Why can't we just use the JAX version?
If we use the JAX version to do this compute, let's make sure we pass JAX CPU tensors so that this part of the compute can be done on CPU instead. As far as I can tell only group_sizes is used and it's 1D, so it should be pretty lightweight to compute. We should also benchmark this against the reference_gmm in case this part drastically increases the tracing time. On that note, we should cache the result.
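To illustrate the caching idea, a rough sketch (the helper name, its arguments, and what it returns are placeholders, not the actual JAX metadata implementation):

```python
import functools

import torch


@functools.lru_cache(maxsize=None)
def _cached_group_offsets(group_sizes: tuple[int, ...]) -> torch.Tensor:
  # Placeholder metadata computation: run it once per distinct group_sizes
  # value, on CPU, and reuse the result on later steps.
  sizes = torch.tensor(group_sizes, dtype=torch.int32)  # stays on CPU
  return torch.cumsum(sizes, dim=0)


# The cache key has to be hashable, so we pass the concrete sizes as a tuple.
offsets = _cached_group_offsets((3, 5, 8))
```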
Actually we can't do this, given that group_sizes is data produced in the middle of the graph. It means we would need a graph break.
Picking this up; now rebased from master to fix the conflicts. This PR should be ready to be reviewed/merged. I'll run the TPU CI to verify one more time.
@JackCaoG thanks for the comments, this should be ready for another round of review.
I added the TPUCI tag and reran the CI. Feel free to merge once the v4 test passes.
Hopefully I can take a look tomorrow after going over all the reading materials w.r.t. MoE and megablocks. If I can't get to it, feel free to land it as it is. We can always follow up.
Given that the CI (+ TPU CI) is green, I'll go ahead and merge this. I'll follow up with any fixes if needed.
@wonjoolee95 Do you think we can make a follow-up PR to simplify this before moving to tgmm?
lhs: torch.Tensor,
rhs: torch.Tensor,
group_sizes: torch.Tensor,
preferred_element_type: torch.dtype = torch.float32,
I think we should omit the preferred_element_type, tiling, group_offset, existing_out, transpose_rhs and interpret parameters unless we know for sure that users need them.
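For illustration, a sketch of the trimmed-down public signature that suggestion would leave (the body is elided; only the reduced parameter list is the point here):

```python
import torch


def gmm(lhs: torch.Tensor, rhs: torch.Tensor,
        group_sizes: torch.Tensor) -> torch.Tensor:
  """Grouped matmul: multiplies each row-group of lhs by its rhs slice."""
  # Body intentionally elided; tiling, group_offset, existing_out, etc. would
  # stay internal until a user actually needs them.
  raise NotImplementedError
```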
return (group_offsets, group_ids, m_tile_ids), num_tiles

def _zero_uninitialized_memory(
Why can't we just use the JAX version?
GroupMetadata = Any  # TODO(enriqueps): Clean this up and use a namedtuple

def _make_group_metadata(
Why can't we just use the JAX version?
import numpy as np

def _validate_args(
Why can't we just use the JAX version?
@@ -0,0 +1,22 @@
"""Common utilities for Pallas kernels."""
This file can be deleted if we directly use the helper from JAX.
@@ -0,0 +1 @@
from .gmm import gmm
Once we remove all the duplicated code, we can move this method back to custom_kernel.py.
from jax.experimental import pallas as pl

class MegabloxTest(unittest.TestCase):
Why can't we merge this to test_pallas.py?
group_offset_torch = torch.from_numpy(np.array(group_offset)).to("xla")
output_shape = torch.Size([m, n])
out = torch_xla._XLAC._xla_tpu_custom_call([
    num_active_tiles, group_metadata0, group_metadata1, group_metadata2,
We should only duplicate the logic that gets us these parameters. Anything else can be removed.
    group_offset_torch, lhs, rhs
], payload, [output_shape], [preferred_element_type])

if existing_out is None and num_current_groups < num_total_groups:
As far as I can tell, this is only needed once we have expert parallelism. I still can't tell whether we will get there.
class MegabloxTest(unittest.TestCase):

def _reference_gmm(
Can we just do it in torch instead of np?
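For reference, a minimal torch-only version of what that helper could look like (a sketch, assuming the same lhs [m, k], rhs [num_groups, k, n], group_sizes [num_groups] layout used in the test):

```python
import torch


def _reference_gmm(lhs: torch.Tensor, rhs: torch.Tensor,
                   group_sizes: torch.Tensor) -> torch.Tensor:
  """Naive grouped matmul: each row-group of lhs hits its own rhs slice."""
  out = []
  start = 0
  for g, size in enumerate(group_sizes.tolist()):
    # Rows [start, start + size) of lhs belong to group g.
    out.append(lhs[start:start + size] @ rhs[g])
    start += size
  return torch.cat(out, dim=0)
```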
start += group_sizes[i]
return np.array(np.concatenate(out, axis=0))

def _group_sizes_strategy(self, m: int, num_groups: int) -> torch.Tensor:
As far as I can tell, we just need to make sure our plumbing is correct; we don't need to verify that gmm itself is correct. That's JAX's job. So let's remove this and pick one or two cases tuned to our wrapper.
starts = np.concatenate([np.zeros(1, dtype=np.int32), ends_no_final])
return torch.from_numpy(ends - starts).to(torch.int32)

def _tolerances(self, lhs_dtype: torch.dtype, rhs_dtype: torch.dtype,
If we use torch, we don't need this. We can just use torch.allclose.
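For example (a sketch; `out` and `expected` stand for the kernel result and the reference result, and the tolerances mirror the values in the helper below):

```python
self.assertTrue(torch.allclose(out.cpu(), expected, atol=1e-3, rtol=1e-2))
```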
return 1e-3, 1e-2  # atol, rtol
return 1e-4, 1e-2  # atol, rtol

LutFn = Callable[[int, int, int], Optional[tuple[int, int, int]]]
What's this?
LutFn = Callable[[int, int, int], Optional[tuple[int, int, int]]]

def _init_test_cases(self):
We might not need all of these.
lhs = torch.rand(m, k, dtype=lhs_dtype).to('xla')
rhs = torch.rand(num_groups, k, n, dtype=rhs_dtype).to('xla')
group_sizes = self._group_sizes_strategy(m=m, num_groups=num_groups)
This is a CPU tensor!
lhs = torch.rand(m, k, dtype=lhs_dtype).to('xla')
rhs = torch.rand(num_groups, k, n, dtype=rhs_dtype).to('xla')
group_sizes = self._group_sizes_strategy(m=m, num_groups=num_groups)
out = megablox.gmm(lhs, rhs, group_sizes)
We always output fp32 in this test case regardless of the input dtypes.
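One possible adjustment, sketched here under the assumption that the wrapper keeps the preferred_element_type parameter shown earlier in the signature:

```python
# Ask the kernel for outputs in the input dtype instead of the float32
# default, so the test actually exercises the dtype combination it names.
out = megablox.gmm(lhs, rhs, group_sizes, preferred_element_type=lhs_dtype)
```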
Summary: This is an effort to refactor the code from #6940 and aims to remove unnecessary code in that part. It reduces the amount of code from ~400 lines to ~50 lines. However, a bummer is that the original gmm kernel doesn't work at all: it assumes group_sizes is a CPU tensor. That means we need to materialize this input in order to use the gmm kernel, and that will introduce graph breaks in the computation. I will need yet another follow-up to make this code actually functional. Good news is the test cases seem functional.

Test Plan: python test/test_megablox.py
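To make the graph-break concern concrete, a sketch of the failure mode (names like expert_assignments are illustrative, not from this PR):

```python
# group_sizes comes out of the routing computation, i.e. it is an XLA tensor
# produced in the middle of the traced graph.
group_sizes = torch.bincount(expert_assignments, minlength=num_groups)

# If the kernel needs concrete values (a CPU tensor), we have to materialize
# here, which cuts the graph in two and forces an early execution of
# everything computed so far.
group_sizes_cpu = group_sizes.cpu()  # graph break
out = megablox.gmm(lhs, rhs, group_sizes_cpu)
```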
In this PR, we add the megablox gmm kernel. The current implementation adds a new file, `megablox_gmm`. I plan to merge it to `custom_kernel` with the rest of the kernels.
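For completeness, a minimal usage sketch of the kernel as exposed in this PR (the import path is an assumption based on the new `__init__.py`; shapes follow the test, and group_sizes is left on CPU as the test does):

```python
import torch
from torch_xla.experimental import megablox  # import path is an assumption

m, k, n, num_groups = 128, 64, 32, 4
lhs = torch.rand(m, k, dtype=torch.float32).to('xla')              # [m, k] tokens
rhs = torch.rand(num_groups, k, n, dtype=torch.float32).to('xla')  # one [k, n] weight per group
# group_sizes[g] = number of consecutive rows of lhs that belong to group g.
group_sizes = torch.tensor([32, 32, 32, 32], dtype=torch.int32)

out = megablox.gmm(lhs, rhs, group_sizes)  # [m, n]
```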