
How do you remat GSPMD inserted all-gathers? #25010

Open
ptoulme-aws opened this issue Nov 20, 2024 · 1 comment
Labels
question Questions for the JAX team

Comments

@ptoulme-aws

ptoulme-aws commented Nov 20, 2024

Problem: I have some JAX code that does sequence parallelism, somewhat similar to this:

activation = jax.lax.with_sharding_constraint(activation, NamedSharding(mesh, PartitionSpec('data', 'tensor', None)))
activation = norm(activation)
activation = jax.lax.with_sharding_constraint(activation, NamedSharding(mesh, PartitionSpec(None, 'tensor', None)))
# I want to remat this one ^
activation = attention(activation)

I have tried everything I can to remat the activation directly before attention, including JAX remat policies and explicitly applying jax.checkpoint around that exact tensor, but nothing seems to make it remat. The activation directly before attention is produced by a GSPMD-inserted all-gather on the sequence dimension (dim=0).
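
For reference, one attempt looked roughly like this (a sketch rather than my exact code, continuing the snippet above):

# Sketch of one attempt: wrap the sub-computation that consumes the
# all-gathered activation in jax.checkpoint with a policy that saves
# nothing, hoping the GSPMD-inserted all-gather would be recomputed
# in the backward pass instead of its output being saved.
attention_remat = jax.checkpoint(
    attention,
    policy=jax.checkpoint_policies.nothing_saveable,
)
activation = attention_remat(activation)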

I ended up writing an XLA pass to rematerialize large all-gathers and submitted it as a PR: openxla/xla#19163

Question: Is this possible to do from the JAX end, or is my pass really needed?

@mattjj
Collaborator

mattjj commented Nov 20, 2024

Thanks for the question.

No, I don't think a new pass is needed.

As I understand it, the standard way to spell this is to use a remat policy to mark the with_sharding_constraint that induces the all-gather as not-saveable. One way to do that would be to use save_only_these_names and to name only other arrays (ones that are either upstream of the all-gather-inducing with_sharding_constraint, or downstream of the operations that use the output of attention). Following your snippet, that might look something like:

activation = jax.lax.with_sharding_constraint(activation, NamedSharding(mesh, PartitionSpec('data', 'tensor', None)))
activation = checkpoint_name(norm(activation), 'scattered_activations')
activation = jax.lax.with_sharding_constraint(activation, NamedSharding(mesh, PartitionSpec(None, 'tensor', None)))
activation = attention(activation)

together with a save_only_these_names policy that mentions 'scattered_activations' or something upstream of it.
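
To make that concrete, here is a minimal runnable sketch; the mesh shape and the norm/attention bodies are placeholders, not your real model:

import jax
import jax.numpy as jnp
import numpy as np
from jax.ad_checkpoint import checkpoint_name
from jax.sharding import Mesh, NamedSharding, PartitionSpec

# Placeholder mesh; substitute your real device topology.
devices = np.array(jax.devices()).reshape(1, -1)
mesh = Mesh(devices, ('data', 'tensor'))

def norm(x):  # stand-in for the real norm layer
    return x / (jnp.linalg.norm(x, axis=-1, keepdims=True) + 1e-6)

def attention(x):  # stand-in for the real attention layer
    return x

def block(activation):
    activation = jax.lax.with_sharding_constraint(
        activation, NamedSharding(mesh, PartitionSpec('data', 'tensor', None)))
    # Name the still-scattered activations so the policy below saves them.
    activation = checkpoint_name(norm(activation), 'scattered_activations')
    # This constraint induces the all-gather; its output is not named,
    # so under save_only_these_names it is rematerialized, not saved.
    activation = jax.lax.with_sharding_constraint(
        activation, NamedSharding(mesh, PartitionSpec(None, 'tensor', None)))
    return attention(activation)

block_remat = jax.checkpoint(
    block,
    policy=jax.checkpoint_policies.save_only_these_names('scattered_activations'),
)

grads = jax.jit(jax.grad(lambda x: block_remat(x).sum()))(jnp.ones((8, 4, 16)))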

Did you try something like that? If you already tried it, we should put together a minimal example to debug what's going on.

mattjj added the question (Questions for the JAX team) label and removed the enhancement (New feature or request) label on Nov 21, 2024