
Is there any plan for overlap and fused GEMM for JAX? #320

Closed
MoFHeka opened this issue Jul 14, 2023 · 6 comments

MoFHeka commented Jul 14, 2023

By combining with Alpa?

nouiz (Collaborator) commented Jul 14, 2023

The TE project doesn't plan to add more parallelism than what each framework already supports.
JAX already supports many forms of parallelism.
PAXml also adds some support for pipeline parallelism.
There is no plan to merge Alpa (or other similar projects) with TE.

TE tries to support as many forms of parallelism as it can. We are actively working on that for JAX.
You can also look at Rosetta, our supported models: https://github.com/NVIDIA/JAX-Toolbox/
It is small right now, but it will grow.

MoFHeka (Author) commented Jul 21, 2023

A few days ago I asked the core developers of Alpa, Yonghao Zhuang and Hao Zhang, and they told me that compiler optimization for DL models is no longer suitable for this era, and asked me to ask NVIDIA. In fact, Alpa's pipeline parallelism, for example, is difficult to integrate with JAX, and so is the sharding constraint mechanism JAX uses to support sequence parallelism. Their final recommendation was not to use Alpa/JAX without TPU.

MoFHeka (Author) commented Jul 21, 2023

But if we just use the Megatron framework, it has a lot of limitations. So is there any roadmap for supporting more frameworks and more techniques, such as https://arxiv.org/abs/2105.05720?

nouiz added the jax label Jan 27, 2024
nouiz (Collaborator) commented Jan 27, 2024

Quick update: we are adding sequence parallelism (SP) in this PR:
#602

At the end of last year we changed how TE tries to parallelize. It used to rely on xmap (hardcoding some cases); now it uses custom_partitioning. So all TE operations should act as native XLA operations and should respect uses of with_sharding_constraint().

This way, end users should be able to trigger all SPMD parallelism just by setting the input/output shardings or by adding with_sharding_constraint() in the right places.
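
Below is a minimal sketch of that workflow in plain JAX (the layer, axis name, and shapes are made up for illustration and are not TE's API): parallelism comes entirely from the shardings passed to jax.jit and from with_sharding_constraint(), which TE operations should now honor like any native XLA op. It assumes a recent JAX version where jax.jit accepts in_shardings/out_shardings.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# A 1D device mesh; the axis name "dp" is an arbitrary choice for this sketch.
mesh = Mesh(np.array(jax.devices()), ("dp",))

def layer(x, w):
    y = x @ w
    # Ask XLA to keep the activations sharded along "dp"; TE ops implemented
    # with custom_partitioning should respect constraints like this the same
    # way native XLA ops do.
    y = jax.lax.with_sharding_constraint(y, NamedSharding(mesh, P("dp", None)))
    return jax.nn.relu(y)

x = jnp.ones((8, 128))
w = jnp.ones((128, 128))

jitted = jax.jit(
    layer,
    in_shardings=(NamedSharding(mesh, P("dp", None)),
                  NamedSharding(mesh, P(None, None))),
    out_shardings=NamedSharding(mesh, P("dp", None)),
)
print(jitted(x, w).shape)  # (8, 128)
```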

The PR above makes it even simpler for SP.
I'll close this issue. If you don't think we have the proper building blocks, re-open it.
If you have specific requests, open new ones.

Note that the computation/communication overlap is work that started in XLA; TE/JAX can't control that. There are some XLA_FLAGS that allow enabling more overlap or playing with some configuration options. Models in JAX-Toolbox use some of them for speedups. We are hoping to enable more of those cases by default over the year.
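
As a hedged illustration of what that looks like in practice (flag names and defaults change across XLA releases, so treat this as an example and check the JAX-Toolbox docs for your build), one commonly used option is XLA's latency-hiding scheduler, which lets the compiler overlap collectives with compute:

```python
import os

# XLA_FLAGS must be set before JAX/XLA initializes, e.g. in the shell or at
# the very top of the script.
os.environ["XLA_FLAGS"] = (
    os.environ.get("XLA_FLAGS", "")
    + " --xla_gpu_enable_latency_hiding_scheduler=true"
)

import jax  # imported after setting the flag so XLA picks it up

print(jax.devices())
```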

MoFHeka (Author) commented Jan 28, 2024

@nouiz I have seen this update, thank you for your work.
Another thing: do you have any plans to optimize the layer kernels of Praxis? JAX (Flax or Praxis) attention layers are constructed from einsum kernels, which can't be lowered to cuDNN GEMM or the latest cuDNN XLA FMHA kernel. When running attention layers on GPU, they can only be transformed into Triton kernels...
TE currently only supports a limited number of transformer models (for example, MoE is difficult to support) and does not yet support LoRA SFT. So it may be necessary to optimize the layer composition of the JAX ecosystem.
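
For context, this is roughly the einsum-style attention pattern that Flax/Praxis-like layers build (shapes and names here are illustrative, not the actual Praxis code); unless XLA pattern-matches a graph like this into a fused attention kernel, it stays as separate GEMM and softmax ops:

```python
import jax
import jax.numpy as jnp

def einsum_attention(q, k, v):
    # q, k, v: [batch, seq, heads, head_dim]
    scale = q.shape[-1] ** -0.5
    # Scores and weighted values expressed as einsums, the pattern described
    # above as hard to lower to the cuDNN fused attention kernel.
    logits = jnp.einsum("bqhd,bkhd->bhqk", q, k) * scale
    probs = jax.nn.softmax(logits, axis=-1)
    return jnp.einsum("bhqk,bkhd->bqhd", probs, v)

q = k = v = jnp.ones((2, 128, 8, 64))
print(jax.jit(einsum_attention)(q, k, v).shape)  # (2, 128, 8, 64)
```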

MoFHeka (Author) commented Jan 29, 2024

The last question is the same as NVIDIA/JAX-Toolbox#502.

MoFHeka closed this as completed Jan 29, 2024