
Does Float8Linear support Tensor Parallelism and Sequence Parallelism? #1198

Open
zigzagcai opened this issue Oct 30, 2024 · 1 comment

zigzagcai commented Oct 30, 2024

We know that Transformer Engine supports FP8 training with data parallel + tensor parallel + sequence parallel: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/advanced_optimizations.html#Multi-GPU-training

However, when I checked the source code of swap_linear_modules and Float8Linear, and the torchao documentation and discussions, I could only find support for FP8 with FSDP (as far as I know).

So, does torchao also support tensor parallelism and sequence parallelism with Float8Linear?

Thanks!

vkuzo (Contributor) commented Oct 30, 2024

Hi @zigzagcai, we do support TP and SP, implemented via DTensor (https://pytorch.org/docs/stable/distributed.tensor.parallel.html). We have designed the float8 APIs to be orthogonal to TP/SP, so for the user the workflow would be:

  1. start with a high precision model
  2. convert parts of the model to float8 (torchao.float8)
  3. define the distributed strategy for the model (https://pytorch.org/docs/stable/distributed.tensor.parallel.html)

If you are ok with distributed communications happening in high precision, then (2) and (3) are independent. If you are interested in doing the all-gathers in low precision, then (2) and (3) interact to coordinate on low precision casting.
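
For concreteness, here is a minimal sketch of the independent case: convert with torchao.float8 first, then apply a standard DTensor TP plan so communications stay in high precision. The toy FFN module, the "w1"/"w2" names, and the mesh setup are illustrative assumptions, not something torchao prescribes.

```python
# Minimal sketch of the "independent" case: convert to float8 first, then apply
# a standard DTensor TP plan so communications stay in high precision.
# The toy FFN module, the "w1"/"w2" names, and the mesh setup are illustrative
# assumptions.
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from torchao.float8 import convert_to_float8_training


class FFN(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


# (1) start with a high precision model
model = FFN(dim=1024, hidden=4096).cuda()

# (2) convert parts of the model to float8
convert_to_float8_training(model)

# (3) define the distributed strategy; assumes torch.distributed is already
# initialized (e.g. via torchrun). Plain ColwiseParallel/RowwiseParallel keep
# the all-gathers in high precision, so steps (2) and (3) do not interact.
mesh = init_device_mesh("cuda", (torch.distributed.get_world_size(),))
parallelize_module(
    model,
    mesh,
    {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
)
```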

Here is an e2e example of (2) and (3) interacting with all-gathers happening in low precision: see `_test_fp8_mlp_tensor_parallelism_base` in the torchao float8 tests. If you were to take that example and replace Float8ColwiseParallel with ColwiseParallel, etc., then you'd get to a point where (2) and (3) are independent.
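
As a rough sketch of the low-precision all-gather variant, the plan from the earlier snippet could swap in the Float8 parallel styles mentioned above. The import path (torchao.float8.float8_tensor_parallel) is an assumption based on that test file, and "model", "mesh", and the "w1"/"w2" names reuse the earlier sketch.

```python
# Sketch of the "interacting" case: same plan as above, but using the Float8
# parallel styles so the float8 cast is coordinated with TP/SP and the
# all-gathers happen in low precision. Import path is an assumption based on
# the torchao test referenced above.
from torch.distributed.tensor.parallel import parallelize_module
from torchao.float8.float8_tensor_parallel import (
    Float8ColwiseParallel,
    Float8RowwiseParallel,
)

parallelize_module(
    model,  # already converted with convert_to_float8_training, as above
    mesh,
    {"w1": Float8ColwiseParallel(), "w2": Float8RowwiseParallel()},
)
```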

A full e2e training example of how all of this fits together is https://github.com/pytorch/torchtitan.

Let me keep this issue open to track adding more information to the README about how this works. Thanks for the question.
