If you are OK with distributed communications happening in high precision, then (2) and (3) are independent. If you are interested in doing the all-gathers in low precision, then (2) and (3) interact to coordinate the low-precision casting.
Here is an e2e example of (2) and (3) interacting with all-gathers happening in low precision:

If you were to take that example and replace Float8ColwiseParallel with ColwiseParallel, etc., then you'd get to a point where (2) and (3) are independent.
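For concreteness, here is a minimal sketch of how (2) and (3) could be combined, assuming torchao's `convert_to_float8_training` entry point and the float8-aware TP styles (`Float8ColwiseParallel`, `Float8RowwiseParallel`) from `torchao.float8.float8_tensor_parallel`. The toy `FeedForward` module and the launch details are illustrative only, and exact import paths may vary across torchao/PyTorch versions, so treat this as a sketch rather than the canonical example:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module

from torchao.float8 import convert_to_float8_training
from torchao.float8.float8_tensor_parallel import (
    Float8ColwiseParallel,
    Float8RowwiseParallel,
)


class FeedForward(nn.Module):
    """Toy two-linear block, standing in for a transformer MLP (illustration only)."""

    def __init__(self, dim: int = 4096, hidden: int = 16384):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w2(torch.relu(self.w1(x)))


# Assumes a distributed launch, e.g. `torchrun --nproc_per_node=8 this_script.py`.
dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = FeedForward().cuda()

# (2) Swap nn.Linear -> Float8Linear so the matmuls run in float8.
convert_to_float8_training(model)

# (3) Apply tensor parallelism with the float8-aware TP styles, so the TP
# all-gathers are done on the already-cast float8 tensors. Replacing
# Float8ColwiseParallel / Float8RowwiseParallel with the plain
# ColwiseParallel / RowwiseParallel keeps the comms in high precision,
# which is the "independent" case described above.
tp_mesh = init_device_mesh("cuda", (dist.get_world_size(),))
parallelize_module(
    model,
    tp_mesh,
    {
        "w1": Float8ColwiseParallel(),
        "w2": Float8RowwiseParallel(),
    },
)
```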
We know that Transformer_Engine has support for FP8 training with data parallel + tensor parallel + sequence parallel: https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/advanced_optimizations.html#Multi-GPU-training. However, when I checked the source code of swap_linear_modules and Float8Linear, and the documentation/discussions of torchao, I could only see support for FP8 with FSDP (as far as I know).
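For reference, the FSDP-only path mentioned above looks roughly like the sketch below. It assumes torchao's `convert_to_float8_training` / `Float8LinearConfig` entry points (the module-swap helper referred to as swap_linear_modules above) and FSDP2's `fully_shard`; the `enable_fsdp_float8_all_gather` flag and the import paths may differ across versions, so this is only a rough sketch:

```python
import torch.nn as nn
# On older PyTorch releases fully_shard lives under torch.distributed._composable.fsdp.
from torch.distributed.fsdp import fully_shard
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Toy model for illustration; assumes the process group is already initialized
# (e.g. via torchrun) before sharding.
model = nn.Sequential(
    nn.Linear(4096, 16384, bias=False),
    nn.ReLU(),
    nn.Linear(16384, 4096, bias=False),
).cuda()

# Swap nn.Linear -> Float8Linear; the optional flag also makes the FSDP
# weight all-gather happen in float8 rather than the original dtype.
config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
convert_to_float8_training(model, config=config)

# Shard the converted model with FSDP2.
fully_shard(model)
```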
So, does torchao also have support for tensor parallelism and sequence parallelism with Float8Linear?

Thanks!