Hi Torch Team,

I am currently experimenting with native torch float8 distributed training using the delayed scaling recipe on GPT 1.5B with DDP at batch=12, seq=1024 on an HGX 8x H100 (700W H100 SXM 80G SKU).

Currently, I am running into a DDP + torch.compile + float8 bug. Without enabling torch.compile, I do not run into this error. I have tried #1306 as well as main@latest. Attached below are a self-contained reprod script and the error trace.
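For context, here is a minimal sketch of the kind of setup described above; it is not the attached reprod script. The torchao float8 delayed-scaling API names used here (convert_to_float8_training, Float8LinearConfig, CastConfig, ScalingType, sync_float8_amax_and_scale_history) are assumed from the torchao release available at the time, and the toy two-layer model, random data, and optimizer are stand-ins for the actual GPT 1.5B training loop.

```python
# Minimal sketch of the failing configuration (DDP + torch.compile + float8
# delayed scaling). This is NOT the attached reprod script; the torchao API
# names below are assumptions based on the torchao float8 docs of the time.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchao.float8 import (
    CastConfig,
    Float8LinearConfig,
    ScalingType,
    convert_to_float8_training,
    sync_float8_amax_and_scale_history,
)


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Small stand-in for the GPT 1.5B model used in the real run.
    model = torch.nn.Sequential(
        torch.nn.Linear(2048, 8192, bias=False),
        torch.nn.GELU(),
        torch.nn.Linear(8192, 2048, bias=False),
    ).to(device="cuda", dtype=torch.bfloat16)

    # Delayed-scaling recipe for the input, weight, and grad_output casts.
    fp8_config = Float8LinearConfig(
        cast_config_input=CastConfig(scaling_type=ScalingType.DELAYED),
        cast_config_weight=CastConfig(scaling_type=ScalingType.DELAYED),
        cast_config_grad_output=CastConfig(scaling_type=ScalingType.DELAYED),
    )
    convert_to_float8_training(model, config=fp8_config)

    model = DDP(model, device_ids=[rank])
    model = torch.compile(model)  # removing this line avoids the reported error
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(10):
        # batch=12, seq=1024, hidden=2048 random data as a placeholder.
        x = torch.randn(12, 1024, 2048, device="cuda", dtype=torch.bfloat16)
        loss = model(x).float().pow(2).mean()
        loss.backward()
        # Delayed scaling requires syncing the amax/scale history every step.
        sync_float8_amax_and_scale_history(model)
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A script along these lines would typically be launched with torchrun --nproc_per_node=8; the author's actual launch commands are in the Commands section below.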
Commands

Error Trace
Reprod Script
Torch Versions
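Hi @OrenLeung, I also repro this. We haven't worked on enabling float8 + compile + DDP yet, as we found that FSDP is significantly more common in jobs which are large enough to benefit from float8 training. Wondering if you are open to FSDP with NO_SHARD instead of DDP? Context: https://discuss.pytorch.org/t/difference-between-ddp-vs-fsdp-no-shard/209729

For reference, a minimal sketch of that suggestion: it assumes the model has already been converted with convert_to_float8_training as in the earlier sketch, and simply swaps the DDP wrapper for FSDP with ShardingStrategy.NO_SHARD.

```python
# Sketch of the suggested alternative: FSDP with ShardingStrategy.NO_SHARD
# in place of the DDP wrapper. NO_SHARD keeps parameters fully replicated on
# every rank, so it behaves much like DDP while using the FSDP code path that
# the float8 + compile work has focused on so far.
# `model` is assumed to already be float8-converted (convert_to_float8_training)
# as in the sketch above.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.NO_SHARD,
    use_orig_params=True,  # commonly used when combining FSDP with torch.compile
)
model = torch.compile(model)
```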