Hi,
We upgraded torch-ccl from 2021.1-beta07-1 to 1.10 and noticed a performance regression for all_to_all: overall, ccl 1.10 is about 2x slower than 2021.1-beta07-1.
System config: single node, 2 processes per node, so no network communication is involved.
Any idea on the root cause?
all_to_all profiling for torch ccl 1.10
all_to_all profiling for torch ccl 2021.1-beta07-1
test code:
import torch
import extend_distributed as ext_dist

if __name__ == "__main__":
    ext_dist.init_distributed(backend='ccl')
    input = []
    tensor = torch.ones(262144, 16, dtype=torch.bfloat16)
    input.append(tensor)
    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            a2a_req = ext_dist.alltoall(input, None)
            ly_sparse = a2a_req.wait()
    print(prof.key_averages().table(sort_by="cpu_time_total"))
For extend_distributed, please refer to https://github.com/IntelAI/models/blob/master/models/recommendation/pytorch/dlrm/training/bfloat16/extend_distributed.py. Thanks!
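In case it helps reproduce the numbers without the DLRM helper, below is a rough standalone sketch of the same benchmark that calls torch.distributed.all_to_all_single directly on the ccl backend. The torch_ccl import name and the PMI_RANK/PMI_SIZE environment variables are assumptions about the 1.10-era wheel and an MPI-style launcher; adjust them to match your setup.

import os
import torch
import torch.distributed as dist

# Importing torch-ccl registers the "ccl" backend with torch.distributed.
# The module name differs across releases (torch_ccl for the 1.10-era wheel,
# oneccl_bindings_for_pytorch later); change the import to match your install.
import torch_ccl  # noqa: F401

if __name__ == "__main__":
    # Rank/world size are taken from MPI/PMI-style launcher variables here;
    # fall back to a single process if they are not set.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("PMI_RANK", os.environ.get("RANK", 0)))
    world_size = int(os.environ.get("PMI_SIZE", os.environ.get("WORLD_SIZE", 1)))
    dist.init_process_group("ccl", rank=rank, world_size=world_size)

    src = torch.ones(262144, 16, dtype=torch.bfloat16)
    dst = torch.empty_like(src)

    with torch.autograd.profiler.profile(True) as prof:
        for _ in range(10):
            # Default behavior splits dim 0 evenly across ranks,
            # roughly the same payload as the extend_distributed helper.
            dist.all_to_all_single(dst, src)

    if rank == 0:
        print(prof.key_averages().table(sort_by="cpu_time_total"))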