Communication and compute on separate Streams do not overlap #64

Open
garrett361 opened this issue May 28, 2024 · 0 comments
Cross-posting this issue from ipex, in case the torch-ccl team is not aware of it.

Key issues:

  • Compute and collective communications do not overlap on Intel GPU devices
  • Collectives block the host thread, rather than launching a kernel and returning immediately (as they do on NVIDIA devices); see the sketch below for the kind of pattern affected
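For context, here is a minimal sketch of the overlap pattern in question, assuming a standard `torch.distributed` setup with an async collective; it is illustrative only, and the actual reproducer is in the linked ipex issue:

```python
import torch
import torch.distributed as dist

# Illustrative sketch only, not the reproducer from the linked ipex issue.
# Assumes the default process group is already initialized (e.g. via torchrun)
# and that `device` is this rank's accelerator.
def overlapped_allreduce_and_matmul(device: torch.device) -> None:
    comm_tensor = torch.randn(2**24, device=device)
    a = torch.randn(4096, 4096, device=device)
    b = torch.randn(4096, 4096, device=device)

    # Launch the collective asynchronously. On NVIDIA devices this call
    # returns immediately and the communication runs on a separate stream,
    # so the matmuls below can overlap with it.
    work = dist.all_reduce(comm_tensor, async_op=True)

    # Independent compute that should run concurrently with the all_reduce.
    for _ in range(10):
        a = a @ b

    # Only block once the reduced tensor is actually needed.
    work.wait()
```

On the Intel Max 1550 traces below, the `all_reduce` call in a pattern like this blocks the host until the collective finishes, so the compute never overlaps with the communication.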

The PyTorch profiler traces highlight the issues (copied from the other thread):

A100 Trace

[nvidia_a100_trace: profiler trace image]

Non-blocking kernel launch and comms/compute overlap.

Intel Max 1550 Trace

[intel_1550_trace: profiler trace image]

Blocking kernel launch and no comms/compute overlap.

See the other thread for more details.
