Currently, reshape3d_alltoall for N ranks runs N packing kernels before the MPI_Alltoall and N unpacking kernels after it. As the rank count grows, the overhead of launching and waiting on those kernels grows linearly with N. In sufficiently regular cases, the loop over ranks in heffte::reshape3d_alltoall::apply_base() can be lowered into the device kernel. I have working SYCL code that does that and shows a clear performance improvement even for small N. Is this an optimization you'd consider incorporating if I contribute it?
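To make the idea concrete, here is a minimal CPU-only sketch in plain C++, not the SYCL prototype and not heFFTe's actual code. The function names, the 2D slab layout, and the launch counter are all assumptions chosen for illustration. It models the regular case, where every destination rank receives a block of the same size, so the rank index can be recovered from the flat work-item index with a division.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of packing before MPI_Alltoall (illustrative names only).
// The local data is a rows x (nranks * width) row-major array, and destination
// rank r receives the column slab [r * width, (r + 1) * width).

// Per-rank version: one "launch" per destination rank, mirroring a host-side
// loop over ranks. Returns how many launches a device backend would issue.
inline int pack_per_rank(const std::vector<double>& src, std::vector<double>& send,
                         std::size_t rows, std::size_t width, std::size_t nranks) {
    assert(src.size() == rows * width * nranks && send.size() == src.size());
    const std::size_t cols = nranks * width;
    int launches = 0;
    for (std::size_t r = 0; r < nranks; ++r) {
        ++launches; // stands in for one kernel launch plus its wait
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < width; ++j)
                send[r * rows * width + i * width + j] = src[i * cols + r * width + j];
    }
    return launches;
}

// Fused version: a single "launch" over all elements. Each flat index k plays
// the role of one work-item and derives its rank, row, and column itself.
inline int pack_fused(const std::vector<double>& src, std::vector<double>& send,
                      std::size_t rows, std::size_t width, std::size_t nranks) {
    assert(src.size() == rows * width * nranks && send.size() == src.size());
    const std::size_t cols = nranks * width;
    const std::size_t block = rows * width; // elements sent to each rank
    for (std::size_t k = 0; k < nranks * block; ++k) {
        const std::size_t r = k / block;   // destination rank
        const std::size_t rem = k % block; // position inside that rank's block
        const std::size_t i = rem / width; // row in the local array
        const std::size_t j = rem % width; // column inside the slab
        send[k] = src[i * cols + r * width + j];
    }
    return 1;
}
```

Both functions fill the send buffer identically. The difference is only in how many launches are needed, which is the overhead that grows with N in the per-rank version.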
As a general rule, anything that improves performance is worth considering and should probably be included. If you want, you can point me to the prototype before you bother making a formal PR, or you can just give me the kernel and I can integrate it with the rest of the code and the other backends.
On the other hand, I don't recommend running on so many nodes with so little data per node that kernel launches become an issue, but then again, it is a valid use case.
If you merge the for-loop into the kernel, each iteration of the loop will handle a different amount of data, which in itself can lead to performance issues. This is precisely why I didn't do it in CUDA; in addition, calling one packing kernel at a time makes it easier to pipeline packing and sending. I can certainly see how the SYCL logic would be easier to generalize (hopefully without loss of performance), and since all-to-all doesn't pipeline, we could get a performance boost here.
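One possible way around the unequal-block concern, sketched below in plain C++ as a CPU model, is to give every element its own flat index and look up the owning rank in a prefix-sum offset table. This is only an assumption about how a fused kernel could be organized, not a description of the SYCL prototype or of heFFTe; the name pack_fused_irregular and the column-slab layout are made up for the example. The work per index is uniform, at the cost of a binary search per element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fused packing with unequal block sizes (illustrative only).
// The local data is a rows x cols row-major array. Destination rank r receives
// columns [col_start[r], col_start[r + 1]), so widths may differ per rank.
inline void pack_fused_irregular(const std::vector<double>& src,
                                 std::vector<double>& send, std::size_t rows,
                                 const std::vector<std::size_t>& col_start) {
    assert(col_start.size() >= 2 && col_start.front() == 0);
    const std::size_t nranks = col_start.size() - 1;
    const std::size_t cols = col_start.back();
    assert(src.size() == rows * cols && send.size() == src.size());

    // offsets[r] is where rank r's block begins in the send buffer.
    std::vector<std::size_t> offsets(nranks + 1);
    for (std::size_t r = 0; r <= nranks; ++r) offsets[r] = rows * col_start[r];

    // Single pass over all elements; each k stands in for one work-item.
    for (std::size_t k = 0; k < rows * cols; ++k) {
        // Owning rank: last r with offsets[r] <= k (empty blocks are skipped).
        const std::size_t r =
            std::upper_bound(offsets.begin(), offsets.end(), k) - offsets.begin() - 1;
        const std::size_t w = col_start[r + 1] - col_start[r]; // slab width
        const std::size_t rem = k - offsets[r];
        const std::size_t i = rem / w; // row in the local array
        const std::size_t j = rem % w; // column inside the slab
        send[k] = src[i * cols + col_start[r] + j];
    }
}
```

For example, with 2 rows, 6 columns, and col_start = {0, 1, 4, 6}, the three ranks receive 2, 6, and 4 elements respectively, yet every flat index does the same amount of work.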