Fuse packing/unpacking kernels for reshape3d_alltoall #24

Open
mabraham opened this issue May 12, 2023 · 1 comment
@mabraham
Contributor

Currently reshape3d_alltoall for N ranks runs N packing kernels before the MPI_Alltoall and N unpacking kernels after it. As the rank count grows, the overhead of launching and waiting on those kernels grows linearly with N. In sufficiently regular cases, the loop over ranks in heffte::reshape3d_alltoall::apply_base() can be lowered into the device kernel, so a single launch packs (or unpacks) the blocks for all ranks at once. I have working SYCL code that does this and shows a clear performance improvement even for small N. Is this an optimization you'd consider incorporating if I contribute it?
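To make the proposal concrete, here is a minimal sketch of what a fused packing kernel could look like in SYCL, assuming the regular case where every rank's block has identical extents, the source brick is split along x, and both pointers are USM device allocations. The function name, parameters, and layout are illustrative assumptions, not heFFTe's actual API or my prototype:

```cpp
#include <sycl/sycl.hpp>

// Hypothetical fused pack: one launch copies the blocks for all ranks,
// instead of one kernel launch per rank.
// src: local brick, x fastest, with strides (1, stride_y, stride_z)
// dst: send buffer for MPI_Alltoall; rank r's block starts at r * block_size
void fused_pack(sycl::queue &q, const float *src, float *dst,
                int num_ranks,
                int nx, int ny, int nz,       // extents of one rank's block
                size_t stride_y, size_t stride_z,
                size_t rank_offset_x) {       // x-offset between rank blocks
    size_t const block_size = static_cast<size_t>(nx) * ny * nz;
    q.parallel_for(sycl::range<3>(static_cast<size_t>(num_ranks) * nz, ny, nx),
                   [=](sycl::id<3> idx) {
        size_t const r  = idx[0] / nz;   // which rank's block this item packs
        size_t const iz = idx[0] % nz;
        size_t const iy = idx[1];
        size_t const ix = idx[2];
        // strided read from the local brick, offset along x per rank
        size_t const s = iz * stride_z + iy * stride_y + r * rank_offset_x + ix;
        // contiguous write into rank r's slot of the send buffer
        size_t const d = r * block_size + (iz * ny + iy) * nx + ix;
        dst[d] = src[s];
    }).wait();
}
```

A single launch replaces N launches, so the launch and synchronization overhead stops scaling with rank count; the same index arithmetic with source and destination swapped gives the fused unpack.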

@mkstoyanov
Collaborator

As a general rule, anything that improves performance is worth considering and should probably be included. If you want, you can point me to the prototype code before you go to the trouble of a formal PR, or even just give me the kernel and I can handle the integration with the rest of the code and the other backends.

On the other hand, I don't recommend running on so many nodes with so little data per node that kernel launch overhead becomes the bottleneck; but then again, it is a valid use case.

If you merge the for-loop into the kernel, then each iteration of the loop will manage a different amount of data, which can itself lead to performance issues. This is precisely why I didn't do it in CUDA: calling one packing kernel at a time makes it easier to pipeline packing and sending. I can certainly see how the SYCL logic would be easier to generalize (hopefully without loss of performance), and since all-to-all doesn't pipeline anyway, we could get a performance boost here.
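For contrast, here is a minimal sketch of the per-rank pipelined pattern described above (one pack kernel per rank, each send issued as soon as its block is ready), assuming USM device pointers and a GPU-aware MPI. The function name and the trivial copy standing in for real strided packing are illustrative, not heFFTe's actual code:

```cpp
#include <mpi.h>
#include <sycl/sycl.hpp>
#include <vector>

// Hypothetical sketch: pack rank r's block, then immediately post its send,
// so the pack kernel for rank r+1 overlaps with the send for rank r.
void pack_and_send(sycl::queue &q, const float *src, float *send_buf,
                   int num_ranks, size_t block_size, MPI_Comm comm) {
    std::vector<MPI_Request> requests(num_ranks);
    for (int r = 0; r < num_ranks; ++r) {
        size_t const off = r * block_size;
        // stand-in for the real strided packing kernel for rank r's block
        sycl::event packed = q.parallel_for(sycl::range<1>(block_size),
            [=](sycl::id<1> i) { send_buf[off + i] = src[off + i]; });
        packed.wait(); // rank r's block is ready; send it while r+1 packs
        MPI_Isend(send_buf + off, static_cast<int>(block_size), MPI_FLOAT,
                  r, /*tag=*/0, comm, &requests[r]);
    }
    MPI_Waitall(num_ranks, requests.data(), MPI_STATUSES_IGNORE);
}
```

With MPI_Alltoall there is no such overlap to exploit, which is why fusing the N packs into one launch can win there without giving anything up.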
