Does the transformation support MPI parallalization? #597

sijiehuang23 · 2024-12-03T04:48:10Z

sijiehuang23
Dec 3, 2024

Hi,

I'm wondering if FINUFFT supports MPI parallelization? I have a MPI-parallelized PDE solver (written in python) that I like to incorporate FINUFFT (if possible) for interpolation in physical space. I went through the document, seems like it only supports multi-threading (such as OpenMP), instead of MPI?

ahbarnett · 2024-12-05T20:36:09Z

ahbarnett
Dec 5, 2024
Maintainer

Dear SJH, No, we do not provide MPI, since we do not know of applications with problem sizes exceeding one shared-mem node (which can do up to M=N=1e9), or since bigger HPC applications that we know of only call FINUFFT locally on each node as part of their distributed code. But many large NUFFT problems are easily parallelizable (by summing outputs from separate FINUFFT instances on each node) in at least two ways: over NU pts, and over Fourier modes (the latter requires rephasing of inputs). However, if both M and N are too large for a single node, and you want to retain the quasi-linear scaling O(M + N ln N) then you'd need an MPI spreader/interp, distributing the fine-grid array, and an MPI FFT (which exists, eg PFFT). We don't have a lot of pressure to create the former - I don't have a sense of how needed it is. I am not an MPI user myself. Let us know your problem sizes (M,N, tol, dim, etc..) and maybe that will provide motivation. Best, Alex

…

On Mon, Dec 2, 2024 at 11:49 PM SJH ***@***.***> wrote: Hi, I'm wondering if FINUFFT supports MPI parallelization? I have a MPI-parallelized PDE solver (written in python) that I like to incorporate FINUFFT (if possible) for interpolation. I went through the document, seems like it only supports multi-threading (such as OpenMP), instead of MPI? — Reply to this email directly, view it on GitHub <#597>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACNZRSQJQMUOAFGDJKADGCT2DUZZ7AVCNFSM6AAAAABS45BN2OVHI2DSMVQWIX3LMV43ERDJONRXK43TNFXW4OZXGYYDCNRQGQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

-- *-------------------------------------------------------------------~^`^~._.~' |\ Alex Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942

2 replies

sijiehuang23 Dec 7, 2024
Author

Hi @ahbarnett ,

Thanks for the reply.

My code solves for incompressible Navier-Stokes equations using pseudo-spectral Fourier-Galerkin method (so the grid is uniform in each direction). My problem size for now is $256^3\sim10^7$, which is doable on a single node for the cluster that is available to me (128 CPUs/node).

My problem at hand is trying to interpolate the resulting velocity field at $\mathcal{O}(10^{6})$ spatial locations in physical space. If I were to adopt the current FINUFFT, seems like I need to gather all distributed velocity in Fourier space to a full array on a single core and do nufft3d3 (if I'm not picking the wrong method) for interpolation in physical space. This gathering step could potentially add significant overheads.

I'm new to NUFFT, so any suggestions are appreciated!

lu1and10 Dec 7, 2024
Maintainer

seems like I need to gather all distributed velocity in Fourier space to a full array on a single core and do nufft3d3 (if I'm not picking the wrong method) for interpolation in physical space. This gathering step could potentially add significant overheads.

You probably can't avoid the mpi all to all communication anyway if you already distribute your velocity in physical space and Fourier space arrays. While if you are using one single node, you may try MPI Shared Memory Programming using MPI_Win_allocate_shared(https://docs.open-mpi.org/en/v5.0.x/man-openmpi/man3/MPI_Win_allocate_shared.3.html) so that all the processes on the same node can access the same shared memory space. You may allocate the shared memory for the array of velocity in Fourier space, it may save you some memory movement time, while it requires change your current code and it's not scaleable to distributed nodes.

If you use the MPI shared memory on single node, now all the processes have access to your shared memory Fourier mode array, each process could run one finufft call to evaluate the velocity at the spatial locations in physical space for that process. Assuming you make each mpi process binds to one cpu core, you will need to use the single threaded finufft on each process.

Alternatively you could make the physical space velocity and locations also with MPI shared memory, then just call one finufft globally(say only call finufft from rank 0). Note if you have more than 2 mpi processes, mpirun by default binds each mpi process to one socket, you could set the finufft option nthreads to the number of cores binded to rank 0 to utilize omp. You may also try launching mpi with --bind-to none to see if rank 0 can use all the cores with finufft on the single node. Not sure if bind to none will slow your whole mpi program or not.

ahbarnett · 2024-12-09T17:27:39Z

ahbarnett
Dec 9, 2024
Maintainer

Dear Sijie,

Thanks Libin for the detailed MPI ideas. I will bring it back to basics:

N and M are as in the documentation, which you should read :)
Now, your problem size is not large, so I don't know why MPI would help you.
You also will not benefit much beyond using 16 threads, since the DRAM bandwidth becomes the limiting factor.
Eg, on my (ryzen2 5700U) laptop, targeting 6-digit accuracy, 4 threads is the best:

(base) alex@ross /home/alex/numerics/finufft> OMP_NUM_THREADS=4 test/finufft3d_test 256 256 256 1e6 1e-6 0 2 1.25
test 3d type 1:
     1000000 NU pts to (256,256,256) modes in 1.11 s 	8.98e+05 NU pts/s
	one mode: rel err in F[94,66,-99] is 2.35e-08
test 3d type 2:
     (256,256,256) modes to 1000000 NU pts in 0.623 s 	1.61e+06 NU pts/s
	one targ: rel err in c[500000] is 9.19e-07
test 3d type 3:
	1000000 NU to 16777216 NU in 6.68 s         	2.66e+06 tot NU pts/s
	one targ: rel err in F[8388608] is 4.7e-06

This was using FFT=DUCC, which is faster for 3D. Also not upsampfac=1.25 not 2.0.
You should run such tests on your machine.
The above tells us many things.
Eg, avoid type 3 if possible. You mention 3d3 but you should be using 3d2 if you're interpolating from regular grid to scattered points.
On a xeon, I the 3d type 2 takes 0.5 sec at 16 threads (1 thread per core).

What is your time budget? You see without MPI your task completes in about 0.5 sec.

Best, Alex

1 reply

sijiehuang23 Dec 9, 2024
Author

Yes, I've been testing these transforms, and I can see that they are extremely fast even with a single thread on my machine, and I believe this will not be the computational bottleneck for my case (and thanks for the clarification for using type 2; I also figured that out over the past weekend).

The only reason I asked about MPI is that my PDE code runs in MPI, and I'd like to incorporate finufft into the code and perform interpolation on the fly while the code solving for PDEs. In other words, I'm not look for using MPI to speedup finufft. As I mentioned and also by @lu1and10, it looks like the only way, for now, is to perform MPI all to all communication, gather all data to a single core and then perform type 2 transform on that one.

ahbarnett · 2024-12-09T19:21:53Z

ahbarnett
Dec 9, 2024
Maintainer

Ok, good. All-to-all will not be the best (but moving your 256^3*16 = 0.25GB should take <0.1sec), but maybe for single-node you can try the contiguous RAM layout that Libin explained. For bigger problems, scaling multinode on MPI, would be to have separate single-thread 3d2 transforms, each working with a subset of the array, then you add (or merge) the answers. Exactly how that is done depends on your layout (is the real space grid distributed over nodes, or the Fourier space grid?). Best, Alex

…

On Mon, Dec 9, 2024 at 12:43 PM SJH ***@***.***> wrote: Yes, I've been testing these transforms, and I can see that they are extremely fast even with a single thread on my machine, and I believe this will not be the computational bottleneck for my case (and thanks for the clarification for using type 2; I also figured that out over the past weekend). The only reason I asked about MPI is that my PDE code runs in MPI, and I'd like to incorporate finufft into the code and perform interpolation *on the fly* while the code solving for PDEs. In other words, I'm not look for using MPI to speedup finufft. As I mentioned and also by @lu1and10 <https://github.com/lu1and10>, it looks like the only way, for now, is to perform MPI all to all communication, gather all data to a single core and then perform type 2 transform on that one. — Reply to this email directly, view it on GitHub <#597 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACNZRSSF2TIVVIRXVSUC7RT2EXJDVAVCNFSM6AAAAABS45BN2OVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTCNJRGEZTANI> . You are receiving this because you were mentioned.Message ID: ***@***.*** com>

-- *-------------------------------------------------------------------~^`^~._.~' |\ Alex Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942

1 reply

sijiehuang23 Dec 9, 2024
Author

For bigger problems, scaling multinode on MPI, would be to have separate single-thread 3d2 transforms, each working with a subset of the array, then you add (or merge) the answers. Exactly how that is done depends on your layout (is the real space grid distributed over nodes, or the Fourier space grid?).

The solution (for the PDE) arrays (in both physical and fourier space) are distributed. So is it doable (simply add/merge the results as you mentioned)?

ahbarnett · 2024-12-09T19:50:26Z

ahbarnett
Dec 9, 2024
Maintainer

You have to write out the math of the type 2 transform, to see how to handle separate blocks of Fourier space. Eg, you need to multiply the outputs by a phase if the offset of your Fourier grid is not the usual one (-N/2 <= k < N/2). This would allow smaller FFTs on each process. You could then add the outputs. Alternatively you could split by NU points, but then each process has to do a full-size FFT. This only makes sense if M >> N (which I think is not your case since M=1e6 for you but N>1e7).

…

On Mon, Dec 9, 2024 at 2:40 PM SJH ***@***.***> wrote: For bigger problems, scaling multinode on MPI, would be to have separate single-thread 3d2 transforms, each working with a subset of the array, then you add (or merge) the answers. Exactly how that is done depends on your layout (is the real space grid distributed over nodes, or the Fourier space grid?). The solution (for the PDE) arrays (in both physical and fourier space) are distributed. So is it doable (simply add/merge the results as you mentioned)? — Reply to this email directly, view it on GitHub <#597 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACNZRSSVFZTZJ3M6NTYLNKT2EXW35AVCNFSM6AAAAABS45BN2OVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTCNJRGIZTKMI> . You are receiving this because you were mentioned.Message ID: ***@***.*** com>

-- *-------------------------------------------------------------------~^`^~._.~' |\ Alex Barnett Center for Computational Mathematics, Flatiron Institute | \ http://users.flatironinstitute.org/~ahb 646-876-5942

1 reply

sijiehuang23 Dec 9, 2024
Author

Okay, thanks. I'll try to do that.

BTW, does finufft works with real FFT? That is, the uniform-grid array in Fourier space is obtained from rfftn.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Does the transformation support MPI parallalization? #597

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 5 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Does the transformation support MPI parallalization? #597

sijiehuang23 Dec 3, 2024

Replies: 4 comments · 5 replies

ahbarnett Dec 5, 2024 Maintainer

sijiehuang23 Dec 7, 2024 Author

lu1and10 Dec 7, 2024 Maintainer

ahbarnett Dec 9, 2024 Maintainer

sijiehuang23 Dec 9, 2024 Author

ahbarnett Dec 9, 2024 Maintainer

sijiehuang23 Dec 9, 2024 Author

ahbarnett Dec 9, 2024 Maintainer

sijiehuang23 Dec 9, 2024 Author

sijiehuang23
Dec 3, 2024

Replies: 4 comments 5 replies

ahbarnett
Dec 5, 2024
Maintainer

sijiehuang23 Dec 7, 2024
Author

lu1and10 Dec 7, 2024
Maintainer

ahbarnett
Dec 9, 2024
Maintainer

sijiehuang23 Dec 9, 2024
Author

ahbarnett
Dec 9, 2024
Maintainer

sijiehuang23 Dec 9, 2024
Author

ahbarnett
Dec 9, 2024
Maintainer

sijiehuang23 Dec 9, 2024
Author