Running with VR_MPI_REDUCE=OFF, VR_USE_HYDRO=ON crashes #56
tl;dr: the cause of this issue is a mismatch between the logic that distributes particles to worker ranks during HDF5 reading and the amount of memory reserved on those ranks for receiving particle data. The workaround is to avoid using VR_MPI_REDUCE=OFF.

Before particle data is read, VR works out how many particles will need to be read, how they will be distributed across ranks, and allocates the memory on each rank to hold them. This calculation yields different results with VR_MPI_REDUCE=ON and VR_MPI_REDUCE=OFF. The main difference is in how the MPI domain is divided, with a different number of particles associated to each rank in the two cases (578257/1248987).

Later on, the HDF5 reading code reads particle data in ~50MB chunks (312500 Particle objects), which are then sent to the worker rank that will process them. However, the number of chunks that are read and sent to the worker ranks doesn't seem to take into account the MPI decomposition information from above, which leads to the error with VR_MPI_REDUCE=OFF. This doesn't happen with VR_MPI_REDUCE=ON.
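To make the mismatch concrete, here is a minimal sketch of the failure mode; this is not VR's actual code, and names such as `receive_chunks`, `nLocalFromDecomposition` and `CHUNK_SIZE` are purely illustrative:

```c++
// Sketch of the mismatch: the receive buffer is sized from the decomposition
// estimate, while the reader sends fixed-size chunks without consulting it.
#include <cstddef>
#include <stdexcept>
#include <vector>

struct Particle { double x, y, z; /* ... */ };

constexpr std::size_t CHUNK_SIZE = 312500;   // ~50MB worth of Particle objects

void receive_chunks(std::vector<Particle> &parts,
                    std::size_t nLocalFromDecomposition,
                    std::size_t nChunksSentByReader)
{
    // Memory reserved up-front from the decomposition estimate.
    parts.resize(nLocalFromDecomposition);

    std::size_t offset = 0;
    for (std::size_t c = 0; c < nChunksSentByReader; ++c) {
        // If the reader chose its chunk count independently of the
        // decomposition, this can exceed the reserved size.
        if (offset + CHUNK_SIZE > parts.size())
            throw std::length_error("more particles received than reserved");
        // MPI_Recv(&parts[offset], CHUNK_SIZE * sizeof(Particle), MPI_BYTE, ...);
        offset += CHUNK_SIZE;
    }
}
```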
I will try to investigate this further and will post any further updates here.
After further reading, I learned that the HDF5 reading functions inspect the MPI decomposition information (the global particle count variables such as Nlocal) computed before reading starts. With VR_MPI_REDUCE=ON this is calculated with:

```c++
MPINumInDomain(opt);
```

On the other hand, with VR_MPI_REDUCE=OFF the following is used instead:

```c++
MPIDomainExtent(opt);
MPIDomainDecomposition(opt);
Nlocal=nbodies/NProcs*MPIProcFac;
Nmemlocal=Nlocal;
Nlocalbaryon[0]=nbaryons/NProcs*MPIProcFac;
Nmemlocalbaryon=Nlocalbaryon[0];
NExport=NImport=Nlocal*MPIExportFac;
```

In particular, I tried a quick fix, basically resizing the original vector of Particles when more-than-expected particles are received via MPI; this however only moved the problem further down the line, as expected, given that the particle counts are used in other contexts.
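For illustration, a resize-on-receive change of this kind would look roughly as follows; this is only a sketch of the idea with hypothetical function and variable names, not the actual patch that was tried:

```c++
// Sketch only: grow the destination vector when an incoming MPI message
// carries more particles than the pre-computed estimate left room for.
#include <mpi.h>
#include <cstddef>
#include <vector>

struct Particle { double x, y, z; /* ... */ };

void receive_particles(std::vector<Particle> &parts, std::size_t &nStored,
                       int source, int tag, MPI_Comm comm)
{
    MPI_Status status;
    MPI_Probe(source, tag, comm, &status);

    int nbytes = 0;
    MPI_Get_count(&status, MPI_BYTE, &nbytes);
    std::size_t nIncoming = nbytes / sizeof(Particle);

    // The "quick fix": grow past the original allocation if needed.
    if (nStored + nIncoming > parts.size())
        parts.resize(nStored + nIncoming);

    MPI_Recv(parts.data() + nStored, nbytes, MPI_BYTE, source, tag, comm, &status);
    nStored += nIncoming;
}
```

As noted above, growing the vector alone is not enough, because Nlocal and the other counts are consulted elsewhere in the code.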
On a separate point, I still don't have a clear idea of what the effect of the VR_MPI_REDUCE option is meant to be.
As described in #54 (comment) by @MatthieuSchaller:
This issue is to keep track of the last sentence. Indeed, when running with -DVR_MPI_REDUCE=OFF the following crash happens: