Negative densities / other crashes with latest master #53
Comments
@JBorrow do you have the git SHA of an earlier revision where this worked? |
Nothing helpful, the version I usually use is quite old. |
This is most probably the same issue underlying #37. It should in principle be the same problem, except that now it's caught earlier and with a different error message (one added since #37 was reported, to help pick up this issue earlier at runtime and with a more meaningful explanation). Please check whether the workaround described in the last messages of that ticket (disabling OpenMP-based FOF) removes the problem. Note also that this seems to be a problem only when compiling without MPI support. @JBorrow, you say the version you usually use is quite old, and that in itself could be helpful: if you could report here what exact version that was, it would help find the real underlying issue. |
I am not entirely sure it is a duplicate of #37, but it might be. The reason I say this is that the code runs with fee3a9f but breaks on cb4336d, and both of these are fairly recent versions. The error message is
If you need full info, the first one was compiled here:
Both cases were compiled with Intel 2020. Both have Input is
Happy to copy anything over to other places or give you more info if needed. |
@MatthieuSchaller interesting... if a previous build worked and this one doesn't, then I introduced a regression. My guess is that the diagnosis is correct; i.e., there are particles with non-positive densities, leading to the error. I'm happy to revert the change that adds the check for non-positive densities. That would remove this regression, but OTOH it could mean we are doing wrong calculations silently. |
Yes, agreed, it's not ideal. Just thought this might help track things down somehow. |
@JBorrow does the latest master without OMP but with MPI give you an alternative that is fast enough? |
@JBorrow @MatthieuSchaller please see #37 (comment) for a very likely potential fix for this problem. If that fix makes this issue disappear then we can close this ticket. |
I'm trying this out in |
I'm commenting here to leave a trail: in #60 yet another instance of this problem has been reported. Details can be found throughout the comments, including build configuration details. |
Another instance of this issue with hydro runs was reported in #37 (comment) for a zoom, non-MPI execution. |
@MatthieuSchaller with the recent surge of more cases of errors due to negative densities (a check I added as part of #37), I'm starting to think maybe I'll turn the check off, or at least turn it into a warning instead of an unrecoverable error, or even make this behavior configurable. In #53 (comment) I argued that invalid results might end up being generated silently due to this, but otherwise the check seems to be affecting too many areas. Thoughts? Edit: I just double-checked with Pascal, and he confirmed this check is correct; i.e., all particles at this point of the code should have properly-defined densities. |
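As a rough illustration of the trade-off discussed above, here is a minimal C++ sketch of what a configurable check could look like. The Options struct, the fatal_density_check option name and check_density are assumptions made for this example only; they are not the actual VR code.

    #include <cstdlib>
    #include <iostream>

    // Assumed option structure; VR's real configuration type differs.
    struct Options {
        bool fatal_density_check = true;  // abort on bad densities vs. only warn
    };

    // Validate one particle's density; returns false if the value is invalid.
    bool check_density(const Options &opt, double density, long long pid)
    {
        if (density > 0) return true;
        std::cerr << "Particle " << pid << " has non-positive density "
                  << density << "\n";
        if (opt.fatal_density_check) std::abort();  // current behaviour: unrecoverable error
        return false;                               // proposed alternative: warn and continue
    }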
That would be a worry then. If we can get to that point and find particles that have negative densities, it means something went wrong. I suppose if we know it only happens when OpenMP is on, then we can close this issue and have it taken over by the wider OMP issue. |
@MatthieuSchaller do you know of any (hopefully small) dataset + configuration file that can be used to consistently reproduce this error? I went through all the reported occurrences and either the original input and configuration files are gone, or the error occurred as part of a SWIFT + VR run, so I can't find an easy way to actually trigger it. I can try to reason a bit more about the code and try to advance like that, but having a small, reproducible example would be great. |
That's unfortunate. Let me dig out some older test case that was backed up. |
Sorry @rtobar; my config at /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs/Run_0 crashes. This just happened now, using the latest master. |
Could you expand on the modules and cmake flags used? |
|
Can you try disabling OpenMP? |
Yeah, it works fine if you run with MPI only. That's an okay fix for now - it's actually faster for my 25s - but the behaviour should not be different between this and OMP. |
Ok, that's interesting. Some of the cases above were breaking both with and without OMP. |
Okay - I was wrong - I get the negative densities in pure MPI only mode (/cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/Run_0, snapshot 0000). |
Do we have a way of knowing what the particle is, or what object it belongs to, to try to understand what the offending setup is? |
Yes, in hindsight I should have added more information to the error message. I'll do that and push a commit straight to the branch. BTW, I tried replicating the issue with the data, settings and compilation flags used by @JBorrow, but without success so far. I tried on a local system with 20 cores, trying a few combinations of OpenMP threads and MPI ranks (1x20, 3x6, 20x1), and none of them produced the error. Maybe I missed something obvious so I'll re-try tomorrow, but otherwise this will be a bit harder to track down than anticipated. I also tried briefly to reproduce it on cosma, but the queues were full and I was trying to go for an interactive job -- next time I'll just submit a job instead. |
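For reference, this is the kind of extra detail such an error message could include (particle ID, position and host structure). The function name and signature below are hypothetical, written only to illustrate the idea; they are not the actual commit.

    #include <sstream>
    #include <string>

    // Hypothetical helper: build a descriptive error string for an offending particle.
    std::string describe_bad_particle(long long pid, const double pos[3],
                                      long long host_structure, double density)
    {
        std::ostringstream os;
        os << "non-positive density " << density << " for particle " << pid
           << " at (" << pos[0] << ", " << pos[1] << ", " << pos[2] << "),"
           << " belonging to structure " << host_structure;
        return os.str();
    }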
Thanks. This is small enough to run on the login nodes if that helps (though I will deny having ever said that when questioned by the admin police...) |
Could it be the fact I'm using Intel 2018? |
Worth checking. I'd think the MPI implementation should make no difference but the compiler may come packed with a different version of OMP. |
A small update: with the smaller test case mentioned in #53 (comment), I found that the particle in question has a non-positive density. I was also able to reproduce the error earlier in the code, so instead of it failing in |
Interesting. Could this particle be far from the centre of its object and hence be part of the spherical over-density but not of any actual sub-structure? |
The case |
I could finally dedicate some more time today to this problem. As mentioned previously in #53 (comment), some particles end up with a non-positive density. Using the smaller reproducible example mentioned in #53 (comment) I've now observed how this plays out. When MPI is enabled, density calculation is a two-step process:
Note that the extra overlapping checks and particle communication happen only if a particle's search region overlaps another MPI domain. The problem I've reproduced happens when, during the second step, particle information is exchanged but a rank doesn't receive any particle information. In such cases the code explicitly skips any further calculations, and any densities that should have been computed in that step are left unset. I'm not really sure what's the best thing to do here. @pelahi, given the situation described above, would you be able to point us in the right direction? This analysis is consistent with at least the MPI-enabled crash @JBorrow reported in the comment above. In there one can see:
The giveaway here is the
The crash in the non-MPI, OpenMP case might be yet a different problem. |
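To make the MPI-enabled failure mode described above concrete, here is a minimal sketch of the pattern (function and variable names are assumptions, not the actual VR source): if the MPI-exchange step bails out early when a rank imports no particles, the local particles that depended on that step are never updated and keep a zero density, which is what later trips the non-positive-density check. The patch mentioned further down ensures these particles still end up with a computed density.

    #include <vector>

    // Hypothetical second step of the density calculation on one MPI rank.
    // 'imported' holds particle data received from other ranks.
    void mpi_density_step(const std::vector<double> &imported,
                          std::vector<double> &density)
    {
        // Buggy pattern: returning here when nothing was imported also skips
        // the update of local particles, leaving density[i] == 0 for them.
        if (imported.empty()) return;

        for (std::size_t i = 0; i < density.size(); ++i) {
            // ... combine local and imported neighbour contributions into density[i] ...
        }
    }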
Hi @rtobar , I'll look into this later today. I think it is likely a simple logic fix. |
Thanks to @pelahi there is now a new commit on the branch. With this patch I can now run the small defective test case until the end, meaning that all particles have their densities calculated. I also tried reproducing the last crash (the MPI case) reported by @JBorrow, and with the patch I can get past the original problem. Later on during the execution the code crashes again, but that seems like an unrelated issue (more like #71 or #73, but I haven't dug into it). With this latest fix I feel fairly confident the underlying problem is finally gone for the MPI-enabled cases, but there still remains the MPI-disabled problem @JBorrow is having, which I think has happened in a few other places too. So while I'll put the latest fix on the |
I like the sound of this! |
Thanks for your hard work! |
Just a quick update: I reproduced the MPI-disabled crash that @JBorrow experienced in cosma6 with the same dataset. Here's a small backtrace:
No surprises here, this is where the code has always been crashing. However with the latest changes that have been integrated into the |
Another clue here, from running on COSMA-8 (which has 128 cores/node): the code was far more likely to crash with this problem when running on all 128 cores (rather than the 32 that I resubmitted with) in the MPI-only configuration. |
Hi, new to the party, but I'm finding this behaviour for commit
|
Welcome to the party... The current "workaround" to just get somewhere with production runs is to toggle MPI and OMP on/off. One combination might be lucky... Apart from that, your simulation is likely quite small, so could you give the directory on cosma, as well as the config file? |
Thanks, configuring with MPI on seems to be working fine for me now.
Yes, the zooms seem to have a high hit-rate with this bug (3/3 of the last ones I've tried); the last one I was using is here for now: |
@james-trayford since the introduction of the "non-positive density" check we have found at least three different points in the code that were causing this issue in different ways:
I'd assume your issue is the third, and that by using MPI you reduced the number of OpenMP threads on each rank, thus avoiding the issue. An alternative workaround (I think, not fully certain) would be to set |
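If limiting the OpenMP thread count is the workaround chosen, the standard mechanism is the OMP_NUM_THREADS environment variable or omp_set_num_threads(); this is plain OpenMP usage, not anything specific to VR, and a tiny check program looks like this:

    #include <omp.h>
    #include <cstdio>

    int main()
    {
        // Cap the number of OpenMP threads used by parallel regions;
        // setting OMP_NUM_THREADS=4 in the environment has the same effect.
        omp_set_num_threads(4);
        #pragma omp parallel
        {
            #pragma omp single
            std::printf("running with %d OpenMP threads\n", omp_get_num_threads());
        }
        return 0;
    }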
Hi @james-trayford, I will have a look using your snapshots. Any chance you could put them somewhere I could access them? @MatthieuSchaller should I email Adrian to see if I can access cosma now that I have changed institutions? |
Do you still have access to gadi? I can copy things there, the setup is fairly small. |
Sadly no, but I can request access again. |
Ok. Might be easier to revive your cosma account then. Feel free to email Adrian and cc me in. |
Is there anything I can do here to help with this issue? Tests? narrowing down of a use case? etc. |
Hi @james-trayford, could you provide the config options you ran with? I am not encountering the error but it could be something specific. How many MPI ranks and OMP threads did you run with? |
I have also recently just encountered this. DMO zoom simulation, no MPI. Has anyone made any progress with this? Can I help? |
@stuartmcalpine unfortunately no new progress has been made here. I think the best summary of the situation is #53 (comment), where a workaround (not a great one, but should work) is suggested. Older comments go into all the details on how this story has unfolded... In order to fix this problem the best would be to find a small, quickly-reproducible example. As mentioned in the comment, this seems to be a problem with domain decomposition on high CPU counts, but that's a guess. In any case having the full details of your failure would definitely be a gain. |
I am doing some zoom runs, no MPI, on-the-fly with SWIFT. For these tests, cosma 8 and 128 threads. The DMO version segfaults on the 4th invocation, and the hydro version later, like the 10th. But the DMO is consistently failing at the same place.

module load intel_comp/2018 intel_mpi/2018 fftw/3.3.7

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-fPIC" -DVR_ZOOM_SIM=ON -DVR_MPI=OFF -DVR_MPI_REDUCE=OFF -DVR_USE_SWIFT_INTERFACE=ON ..

Config file: vrconfig_3dfof_subhalos_SO_dmo.txt

Last bit of log:
[1726.081] [debug] search.cxx:3982 Getting Hierarchy 23

Line where it segfaults: |
@stuartmcalpine your crash actually looks like a different problem, not the one discussed in this GH issue -- but of course that doesn't mean it's not important to get it fixed. Could you open a new issue with the details please so it doesn't get lost? A descriptive title and a link to your comment above should suffice, there's no need to duplicate the whole text/attachments. For reference, this looks strikingly similar to an issue reported in #78 and fixed in 53c0289. It seems here the same situation is happening: |
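Purely as a generic illustration of that kind of out-of-bounds access (assumed names; this is neither the actual VR source nor the fix in 53c0289): indexing a per-group array with an ID that can exceed its size reads past the end and can segfault, while a bounds check turns it into a recoverable condition.

    #include <vector>

    // Hypothetical lookup into a per-group array; in the scenario described
    // above, gid can legitimately exceed the size of the local array.
    double hierarchy_level(const std::vector<double> &levels, long long gid)
    {
        if (gid < 0 || gid >= static_cast<long long>(levels.size()))
            return -1.0;  // guarded: return a sentinel instead of reading out of bounds
        return levels[gid];
    }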
Describe the bug
I've been trying to run the latest(ish) master of VR on some SWIFT outputs (on COSMA7), and I've been getting a couple of odd crashes.
To Reproduce
Version cb4336d.
Ran on snapshots under
/snap7/scratch/dp004/dc-borr1/new_randomness_runs/runs/Run_*
Log files
STDOUT:
STDERR:
Environment (please complete the following information):
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=native" -DVR_MPI=OFF -DVR_HDF5=ON -DVR_ALLOWPARALLELHDF5=ON -DVR_USE_HYDRO=ON