mpi4py CI failures on main, v5.0.x #12940
Comments
It is worth noting that the test program is spawning on MPI_COMM_SELF. The error message is actually coming from the TCP BTL: https://github.com/open-mpi/ompi/blob/main/opal/mca/btl/tcp/btl_tcp_endpoint.c#L588-L590
This suggests that there might be some kind of race condition happening in the BTL disconnect teardown.
I would first suggest reverting #12920
Look at the failure - it had nothing to do with that PR. You'll also have seen other PRs failing mpi4py at that time, with these same random failures - one was just fixing a typo in the docs. Still, you are welcome to revert and try again.
I don't know, just look at https://github.com/open-mpi/ompi/pulls?q=is%3Apr+is%3Aclosed and the one PR that jumps out as being merged in with mpi4py failures on main is #12920
Correct - we didn't merge the others. They are still sitting there with errors, going nowhere.
FWIW: I cannot get that test to run at all with head of PMIx/PRRTE master branches - it immediately segfaults with bad returned data from the "lookup" operation. If I change

Can't comment on the correctness of the test - but I suspect OMPI's dpm code really cannot handle the scenario of all parent procs calling with "comm_self". All our tests to-date have had one parent proc (rank=0) doing the spawn. 🤷♂️ Just noting the difference. Probably not something I'll further pursue as the plan is to remove the pub/lookup operation anyway.
Interesting - decided to add a check for NULL return of the lookup data so this test wouldn't segfault and found that it runs perfectly fine with head of PMIx/PRRTE master branches and with

I then found that I can make the entire thing work without oversubscription if I simply add a 1 second delay in the loop over

Someone who cares could investigate the root cause of the NULL return. Could be in OPAL's pmix_base_exchange function or in the PRRTE data server. There is a timeout in there, so it could be that stress causes the timeout to fire - or maybe we don't clean up a timeout event fast enough and it erroneously fires. 🤷♂️
I added a little debug and found that OMPI is calling publish/lookup (via the

So I added a test case for PMIx that simply does a tight loop over Publish/Lookup between two procs to try and emulate what OMPI is doing - and it works perfectly for as many iterations as I care to run it.

So this appears to be something wrong on the OMPI side, probably in the "nextcid" black hole. Afraid I have to leave that to you guys.
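The PMIx test case itself isn't reproduced here, but as a rough MPI-level analogue of the same pattern, a tight publish/lookup loop between two ranks might look like the sketch below. The service-name scheme, port handling, iteration count, and program name are assumptions for illustration, not taken from the actual test.

```c
/* Rough MPI-level analogue of a tight publish/lookup loop between two procs.
 * Run with two ranks, e.g. "mpirun -n 2 ./publookup" (names are hypothetical). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char port[MPI_MAX_PORT_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 1000; i++) {
        char service[64];
        snprintf(service, sizeof(service), "demo-svc-%d", i);   /* assumed key scheme */

        if (0 == rank) {
            MPI_Open_port(MPI_INFO_NULL, port);
            MPI_Publish_name(service, MPI_INFO_NULL, port);
        }
        MPI_Barrier(MPI_COMM_WORLD);        /* ensure publish happens before lookup */

        if (1 == rank) {
            char found[MPI_MAX_PORT_NAME];
            MPI_Lookup_name(service, MPI_INFO_NULL, found);
        }
        MPI_Barrier(MPI_COMM_WORLD);        /* ensure lookup happens before unpublish */

        if (0 == rank) {
            MPI_Unpublish_name(service, MPI_INFO_NULL, port);
            MPI_Close_port(port);
        }
    }

    MPI_Finalize();
    return 0;
}
```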
I'll take a look at this.
FWIW: the test program won't compile on either of my systems. You cannot have a variable in the declaration for the "commands" array. Minor nit.
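For context, the C rule being hit here is that a variable-length array may not have an initializer. The snippet below is a hypothetical illustration (the variable name, count, and child command are made up), along with one way to work around it:

```c
/* Hypothetical illustration of the compile error, not the original program. */
#include <stdio.h>

int main(void)
{
    int nspawns = 4;                          /* assumed variable controlling the array size */

    /* char *commands[nspawns] = { "./child" };   <-- rejected by the compiler:
     *     "variable-sized object may not be initialized"                        */

    char *commands[nspawns];                  /* declaring the VLA by itself is fine (C99) */
    for (int i = 0; i < nspawns; i++) {
        commands[i] = "./child";              /* assumed child program name */
    }

    printf("commands[0] = %s\n", commands[0]);
    return 0;
}
```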
For fun and frolic I tried this with v4.1.x and got this error:
Looks like this may be related to #10895
As the title says, we've been seeing some mpi4py CI failures on main and v5.0.x recently.

C reproducer
I've managed to reproduce the spawn test failures locally on my mac. The problem is that they're non-deterministic. 🙁
I've written a short C reproducer. It only seems to trip the error — sometimes! — when we run a bunch of Comm spawns in a single process.
Compile and run it with:
If I run this a few times, it will definitely fail at least once.
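For readers without the attached program, a minimal sketch of this kind of spawn loop might look like the following. This is not the actual reproducer: the child command "./child", the iteration count, and spawning a single child per iteration are assumptions, and the spawned child (not shown) would need to call MPI_Comm_get_parent and MPI_Comm_disconnect on its side.

```c
/* Minimal sketch of a repeated spawn/disconnect loop -- NOT the actual reproducer. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;

    MPI_Init(&argc, &argv);

    for (int i = 0; i < 20; i++) {
        /* Spawn one child on MPI_COMM_SELF, then tear the connection down */
        MPI_Comm_spawn("./child", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
        MPI_Comm_disconnect(&intercomm);
    }

    MPI_Finalize();
    return 0;
}
```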
Supplemental detail
Sometimes the mpi4py tests all succeed (!). Sometimes one of the spawn tests randomly fails.
If you want to see the failure in the original mpi4py test suite, the good news is that there is a pytest command to rapidly re-run just the spawn tests. I find that this command fails once every several iterations:
The -k CommSpawn is the selector — it runs any test that includes CommSpawn in the name (I think it's case sensitive...?). This ends up only being 16 tests (out of the entire mpi4py test suite) and when it succeeds, it only takes 2-3 seconds.

Here's a sample output from an mpi4py test that fails (it's not always this test):