mpi4py CI failures on main, v5.0.x #12940

Open · jsquyres (Member) opened this issue Nov 27, 2024 · 12 comments

As the title says, we've been seeing some mpi4py CI failures on main and v5.0.x recently.

C reproducer

I've managed to reproduce the spawn test failures locally on my mac. The problem is that they're non-deterministic. 🙁

I've written a short C reproducer. It only seems to trip the error — sometimes! — when we run a bunch of Comm spawns in a single process.

// C version of an mpi4py test, blatantly stolen and converted to C
// from
// https://github.com/mpi4py/mpi4py/blob/master/test/test_spawn.py#L205-L217

#include <stdio.h>
#include <mpi.h>

void do_child(MPI_Comm parent)
{
    MPI_Barrier(parent);
    MPI_Comm_disconnect(&parent);
}

void do_parent(char *argv[])
{
    const int count = 3;
    char *commands[count] = { argv[0], argv[0], argv[0] };
    int maxprocs[3] = { 1, 1, 1 };
    MPI_Comm child;
    int errcodes[3];
    MPI_Info infos[] = { MPI_INFO_NULL, MPI_INFO_NULL, MPI_INFO_NULL };
    MPI_Comm_spawn_multiple(count, commands, MPI_ARGVS_NULL,
                            maxprocs, infos, 0,
                            MPI_COMM_SELF, &child,
                            errcodes);
    
    int local_size, remote_size;
    MPI_Comm_size(child, &local_size);
    MPI_Comm_remote_size(child, &remote_size);
    MPI_Barrier(child);
    MPI_Comm_disconnect(&child);
    MPI_Barrier(MPI_COMM_SELF);

    if (local_size != 1) {
        printf("WARNING: local_size == %d, expected 1\n", local_size);
    }
    if (remote_size != count) {
        printf("WARNING: remote_size == %d, expected %d\n",
               remote_size, count);
    }
}

int main(int argc, char* argv[])
{
    MPI_Init(NULL, NULL);
    MPI_Barrier(MPI_COMM_SELF);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);
    if (parent == MPI_COMM_NULL) {
        for (int i = 0; i < 32; ++i) {
            do_parent(argv);
        }
    } else {
        do_child(parent);
    }

    MPI_Barrier(MPI_COMM_SELF);
    MPI_Finalize();
    return 0;
}

Compile and run it with:

mpicc -g mpi4py-comm-spawn-defaults1.c -o mcsd
mpirun --mca rmaps_default_mapping_policy :oversubscribe -n 2 mcsd

If I run this a few times, it will definitely fail at least once.

Supplemental detail

Sometimes the mpi4py tests all succeed (!). Sometimes one of the spawn tests randomly fails.

If you want to see the failure in the original mpi4py test suite, the good news is that there is a pytest command to rapidly re-run just the spawn tests. I find that this command fails once every several iterations:

mpirun --mca rmaps_default_mapping_policy :oversubscribe -n 2 python3 test/main.py -v -f -k CommSpawn

The -k CommSpawn option is the selector: it runs any test whose name includes CommSpawn (I think it's case sensitive...?). This ends up being only 16 tests (out of the entire mpi4py test suite), and when it succeeds, it only takes 2-3 seconds.

Here's a sample output from an mpi4py test that fails (it's not always this test):

testCommSpawnDefaults1 (test_spawn.TestSpawnMultipleSelfMany.testCommSpawnDefaults1) ... [JSQUYRES-M-4LRP:00000] *** An error occurred in Socket closed
[JSQUYRES-M-4LRP:00000] *** reported by process [1182269441,0]
[JSQUYRES-M-4LRP:00000] *** on a NULL communicator
[JSQUYRES-M-4LRP:00000] *** Unknown error
[JSQUYRES-M-4LRP:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[JSQUYRES-M-4LRP:00000] ***    and MPI will try to terminate your MPI job as well)
@jsquyres (Member, Author) commented Nov 27, 2024:

It is worth noting that the test program is spawning on MPI_COMM_SELF, meaning that each of the 2 processes in MPI_COMM_WORLD calls spawn_multiple (i.e., each launches 3 MPI processes). I have 12 cores, so running 8 processes still undersubscribes my laptop, but it does mean a lot of control messages flying around to set up and tear down two sets of 3 processes. Indeed, if I change the test to spawn on MPI_COMM_WORLD, the test fails far less frequently (i.e., fewer control messages flowing around). I don't know if this means anything, but it is an interesting data point.
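For illustration, here is a minimal sketch of that variant, assuming the reproducer above (do_parent_world is a hypothetical name; only the communicator and the collective calling pattern change):

// Hypothetical variant of do_parent() from the reproducer above: spawn
// collectively over MPI_COMM_WORLD (root 0) instead of each rank spawning
// on MPI_COMM_SELF, so only one set of 3 children is launched per
// iteration and far fewer connect/disconnect control messages are in flight.
void do_parent_world(char *argv[])
{
    char *commands[3] = { argv[0], argv[0], argv[0] };
    int maxprocs[3] = { 1, 1, 1 };
    MPI_Info infos[3] = { MPI_INFO_NULL, MPI_INFO_NULL, MPI_INFO_NULL };
    int errcodes[3];
    MPI_Comm child;

    // Collective over MPI_COMM_WORLD; the arguments describing the
    // children are significant only at root 0.
    MPI_Comm_spawn_multiple(3, commands, MPI_ARGVS_NULL, maxprocs, infos,
                            0, MPI_COMM_WORLD, &child, errcodes);

    MPI_Barrier(child);
    MPI_Comm_disconnect(&child);
}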

The error message is actually coming from the TCP BTL: https://github.com/open-mpi/ompi/blob/main/opal/mca/btl/tcp/btl_tcp_endpoint.c#L588-L590

This suggests that there might be some kind of race condition happening in the BTL disconnect teardown.

@hppritcha (Member) commented:

I would first suggest reverting #12920 and seeing if that helps. Looking at our merged PRs, that one was merged in with an mpi4py failure.

@rhc54 (Contributor) commented Nov 27, 2024:

Look at the failure - it had nothing to do with that PR. You'll also have seen other PRs failing mpi4py at that time, with these same random failures - one was just fixing a typo in the docs. Still, you are welcome to revert and try again.

@hppritcha (Member) commented:

I don't know, just look at

https://github.com/open-mpi/ompi/pulls?q=is%3Apr+is%3Aclosed

and the one PR that jumps out as being merged in with mpi4py failures on main is #12920

@rhc54 (Contributor) commented Nov 27, 2024:

Correct - we didn't merge the others. They are still sitting there with errors, going nowhere.

@rhc54 (Contributor) commented Nov 27, 2024:

FWIW: I cannot get that test to run at all with head of PMIx/PRRTE master branches - it immediately segfaults with bad returned data from the "lookup" operation. If I change MPI_COMM_SELF to MPI_COMM_WORLD, then everything runs fine - repeatedly.

Can't comment on the correctness of the test, but I suspect OMPI's dpm code really cannot handle the scenario of all parent procs calling with "comm_self". All our tests to date have had one parent proc (rank=0) doing the spawn.

🤷‍♂️ Just noting the difference. Probably not something I'll further pursue as the plan is to remove the pub/lookup operation anyway.

@rhc54 (Contributor) commented Nov 28, 2024:

Interesting - I decided to add a check for a NULL return of the lookup data so this test wouldn't segfault, and found that it runs perfectly fine with the head of the PMIx/PRRTE master branches and with MPI_COMM_SELF... so long as I allow oversubscription! If I don't explicitly permit oversubscription, then I hit the "NULL return" error that would otherwise cause the segfault.

I then found that I can make the entire thing work without oversubscription if I simply add a 1 second delay in the loop over do_parent. So this appears to just be a case of overwhelming the pub/lookup code path such that it winds up returning a NULL, and not having adequate error detection to deal with that case.
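For illustration, a minimal sketch of that change, assuming the reproducer above (the sleep() call, from unistd.h, is the only addition):

#include <unistd.h>   // for sleep()

    // In the reproducer's main(): throttle the spawn loop so the
    // publish/lookup path is not overwhelmed.
    if (parent == MPI_COMM_NULL) {
        for (int i = 0; i < 32; ++i) {
            do_parent(argv);
            sleep(1);   // 1-second delay between spawn iterations
        }
    } else {
        do_child(parent);
    }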

Someone who cares could investigate the root cause of the NULL return. Could be in OPAL's pmix_base_exchange function or in the PRRTE data server. There is a timeout in there, so it could be that stress causes the timeout to fire - or maybe we don't clean up a timeout event fast enough and it erroneously fires. 🤷‍♂️

@rhc54 (Contributor) commented Nov 30, 2024:

I added a little debug and found that OMPI is calling publish/lookup (via the opal_pmix_base_exchange function) multiple times for every spawn operation. Tracked it down to some strange operations going on in your "nextcid" code - looks like a true Rube Goldberg over there and I'm not going into that death spiral.

So I added a test case for PMIx that simply does a tight loop over Publish/Lookup between two procs to try and emulate what OMPI is doing - and it works perfectly for as many iterations as I care to run it.
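For illustration, a rough sketch of that kind of publish/lookup stress loop using the public PMIx client API (this is not the actual test case; the key names, iteration count, and two-rank assumption are all illustrative):

// Two copies of this program, launched under a PMIx-enabled launcher,
// repeatedly publish a per-iteration key and look up the peer's key.
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) return 1;

    uint32_t peer = (0 == myproc.rank) ? 1 : 0;
    char key[PMIX_MAX_KEYLEN], peerkey[PMIX_MAX_KEYLEN];

    for (int i = 0; i < 1000; ++i) {
        // Publish a key unique to this rank and iteration.
        snprintf(key, sizeof(key), "stress-%u-%d", myproc.rank, i);
        pmix_info_t info;
        PMIX_INFO_CONSTRUCT(&info);
        PMIX_INFO_LOAD(&info, key, "hello", PMIX_STRING);
        PMIx_Publish(&info, 1);
        PMIX_INFO_DESTRUCT(&info);

        // Look up the peer's key for the same iteration, retrying while
        // the peer has not yet published it.
        snprintf(peerkey, sizeof(peerkey), "stress-%u-%d", peer, i);
        pmix_pdata_t pdata;
        PMIX_PDATA_CONSTRUCT(&pdata);
        PMIX_LOAD_KEY(pdata.key, peerkey);
        pmix_status_t rc;
        do {
            rc = PMIx_Lookup(&pdata, 1, NULL, 0);
        } while (PMIX_ERR_NOT_FOUND == rc);
        if (PMIX_SUCCESS != rc) {
            fprintf(stderr, "rank %u: lookup failed at iteration %d: %s\n",
                    myproc.rank, i, PMIx_Error_string(rc));
            break;
        }
        PMIX_PDATA_DESTRUCT(&pdata);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}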

So this appears to be something wrong in the OMPI side, probably in the "nextcid" black hole. Afraid I have to leave that to you guys.

@hppritcha self-assigned this Dec 2, 2024

@hppritcha (Member) commented:

I'll take a look at this.

@rhc54 (Contributor) commented Dec 2, 2024:

FWIW: the test program won't compile on either of my systems. You cannot have a variable in the declaration for the "commands" array. Minor nit.
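(One way to make that declaration portable, for reference: use a true compile-time constant for the bound. In C, a const int is not a constant expression, so the array becomes a VLA, and a VLA cannot have an initializer.)

// Hypothetical fix for the reproducer's do_parent():
enum { COUNT = 3 };
char *commands[COUNT] = { argv[0], argv[0], argv[0] };
int maxprocs[COUNT]   = { 1, 1, 1 };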

@hppritcha (Member) commented:

For fun and frolic I tried this with v4.1.x and got this error:

(ompi-docs-venv) pn2308116:/Users/hpp/ompi/examples (v4.1.x)$mpirun --mca rmaps_default_mapping_policy :oversubscribe -n 2 ./issue_12940
[pn2308116][[13337,2],0][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (54)
[pn2308116][[13337,1],0][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (54)
[pn2308116][[13337,3],0][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (54)
[pn2308116][[13337,1],1][btl_tcp.c:559:mca_btl_tcp_recv_blocking] recv(18) failed: Connection reset by peer (54)

@hppritcha (Member) commented:

Looks like this may be related to #10895.
