Fix test flakiness #388
base: master
Conversation
Have you been able to reproduce the bug in a local environment, or are you still relying on CI to produce it?
I'm able to reproduce locally, but not as frequently as it occurs in CI (possibly just because CI runs 3x as many jobs).
Force-pushed from 3d1b41b to 2a99b06
Suggestion from @nothingmuch: make the OHTTP relay and payjoin directory port parameters optional, and simply find a free port during initialization if one is not provided.
Found an issue that explains why sharing infrastructure across tests isn't working: tokio-rs/tokio#2374. The approach of sharing a global connection pool seems to be generally discouraged by the tokio maintainers... TL;DR:

Summarizing the problem: each v2 test initializes an OHTTP relay and payjoin directory. These processes run on a random "free" port selected by … I tried the following approaches, none of which fix the issue:
Other things I considered:
I've spent what feels like too many hours on this and am truly stumped... Happy to take another stab if anyone has new suggestions, but until then I'm putting this on the back burner.
It can be painful to be stuck on something for a long time. Clearly what you've done so far is valuable and you've put lots of thought into the approach and potential remedies. I was mostly wondering why @nothingmuch's comment didn't work, and took a stab at it myself (to no avail).
And the problem with this is that it's easy to get a deadlock passing both a handle and a port up from payjoin-directory. I'm sure there's a way to make this work, but my screwing around with it did not readily figure it out. The GET request in the wait_for_service_ready loop seemed never to be able to …

In the meantime, might you share any scripts you used to test the tests, so we can pick this up at a time of rest? @0xBEEFCAF3 just shared today how stoked he was on the clean cut-through interface. That took a long while to get right. This isn't the first time we've run into serious snags and I know we'll get through it.
It pretty much amounts to this bad boy:

```shell
untilfail() {
    count=0
    while "$@"; do
        (( count++ ))
        echo "###### RUN COUNT: $count ######"
    done && say "failed after $count runs"
}
```

I run it like this and wait until the error occurs:
Note: on my old machine this fails reliably within ~40 runs; on my new, much faster machine it ~never fails.
Hacky fix: DanGould/pull/8. Opening up some follow-up issues.
Force-pushed from 2a99b06 to f090fe3
Pull Request Test Coverage Report for Build 12143130009
💛 - Coveralls
Pulled @DanGould's and @nothingmuch's changes into this PR. I ran tests locally with those changes on my old machine for ~1 hour (178 run-throughs) with no failures. TODOs:
To do this more cleanly, I guess I would suggest introducing some kind of … #421 should probably be addressed as part of such a cleanup as well.
Force-pushed from 629391c to 834859c
Force-pushed from 834859c to 7765327
Ensure that the `db: testcontainer::Container` variable does not go out of scope while the directory is running.

Previously the directory task itself was awaited on by init_directory, whereas in the modified code it is instead returned as part of the result due to the different return value of listen_tcp_with_tls_on_free_port. This indirection de-coupled the db variable's lifetime from that of the directory, allowing it to go out of scope earlier than expected.

Co-authored-by: Yuval Kogman <[email protected]>
Force-pushed from 7765327 to 4ee1faa
This must be a priority for test fixtures downstream in payjoin-ffi, binding to #422, so that we can replicate these tests in bound implementations.
This change seems to reduce the frequency of the lock contention on TCP sockets in tests but doesn't fix it entirely and I'm not yet sure why. Leaving in Draft status while I debug further.