-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flaky test test_interaction_with_local_cache_dir #382
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## master #382 +/- ##
========================================
+ Coverage 93.7% 93.8% +0.1%
========================================
Files 22 22
Lines 1563 1563
========================================
+ Hits 1465 1467 +2
+ Misses 98 96 -2 |
This is super bizarre, and I'm wondering if there is an obscure bug in Python or the Python standard library. The Windows runner is using an old version of Python 3.8 (3.8.10 from May 2021, vs. latest 3.8.18 stable on other runners) |
New hypothesis. Here is an example path: I think maybe on different rigs we end up with race conditions up to So, I shortened the test name. Ran 3 times, all passed..... |
This was a good hypothesis, but from looking at the example from this issue, it seems like there is an incrementing integer that gets added after the truncation. So I think in theory we should be seeing |
Yeah, I wouldn't say I really have a handle on the actual root cause since I've still only ever been able to repro it on GHA and never locally on any OS. It seems the directory is getting deleted out from under the test—probably by another process. As you point out, it is supposed to increment, but maybe some interaction between that and the fact we do our tests with multiprocessing and that we delete the cache directory for the client within the test itself make it so the increment code checks for existence first and doesn't increment. E.g., if it checks the dir exists but got deleted already in the test (or emptied by the test and some other process is cleaning up empty temp dirs). Even multiprocessing as the source doesn't make a ton of sense since the The other test name related issue I thought about is that the paths get too long for some limit on the GH runners. At least one I checked was within ~15 characters of the Windows limit. Still, I would expect to be able to repro this on a Windows system then, and I wasn't. The final piece of weirdness from this exploration is that it worked when I had fewer combinations of Python versions / OSes, then started failing again when I added them all back in. The only other idea I have kicking around is that none of the debugging info we have gotten is actually diagnostic. It is due to some GHA resource availability glitch that happens somewhat randomly and any of the trends in the debugging runs is just noise. So, all that said, I'd say let's merge this and 🤞. If it somehow at least makes the tests more stable, that's a win. The number of passes in a row with this change is better than we've gotten in a while. |
Just to post some experimentation: def test_long_name_abcdefghijklmonpqrstuvwxyz(rig, tmpdir):
assert False
So because of the multiprocessing, it does look like the integer value on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't hurt, I guess. Maybe it would be helpful to put in a docstring with a more verbose description like the old test name though?
Ugh, for the record just saw this again with a different rig for the same tests. (logs below): Passing now with this outrageous commit: Weirdly, with all the reading/writing we do in the test suite, this particular test continues to crop up this error on Windows sometimes.
|
Tried to repro the failing test (e.g., here and here) on a Windows machine, but the test suite passes consistently for me across dozens of executions.
Here's the stack trace from the failure:
From the error messages, we can see that
to_path
is the one that doesn't exist yet. If it is a race condition, the existingsleep
in the tests is not at the right location for fixing the problem.Instead, we should sleep between
mkdir
andwrite_bytes
in the mock backends.Made the same fix for S3 and GS. Azure has different mock backend code here.
Hopefully this is a more stable fix.
I've been re-running the GH actions suite to test stability of this change. With a 1s total wait, we saw the flaky error on the third run. Increased to a 5s total wait and am testing again.