Fix flaky test test_interaction_with_local_cache_dir #382

pjbull · 2023-11-24T17:59:45Z

Tried to repro the failing test (e.g., here and here) on a Windows machine, but the test suite passes consistently for me across dozens of executions.

Here's the stack trace from the failure:

tests\test_caching.py:221: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cloudpathlib\cloudpath.py:552: in open
    self._refresh_cache(force_overwrite_from_cloud=force_overwrite_from_cloud)
cloudpathlib\cloudpath.py:1120: in _refresh_cache
    self.download_to(self._local)
cloudpathlib\cloudpath.py:893: in download_to
    return self.client._download_file(self, destination)
cloudpathlib\s3\s3client.py:154: in _download_file
    obj.download_file(
tests\mock_clients\mock_s3.py:96: in download_file
    to_path.write_bytes(self.path.read_bytes())
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\pathlib.py:1245: in write_bytes
    with self.open(mode='wb') as f:
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\pathlib.py:1222: in open
    return io.open(self, mode, buffering, encoding, errors, newline,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/popen-gw0/test_interaction_with_local_ca0/bucket/L3YT2kNHDZHcaBmeV9uVJ4-test_caching-test_interaction_with_local_cache_dir/dir_0/file0_0.txt')
name = 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\popen-gw0\\test_interaction_with_local_ca0\\bucket\\L3YT2kNHDZHcaBmeV9uVJ4-test_caching-test_interaction_with_local_cache_dir\\dir_0\\file0_0.txt'
flags = 33665, mode = 438

    def _opener(self, name, flags, mode=0o666):
        # A stub for the opener argument to built-in open()
>       return self._accessor.open(self, flags, mode)
E       FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\popen-gw0\\test_interaction_with_local_ca0\\bucket\\L3YT2kNHDZHcaBmeV9uVJ4-test_caching-test_interaction_with_local_cache_dir\\dir_0\\file0_0.txt'

C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\pathlib.py:1078: FileNotFoundError

From the error messages, we can see that to_path is the one that doesn't exist yet. If it is a race condition, the existing sleep in the tests is not at the right location for fixing the problem.

Instead, we should sleep between mkdir and write_bytes in the mock backends.

Made the same fix for S3 and GS. Azure has different mock backend code here.

Hopefully this is a more stable fix.

I've been re-running the GH actions suite to test stability of this change. With a 1s total wait, we saw the flaky error on the third run. Increased to a 5s total wait and am testing again.

github-actions · 2023-11-24T18:01:16Z

🚀 Deployed on https://deploy-preview-382--gallant-agnesi-5f7bb3.netlify.app

codecov · 2023-11-24T18:03:19Z

Codecov Report

Merging #382 (80ebc2c) into master (945a352) will increase coverage by 0.1%.
The diff coverage is n/a.

Additional details and impacted files

@@           Coverage Diff            @@
##           master    #382     +/-   ##
========================================
+ Coverage    93.7%   93.8%   +0.1%     
========================================
  Files          22      22             
  Lines        1563    1563             
========================================
+ Hits         1465    1467      +2     
+ Misses         98      96      -2

see 1 file with indirect coverage changes

jayqi · 2023-11-26T17:21:53Z

This is super bizarre, and I'm wondering if there is an obscure bug in Python or the Python standard library. The Windows runner is using an old version of Python 3.8 (3.8.10 from May 2021, vs. latest 3.8.18 stable on other runners)

pjbull · 2023-12-10T18:33:46Z

New hypothesis. Here is an example path:
/tmp/pytest-of-runner/pytest-0/popen-gw0/test_interaction_with_local_ca0/bucket/GZaejri6AmEPKPJCb6WNuC-test_caching-test_interaction_with_local_cache_dir/dir_0/file0_0.txt

I think maybe on different rigs we end up with race conditions up to /tmp/pytest-of-runner/pytest-0/popen-gw0/test_interaction_with_local_ca0/ so that dir gets deleted in the course of one test, even when it is still needed by another. I think maybe the truncation of the test name to test_interaction_with_local_ca0 may cause the collision.

So, I shortened the test name.

Ran 3 times, all passed.....

pjbull · 2023-12-10T19:40:31Z

@jayqi maybe this is good to go once this re-run passes? 🤯

Ran 3 times with no failures at ed56cc5.

Just cleaned up the debugging code from there. When this run finishes, it will be the 4th full suite to pass in a row.

jayqi · 2023-12-11T03:17:44Z

This was a good hypothesis, but from looking at the example from this issue, it seems like there is an incrementing integer that gets added after the truncation. So I think in theory we should be seeing test_interaction_with_local_ca0, test_interaction_with_local_ca1, ...

pjbull · 2023-12-11T03:35:41Z

Yeah, I wouldn't say I really have a handle on the actual root cause since I've still only ever been able to repro it on GHA and never locally on any OS.

It seems the directory is getting deleted out from under the test—probably by another process. As you point out, it is supposed to increment, but maybe some interaction between that and the fact we do our tests with multiprocessing and that we delete the cache directory for the client within the test itself make it so the increment code checks for existence first and doesn't increment. E.g., if it checks the dir exists but got deleted already in the test (or emptied by the test and some other process is cleaning up empty temp dirs). Even multiprocessing as the source doesn't make a ton of sense since the gw0 should isolate the dir to that worker.

The other test name related issue I thought about is that the paths get too long for some limit on the GH runners. At least one I checked was within ~15 characters of the Windows limit. Still, I would expect to be able to repro this on a Windows system then, and I wasn't.

The final piece of weirdness from this exploration is that it worked when I had fewer combinations of Python versions / OSes, then started failing again when I added them all back in.

The only other idea I have kicking around is that none of the debugging info we have gotten is actually diagnostic. It is due to some GHA resource availability glitch that happens somewhat randomly and any of the trends in the debugging runs is just noise.

So, all that said, I'd say let's merge this and 🤞. If it somehow at least makes the tests more stable, that's a win. The number of passes in a row with this change is better than we've gotten in a while.

jayqi · 2023-12-11T03:59:14Z

Just to post some experimentation:

def test_long_name_abcdefghijklmonpqrstuvwxyz(rig, tmpdir):
    assert False

❯ pytest tests/test_long_tmp_path.py -v | grep "tmpdir ="
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw0/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw1/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw5/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw3/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw2/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw4/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw0/test_long_name_abcdefghijklmon1')

So because of the multiprocessing, it does look like the integer value on test_long_name_abcdefghijklmon0 is per process and would otherwise collide, but the popen-gw0 is numbered by processor and so it should all actually be fine.

jayqi

Doesn't hurt, I guess. Maybe it would be helpful to put in a docstring with a more verbose description like the old test name though?

pjbull · 2024-02-17T22:25:05Z

Ugh, for the record just saw this again with a different rig for the same tests. (logs below):
https://github.com/drivendataorg/cloudpathlib/actions/runs/7944365402/job/21690147726#step:5:1145

Passing now with this outrageous commit:
f292ecb

Weirdly, with all the reading/writing we do in the test suite, this particular test continues to crop up this error on Windows sometimes.

_________________________ test_loc_dir[/local_s3_rig] _________________________
[gw0] win32 -- Python 3.12.2 C:\hostedtoolcache\windows\Python\3.12.2\x64\python.exe

rig = <tests.conftest.CloudProviderTestRig object at 0x00000169B91C70B0>
tmpdir = local('C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\popen-gw0\\test_loc_dir__local_s3_rig_0')

    def test_loc_dir(rig: CloudProviderTestRig, tmpdir):
        """Tests that local cache dir is used when specified and works'
        with the different cache modes.
    
        Used to be called `test_interaction_with_local_cache_dir` but
        maybe that test name caused problems (see #382).
        """
        # cannot instantiate persistent without local file dir
        with pytest.raises(InvalidConfigurationException):
            client = rig.client_class(
                file_cache_mode=FileCacheMode.persistent, **rig.required_client_kwargs
            )
    
        # automatically set to persitent if not specified
        client = rig.client_class(local_cache_dir=tmpdir, **rig.required_client_kwargs)
        assert client.file_cache_mode == FileCacheMode.persistent
    
        # test setting close_file explicitly works
        client = rig.client_class(
            local_cache_dir=tmpdir,
            file_cache_mode=FileCacheMode.close_file,
            **rig.required_client_kwargs,
        )
        cp = rig.create_cloud_path("dir_0/file0_0.txt", client=client)
        assert cp.client.file_cache_mode == FileCacheMode.close_file
    
        # download from cloud into the cache
        # must use open for close_file mode
        with cp.open("r") as f:
            _ = f.read()
    
        assert not cp._local.exists()
    
        # setting cloudpath_object still works
        client = rig.client_class(
            local_cache_dir=tmpdir,
            file_cache_mode=FileCacheMode.cloudpath_object,
            **rig.required_client_kwargs,
        )
        cp = rig.create_cloud_path("dir_0/file0_0.txt", client=client)
        assert cp.client.file_cache_mode == FileCacheMode.cloudpath_object
    
        # download from cloud into the cache
        with cp.open("r") as f:
            _ = f.read()
    
        assert cp._local.exists()
    
        cache_path = cp._local
        del cp
    
        assert not cache_path.exists()
    
        # setting tmp_dir still works
        client = rig.client_class(
            local_cache_dir=tmpdir, file_cache_mode=FileCacheMode.tmp_dir, **rig.required_client_kwargs
        )
        cp = rig.create_cloud_path("dir_0/file0_0.txt", client=client)
        assert cp.client.file_cache_mode == FileCacheMode.tmp_dir
    
        # download from cloud into the cache
>       with cp.open("r") as f:

tests\test_caching.py:241: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cloudpathlib\cloudpath.py:559: in open
    self._refresh_cache(force_overwrite_from_cloud=force_overwrite_from_cloud)
cloudpathlib\cloudpath.py:1121: in _refresh_cache
    self.download_to(self._local)
cloudpathlib\cloudpath.py:900: in download_to
    return self.client._download_file(self, destination)
cloudpathlib\local\localclient.py:66: in _download_file
    shutil.copyfile(self._cloud_path_to_local(cloud_path), local_path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

src = WindowsPath('C:/Users/RUNNER~1/AppData/Local/Temp/tmp5wpvexyu/bucket/GitAoVzvMXaBegZUVUefY8-test_caching-test_loc_dir/dir_0/file0_0.txt')
dst = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/popen-gw0/test_loc_dir__local_s3_rig_0/bucket/GitAoVzvMXaBegZUVUefY8-test_caching-test_loc_dir/dir_0/file0_0.txt')

    def copyfile(src, dst, *, follow_symlinks=True):
        """Copy data from src to dst in the most efficient way possible.
    
        If follow_symlinks is not set and src is a symbolic link, a new
        symlink will be created instead of copying the file it points to.
    
        """
        sys.audit("shutil.copyfile", src, dst)
    
        if _samefile(src, dst):
            raise SameFileError("{!r} and {!r} are the same file".format(src, dst))
    
        file_size = 0
        for i, fn in enumerate([src, dst]):
            try:
                st = _stat(fn)
            except OSError:
                # File most likely does not exist
                pass
            else:
                # XXX What about other special files? (sockets, devices...)
                if stat.S_ISFIFO(st.st_mode):
                    fn = fn.path if isinstance(fn, os.DirEntry) else fn
                    raise SpecialFileError("`%s` is a named pipe" % fn)
                if _WINDOWS and i == 0:
                    file_size = st.st_size
    
        if not follow_symlinks and _islink(src):
            os.symlink(os.readlink(src), dst)
        else:
            with open(src, 'rb') as fsrc:
                try:
>                   with open(dst, 'wb') as fdst:
E                   FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\popen-gw0\\test_loc_dir__local_s3_rig_0\\bucket\\GitAoVzvMXaBegZUVUefY8-test_caching-test_loc_dir\\dir_0\\file0_0.txt'

pjbull added 2 commits November 24, 2023 17:55

try fixing tests

1396d9c

imports

9ba0e95

github-actions bot temporarily deployed to pull request November 24, 2023 18:01 Inactive

pjbull requested a review from jayqi November 24, 2023 18:22

longer, plz

dd77ad9

github-actions bot temporarily deployed to pull request November 24, 2023 19:28 Inactive

double it

c452df2

github-actions bot temporarily deployed to pull request November 24, 2023 19:48 Inactive

debug code

11e0069

github-actions bot temporarily deployed to pull request November 25, 2023 18:18 Inactive

lint

e3b9bba

github-actions bot temporarily deployed to pull request November 25, 2023 18:22 Inactive

pjbull added 3 commits December 3, 2023 17:15

more debug logs

8aefb6c

just windows

bae0ac5

skip lint

bb08c64

github-actions bot temporarily deployed to pull request December 4, 2023 01:18 Inactive

github-actions bot temporarily deployed to pull request December 4, 2023 01:19 Inactive

no extra tests

b0445ed

github-actions bot temporarily deployed to pull request December 4, 2023 01:22 Inactive

all versions of py

5d132c7

github-actions bot temporarily deployed to pull request December 4, 2023 02:13 Inactive

with linting

d3cc302

github-actions bot temporarily deployed to pull request December 4, 2023 05:37 Inactive

add back in other oses

ffb99e6

github-actions bot temporarily deployed to pull request December 4, 2023 06:08 Inactive

shorten test name...

ed56cc5

github-actions bot temporarily deployed to pull request December 10, 2023 18:21 Inactive

revert unneeded changes

d8bfe44

github-actions bot temporarily deployed to pull request December 10, 2023 19:13 Inactive

imports and whitespace

73c9d3b

github-actions bot temporarily deployed to pull request December 10, 2023 19:15 Inactive

jayqi approved these changes Dec 11, 2023

View reviewed changes

add docstring to test

80ebc2c

github-actions bot temporarily deployed to pull request December 11, 2023 06:04 Inactive

pjbull merged commit c5eaf7e into master Dec 11, 2023
27 checks passed

pjbull deleted the flaky-test-write branch December 11, 2023 06:21

pjbull mentioned this pull request Dec 11, 2023

Failed build: tests #381

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix flaky test test_interaction_with_local_cache_dir #382

Fix flaky test test_interaction_with_local_cache_dir #382

pjbull commented Nov 24, 2023 •

edited

Loading

github-actions bot commented Nov 24, 2023 •

edited

Loading

codecov bot commented Nov 24, 2023 •

edited

Loading

jayqi commented Nov 26, 2023

pjbull commented Dec 10, 2023 •

edited

Loading

pjbull commented Dec 10, 2023 •

edited

Loading

jayqi commented Dec 11, 2023

pjbull commented Dec 11, 2023 •

edited

Loading

jayqi commented Dec 11, 2023

jayqi left a comment

pjbull commented Feb 17, 2024

Fix flaky test test_interaction_with_local_cache_dir #382

Fix flaky test test_interaction_with_local_cache_dir #382

Conversation

pjbull commented Nov 24, 2023 • edited Loading

github-actions bot commented Nov 24, 2023 • edited Loading

codecov bot commented Nov 24, 2023 • edited Loading

Codecov Report

jayqi commented Nov 26, 2023

pjbull commented Dec 10, 2023 • edited Loading

pjbull commented Dec 10, 2023 • edited Loading

jayqi commented Dec 11, 2023

pjbull commented Dec 11, 2023 • edited Loading

jayqi commented Dec 11, 2023

jayqi left a comment

Choose a reason for hiding this comment

pjbull commented Feb 17, 2024

pjbull commented Nov 24, 2023 •

edited

Loading

github-actions bot commented Nov 24, 2023 •

edited

Loading

codecov bot commented Nov 24, 2023 •

edited

Loading

pjbull commented Dec 10, 2023 •

edited

Loading

pjbull commented Dec 10, 2023 •

edited

Loading

pjbull commented Dec 11, 2023 •

edited

Loading