Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky test test_interaction_with_local_cache_dir #382

Merged
merged 17 commits into from
Dec 11, 2023
Merged

Conversation

pjbull
Copy link
Member

@pjbull pjbull commented Nov 24, 2023

Tried to repro the failing test (e.g., here and here) on a Windows machine, but the test suite passes consistently for me across dozens of executions.

Here's the stack trace from the failure:

tests\test_caching.py:221: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cloudpathlib\cloudpath.py:552: in open
    self._refresh_cache(force_overwrite_from_cloud=force_overwrite_from_cloud)
cloudpathlib\cloudpath.py:1120: in _refresh_cache
    self.download_to(self._local)
cloudpathlib\cloudpath.py:893: in download_to
    return self.client._download_file(self, destination)
cloudpathlib\s3\s3client.py:154: in _download_file
    obj.download_file(
tests\mock_clients\mock_s3.py:96: in download_file
    to_path.write_bytes(self.path.read_bytes())
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\pathlib.py:1245: in write_bytes
    with self.open(mode='wb') as f:
C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\pathlib.py:1222: in open
    return io.open(self, mode, buffering, encoding, errors, newline,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/popen-gw0/test_interaction_with_local_ca0/bucket/L3YT2kNHDZHcaBmeV9uVJ4-test_caching-test_interaction_with_local_cache_dir/dir_0/file0_0.txt')
name = 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\popen-gw0\\test_interaction_with_local_ca0\\bucket\\L3YT2kNHDZHcaBmeV9uVJ4-test_caching-test_interaction_with_local_cache_dir\\dir_0\\file0_0.txt'
flags = 33665, mode = 438

    def _opener(self, name, flags, mode=0o666):
        # A stub for the opener argument to built-in open()
>       return self._accessor.open(self, flags, mode)
E       FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\popen-gw0\\test_interaction_with_local_ca0\\bucket\\L3YT2kNHDZHcaBmeV9uVJ4-test_caching-test_interaction_with_local_cache_dir\\dir_0\\file0_0.txt'

C:\hostedtoolcache\windows\Python\3.8.10\x64\lib\pathlib.py:1078: FileNotFoundError

From the error messages, we can see that to_path is the one that doesn't exist yet. If it is a race condition, the existing sleep in the tests is not at the right location for fixing the problem.

Instead, we should sleep between mkdir and write_bytes in the mock backends.

Made the same fix for S3 and GS. Azure has different mock backend code here.

Hopefully this is a more stable fix.


I've been re-running the GH actions suite to test stability of this change. With a 1s total wait, we saw the flaky error on the third run. Increased to a 5s total wait and am testing again.

Copy link
Contributor

github-actions bot commented Nov 24, 2023

@github-actions github-actions bot temporarily deployed to pull request November 24, 2023 18:01 Inactive
Copy link

codecov bot commented Nov 24, 2023

Codecov Report

Merging #382 (80ebc2c) into master (945a352) will increase coverage by 0.1%.
The diff coverage is n/a.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #382     +/-   ##
========================================
+ Coverage    93.7%   93.8%   +0.1%     
========================================
  Files          22      22             
  Lines        1563    1563             
========================================
+ Hits         1465    1467      +2     
+ Misses         98      96      -2     

see 1 file with indirect coverage changes

@pjbull pjbull requested a review from jayqi November 24, 2023 18:22
@github-actions github-actions bot temporarily deployed to pull request November 24, 2023 19:28 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 24, 2023 19:48 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 25, 2023 18:18 Inactive
@github-actions github-actions bot temporarily deployed to pull request November 25, 2023 18:22 Inactive
@jayqi
Copy link
Member

jayqi commented Nov 26, 2023

This is super bizarre, and I'm wondering if there is an obscure bug in Python or the Python standard library. The Windows runner is using an old version of Python 3.8 (3.8.10 from May 2021, vs. latest 3.8.18 stable on other runners)

@github-actions github-actions bot temporarily deployed to pull request December 4, 2023 01:18 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 4, 2023 01:19 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 4, 2023 01:22 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 4, 2023 02:13 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 4, 2023 05:37 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 4, 2023 06:08 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 10, 2023 18:21 Inactive
@pjbull
Copy link
Member Author

pjbull commented Dec 10, 2023

New hypothesis. Here is an example path:
/tmp/pytest-of-runner/pytest-0/popen-gw0/test_interaction_with_local_ca0/bucket/GZaejri6AmEPKPJCb6WNuC-test_caching-test_interaction_with_local_cache_dir/dir_0/file0_0.txt

I think maybe on different rigs we end up with race conditions up to /tmp/pytest-of-runner/pytest-0/popen-gw0/test_interaction_with_local_ca0/ so that dir gets deleted in the course of one test, even when it is still needed by another. I think maybe the truncation of the test name to test_interaction_with_local_ca0 may cause the collision.

So, I shortened the test name.

Ran 3 times, all passed.....

@github-actions github-actions bot temporarily deployed to pull request December 10, 2023 19:13 Inactive
@github-actions github-actions bot temporarily deployed to pull request December 10, 2023 19:15 Inactive
@pjbull
Copy link
Member Author

pjbull commented Dec 10, 2023

@jayqi maybe this is good to go once this re-run passes? 🤯

Ran 3 times with no failures at ed56cc5.

Just cleaned up the debugging code from there. When this run finishes, it will be the 4th full suite to pass in a row.

@jayqi
Copy link
Member

jayqi commented Dec 11, 2023

This was a good hypothesis, but from looking at the example from this issue, it seems like there is an incrementing integer that gets added after the truncation. So I think in theory we should be seeing test_interaction_with_local_ca0, test_interaction_with_local_ca1, ...

@pjbull
Copy link
Member Author

pjbull commented Dec 11, 2023

Yeah, I wouldn't say I really have a handle on the actual root cause since I've still only ever been able to repro it on GHA and never locally on any OS.

It seems the directory is getting deleted out from under the test—probably by another process. As you point out, it is supposed to increment, but maybe some interaction between that and the fact we do our tests with multiprocessing and that we delete the cache directory for the client within the test itself make it so the increment code checks for existence first and doesn't increment. E.g., if it checks the dir exists but got deleted already in the test (or emptied by the test and some other process is cleaning up empty temp dirs). Even multiprocessing as the source doesn't make a ton of sense since the gw0 should isolate the dir to that worker.

The other test name related issue I thought about is that the paths get too long for some limit on the GH runners. At least one I checked was within ~15 characters of the Windows limit. Still, I would expect to be able to repro this on a Windows system then, and I wasn't.

The final piece of weirdness from this exploration is that it worked when I had fewer combinations of Python versions / OSes, then started failing again when I added them all back in.

The only other idea I have kicking around is that none of the debugging info we have gotten is actually diagnostic. It is due to some GHA resource availability glitch that happens somewhat randomly and any of the trends in the debugging runs is just noise.

So, all that said, I'd say let's merge this and 🤞. If it somehow at least makes the tests more stable, that's a win. The number of passes in a row with this change is better than we've gotten in a while.

@jayqi
Copy link
Member

jayqi commented Dec 11, 2023

Just to post some experimentation:

def test_long_name_abcdefghijklmonpqrstuvwxyz(rig, tmpdir):
    assert False
❯ pytest tests/test_long_tmp_path.py -v | grep "tmpdir ="
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw0/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw1/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw5/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw3/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw2/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw4/test_long_name_abcdefghijklmon0')
tmpdir = local('/private/var/folders/nq/vp3dgt812jgb0q09rh5l706c0000gn/T/pytest-of-jqi/pytest-8/popen-gw0/test_long_name_abcdefghijklmon1')

So because of the multiprocessing, it does look like the integer value on test_long_name_abcdefghijklmon0 is per process and would otherwise collide, but the popen-gw0 is numbered by processor and so it should all actually be fine.

Copy link
Member

@jayqi jayqi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't hurt, I guess. Maybe it would be helpful to put in a docstring with a more verbose description like the old test name though?

@github-actions github-actions bot temporarily deployed to pull request December 11, 2023 06:04 Inactive
@pjbull pjbull merged commit c5eaf7e into master Dec 11, 2023
27 checks passed
@pjbull pjbull deleted the flaky-test-write branch December 11, 2023 06:21
@pjbull pjbull mentioned this pull request Dec 11, 2023
@pjbull
Copy link
Member Author

pjbull commented Feb 17, 2024

Ugh, for the record just saw this again with a different rig for the same tests. (logs below):
https://github.com/drivendataorg/cloudpathlib/actions/runs/7944365402/job/21690147726#step:5:1145

Passing now with this outrageous commit:
f292ecb

Weirdly, with all the reading/writing we do in the test suite, this particular test continues to crop up this error on Windows sometimes.

_________________________ test_loc_dir[/local_s3_rig] _________________________
[gw0] win32 -- Python 3.12.2 C:\hostedtoolcache\windows\Python\3.12.2\x64\python.exe

rig = <tests.conftest.CloudProviderTestRig object at 0x00000169B91C70B0>
tmpdir = local('C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\popen-gw0\\test_loc_dir__local_s3_rig_0')

    def test_loc_dir(rig: CloudProviderTestRig, tmpdir):
        """Tests that local cache dir is used when specified and works'
        with the different cache modes.
    
        Used to be called `test_interaction_with_local_cache_dir` but
        maybe that test name caused problems (see #382).
        """
        # cannot instantiate persistent without local file dir
        with pytest.raises(InvalidConfigurationException):
            client = rig.client_class(
                file_cache_mode=FileCacheMode.persistent, **rig.required_client_kwargs
            )
    
        # automatically set to persitent if not specified
        client = rig.client_class(local_cache_dir=tmpdir, **rig.required_client_kwargs)
        assert client.file_cache_mode == FileCacheMode.persistent
    
        # test setting close_file explicitly works
        client = rig.client_class(
            local_cache_dir=tmpdir,
            file_cache_mode=FileCacheMode.close_file,
            **rig.required_client_kwargs,
        )
        cp = rig.create_cloud_path("dir_0/file0_0.txt", client=client)
        assert cp.client.file_cache_mode == FileCacheMode.close_file
    
        # download from cloud into the cache
        # must use open for close_file mode
        with cp.open("r") as f:
            _ = f.read()
    
        assert not cp._local.exists()
    
        # setting cloudpath_object still works
        client = rig.client_class(
            local_cache_dir=tmpdir,
            file_cache_mode=FileCacheMode.cloudpath_object,
            **rig.required_client_kwargs,
        )
        cp = rig.create_cloud_path("dir_0/file0_0.txt", client=client)
        assert cp.client.file_cache_mode == FileCacheMode.cloudpath_object
    
        # download from cloud into the cache
        with cp.open("r") as f:
            _ = f.read()
    
        assert cp._local.exists()
    
        cache_path = cp._local
        del cp
    
        assert not cache_path.exists()
    
        # setting tmp_dir still works
        client = rig.client_class(
            local_cache_dir=tmpdir, file_cache_mode=FileCacheMode.tmp_dir, **rig.required_client_kwargs
        )
        cp = rig.create_cloud_path("dir_0/file0_0.txt", client=client)
        assert cp.client.file_cache_mode == FileCacheMode.tmp_dir
    
        # download from cloud into the cache
>       with cp.open("r") as f:

tests\test_caching.py:241: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cloudpathlib\cloudpath.py:559: in open
    self._refresh_cache(force_overwrite_from_cloud=force_overwrite_from_cloud)
cloudpathlib\cloudpath.py:1121: in _refresh_cache
    self.download_to(self._local)
cloudpathlib\cloudpath.py:900: in download_to
    return self.client._download_file(self, destination)
cloudpathlib\local\localclient.py:66: in _download_file
    shutil.copyfile(self._cloud_path_to_local(cloud_path), local_path)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

src = WindowsPath('C:/Users/RUNNER~1/AppData/Local/Temp/tmp5wpvexyu/bucket/GitAoVzvMXaBegZUVUefY8-test_caching-test_loc_dir/dir_0/file0_0.txt')
dst = WindowsPath('C:/Users/runneradmin/AppData/Local/Temp/pytest-of-runneradmin/pytest-0/popen-gw0/test_loc_dir__local_s3_rig_0/bucket/GitAoVzvMXaBegZUVUefY8-test_caching-test_loc_dir/dir_0/file0_0.txt')

    def copyfile(src, dst, *, follow_symlinks=True):
        """Copy data from src to dst in the most efficient way possible.
    
        If follow_symlinks is not set and src is a symbolic link, a new
        symlink will be created instead of copying the file it points to.
    
        """
        sys.audit("shutil.copyfile", src, dst)
    
        if _samefile(src, dst):
            raise SameFileError("{!r} and {!r} are the same file".format(src, dst))
    
        file_size = 0
        for i, fn in enumerate([src, dst]):
            try:
                st = _stat(fn)
            except OSError:
                # File most likely does not exist
                pass
            else:
                # XXX What about other special files? (sockets, devices...)
                if stat.S_ISFIFO(st.st_mode):
                    fn = fn.path if isinstance(fn, os.DirEntry) else fn
                    raise SpecialFileError("`%s` is a named pipe" % fn)
                if _WINDOWS and i == 0:
                    file_size = st.st_size
    
        if not follow_symlinks and _islink(src):
            os.symlink(os.readlink(src), dst)
        else:
            with open(src, 'rb') as fsrc:
                try:
>                   with open(dst, 'wb') as fdst:
E                   FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\runneradmin\\AppData\\Local\\Temp\\pytest-of-runneradmin\\pytest-0\\popen-gw0\\test_loc_dir__local_s3_rig_0\\bucket\\GitAoVzvMXaBegZUVUefY8-test_caching-test_loc_dir\\dir_0\\file0_0.txt'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants